So, welcome everyone to the second lecture of 544, the course on the principles of database systems. So, let me see. Let me start with the announcements. I have a couple of announcements, and I need to switch between slides, which is not that easy, so bear with me here. First, actually, let me start by asking you: how many people have made significant progress on the first homework? About half, and I can't see the Microsoft side well. Can I see a show of hands? How many people have almost finished the first homework at Microsoft? Ah, there are the procrastinators. About half of the people in this room have made significant progress on the first homework. Let me ask the converse: how many people have not yet logged in or not gotten the data yet? Okay, please do this. If there are any problems, then you need to contact Jessica for the account. How many people are using their own servers? How many people do not rely on the SQL Server? Only one person. Okay. So, that's also an option. I know that some people in this class know SQL better than what you need to know in order to answer this homework. On the other hand, I hope you enjoy the existential and universal quantifiers and also some of the group bys. I know some of you are using very complicated, how should I call it, very advanced SQL constructs to answer some of the queries. I did not have those in mind when I designed the queries many years ago, but if you want to use advanced and rarely used SQL constructs, then feel free to do this. As long as they work on SQL Server, and the reason why they have to work on SQL Server is because this is how Jessica is going to grade them. She's going to run them on SQL Server, and probably she won't have time to look into it if they don't work. So, that's the first homework. The second homework is just a theoretical homework. Depending on where you stand, you might find it much easier or actually harder, depending on your own background. It is posted and you have to turn it in two weeks from yesterday, October 19th. You will be asked to do some ER diagrams, which are really easy for everybody, and to solve some relatively simple problems on decomposing a relation into Boyce-Codd Normal Form. I will cover both of these topics in today's lecture. There is one fun problem on this homework. You're given a data set, which is posted as an Excel spreadsheet, but it's actually a text file. It has about 1,000 records, and you have to reverse engineer the functional dependencies and normalize this data set. I will tell you a little bit during the lecture what this means. For that, you need to use a database system. You have two options. Either you use SQL Server, and now you have your own database whose name is your email address. There, you can create a database and you can play with it, and you need this for the second homework. Or you can already move to Postgres. You can install Postgres on your own machine and start using it. You can also use Postgres in the labs. Is anyone planning to use Postgres on the machines in the lab? You'd like to use this. Jessica, are the instructions there up to date? Please email me if they're not. Okay. We have Postgres running there, and you're welcome to use it. So you have a choice between these two options. Both are good for what comes next in this class.
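A rough illustration of what the fun problem involves: once the text file is loaded into a table, say R with columns a and b, where these names are made up, a candidate functional dependency a -> b holds on the data exactly when the following query returns no rows.
    SELECT a
    FROM R
    GROUP BY a
    HAVING COUNT(DISTINCT b) > 1;
Any rows it does return are the values of a that are associated with more than one b, that is, the violations. This runs as-is on both SQL Server and Postgres.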
You also have to read from the book some of these instructions, some of these commands for SQL: how to create tables, how to insert, how to modify tuples in tables. Another announcement: somebody asked me to have regular office hours, so I decided to schedule them for Thursdays, five to six. Any problems with this? On Wednesdays, sir? Wednesday. Easier. We can. Usually before class I want to focus on the class, but I see the point. It makes more sense to have them on Wednesdays. Okay. So there is this proposal to move this to Wednesday, five to six. Okay. Jessica, can you update? Thanks. Okay. Last time, we did not discuss nulls. I hope everybody knows what a null is. It's just the value that you can store in an attribute whenever you don't have a value for that attribute. When you first create a table, in the create table statement, this is when you can decide for each attribute whether you allow null values or you don't allow null values. It's sometimes an important decision. Can you give me one example of a kind of attribute where you never want to allow null values? A key. A key. Actually, the system will reject the record if the value of the key is null. Did you have a question here? Me? Yeah. No. Okay. So that's about nulls. Now, the interesting thing about nulls that I need to show you is this. Let's look at this query here. It's doing a group by. What it tries to do is join two tables, product and purchase. There is a foreign key: every purchase refers to a product name, which is a foreign key into product. The goal of this query is to compute, for one month, for September, the total number of sales for each product. For each product name, we want to know how many items of that product we sold in September. I claim there is something wrong with this query. What's wrong? Yes. You aren't going to get any rows for those products that had no purchases in September. Exactly. We will not get any answers for the products that did not sell in September. How can we fix this? Yes. Well, you have to first decide that this is actually a problem for you, right? Yes. First I've decided that it's a real problem. That's correct. I didn't make this point. But I think the scenario here is that your boss asks for this list. You don't want to have missing products in the list you send to your boss. Yes. That's a solution. We use an outer join. What is an outer join? Yes. It's a join where it returns nulls if there isn't a matching record in the second table. Exactly. In a regular join, every record from one table must match a record from the other table before we can output the pair, and it's symmetric: every record in the other table must match with a record here before we output this pair. In a left outer join, for the left table, which in this case sits here, a record from the left table will be included in the output anyway, even if it doesn't have a matching record on the right. If it has matching records, then it will be output once for each of the matching records. But if there is no matching record, then we output this record anyway. But what do we do with the missing fields, with the fields that were supposed to come from the other table? Null? There we put null. That's exactly the left outer join. On the slides, we have an example. This is the syntax. If you want to do a left outer join between product and purchase, then you need to use this particular syntax.
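Just as a rough sketch, not the exact slide, a left outer join version of the aggregate query we will get to in a moment could look like this; the column names prodName, month, and store are assumptions. Note that the September condition has to sit in the ON clause: if it stayed in a WHERE clause, the null-padded rows would be filtered right back out.
    SELECT Product.name, COUNT(Purchase.store) AS numSales
    FROM Product LEFT OUTER JOIN Purchase
         ON Product.name = Purchase.prodName
         AND Purchase.month = 'September'
    GROUP BY Product.name;
The reason COUNT(Purchase.store), rather than COUNT(*), is the right thing to count here is exactly the null-handling question discussed below.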
It's called left outer join, and then the join condition is not in the where clause, but in a new clause called on. Yes. You need to do something like sum and is-null trickery to make the numbers line up, right? Because I guess a count star on something where you have a null, sorry, a null store entry, is going to count as one, right? You're right. You're right. So this is something I didn't pay too much attention to. But right now, I want to illustrate the outer join. This is a left outer join. This is the semantics. Look at the two tables at the top. Here is product, and here is purchase. So if you look in product, OneClick didn't sell. We never sold OneClick. However, when you do the left outer join, OneClick will appear with null. That's the semantics. On the other hand, camera occurs twice, because camera was sold twice. So it's a strange semantics. Because of this bizarre semantics, they are difficult to optimize, left outer joins and right outer joins. They are no longer associative. It has this bizarre semantics that a tuple may occur, may be replicated many times, or it may be included exactly once if there is no matching tuple on the other side. Good. So returning to our aggregate query, in order to include all the products that did not sell in September, we need to do something like this. We need to replace the join with a left outer join. But then, as your colleague noticed, we need to be careful what we count on, what we count. So once you manipulate data with nulls, you need to be aware that every single SQL construct has its own semantics of how it handles nulls. Usually, the semantics is intuitive. You can just guess what the semantics is. But in case of doubt, you have to read the documentation. You have to read the SQL standard. What do you think would count do? What does count of something return if some attributes have nulls? How would you describe the semantics of count? Yes? It counts the values of whatever you're passing into count? So count normally counts how many values you're passing into count, right? But suppose we do count of store, like here. Suppose one record has a store value which is null. Is that counted? Well, there is no way to know. You have to read the documentation. But I can tell you that it's not going to be counted. So count of an attribute skips over the values of that attribute that are null. And that's good. That's exactly what we want. Because after we do the outer join, we know that for the products where there was no purchase, store will be null. And for those products, we just get zero. And that's exactly the semantics that we want. What happens if we put here count of star? Last time I argued that count of star and count of an attribute are the same thing. Are they still the same thing? No. Because if we put count star, then all the attributes are included when the system looks for nulls. And if one of them is not null, then it will be counted. And of course, some will be not null, because star includes all the attributes, and at least those coming from the left table are not null. OK, so this is how we use nulls and outer joins. And some of the questions on the homework require you to do exactly this construction. Any questions about nulls? Please also read about the three-valued logic, which is a few slides earlier. You can get it quickly by reading two slides. So now for today, I want to start the lecture with a discussion of this blog. How many people have read Mike Stonebraker's blog?
OK, like a good majority, I would say. So what he describes in this blog is a reply, an answer, to a movement called the NoSQL movement. The NoSQL movement says all relational databases are crap; we should use NoSQL databases instead. What does that mean? What is a NoSQL database? Yes? I thought the idea was that if you're not using these SQL features of a database, then why make it a SQL database in the first place? For instance, if you just need a tuple store, if you want to scale very high, tuple stores are going to be faster than a SQL database. So using that is going to improve your performance on any given set of hardware. We'll get there. Let's discuss this. But let's first define what a NoSQL database is. And he describes two types of databases, do you remember? Yes? It's usually a key value store. It's usually a key value store, which means you're storing objects that consist of a key, a single key, and then some value, which is a payload. It can be multiple attributes. It can be a blob, meaning binary large object. It can be some payload. He also describes a second kind. Document store. A document store. And to be honest, I didn't quite understand the subtle difference. Can somebody explain to me the subtle difference between key value stores and document stores? Do you know the difference? Yeah? I think the idea is that it's key value pairs associated with a document object, so that you can look it up across a couple of different attributes and get to a single document object. I see. So a document store would be like a key value store, but one of the attributes in the value is a document itself, which is a more special kind of data item. So what these database systems do is they give up tables and they give up SQL. They only allow you to store and access objects by the key value. And then the blog is a discussion of where that extra performance comes from, of why you are paying the cost of this lost performance if you're using a standard SQL relational database system, and what you can do about this. That is the essence of the blog. But early on, he distinguishes. He makes a very clear point about what kind of workload these NoSQL databases can support. And he defines two kinds of workloads for relational databases. I call them here the mystery x and mystery y. What are the two kinds of workloads in a relational database? What are x and y? Yes? Reading and writing. Particularly in the key value stores, very high volume inserts and very high volume but unrelated reads. So reading and writing are indeed two different operations, but I wouldn't call them workloads. A workload is something at a higher level. It means a broader description of how you access the data. Transactions. Yes, one of them is called OLTP. Let me write them down. Let me use advanced technology here. Oops. This is good. They are called OLTP, online transaction processing, and decision support. What do we mean when we say that a database is used for an OLTP workload? Online transaction processing. What kind? We have lots of inserts and processing of single records in the database. Decision support, you just read and process what you want. So I don't know if the people at Microsoft, did you hear? Can you hear when people talk in the room? Not really, so I need to repeat what he said. He said that OLTP means mostly inserts and updates, and decision support means lots of reads. And I think it goes in the right direction.
I would say that OLTP means lots of updates indeed, lots of insertions, deletions, and modifications, and also point reads: reading a single record, or looking up a record by the key value, or by some other indexed attribute. They are called online transaction processing because the typical example is that of bank accounts. If you think about bank accounts, what do you need to do? Well, a customer comes. He wants to withdraw money, so you need to access just one record, his account, and do one update. Maybe two updates. You need to move money from one place to another. So the workload consists of a large collection of simple modifications and simple accesses to the data. What is decision support? Complex queries. Complex queries. A query in decision support will not touch a single record or three records. A decision support workload means complex group bys with many joins. And there is no way you can get around it simply by hoping you're going to use an index and quickly get to the data that the query asks for, because the query is probably a complicated group by and then some aggregation and lots of joins in between. So obviously, NoSQL databases, these key-value stores, only support one of these workloads. Which one? OLTP. OLTP. This is where they are. So if you ever consider having decision support queries on your database, forget key-value stores. Then you must go with a relational database system or with a specialized database system, which is called a data warehouse, which is specialized for these kinds of queries. So then, he goes a little bit deeper and tries to explain. He agrees: key-value stores, the NoSQL databases which are key-value stores, are significantly more efficient than relational database systems. But then he goes deeper and tries to analyze why. It's not because of a bad design choice. Their design is motivated by certain needs, by certain functionalities that he analyzes in this blog. And I'd like to repeat a little bit. What is the first thing that he says there? He says something about stored procedures. Can you explain to me what he was trying to say with stored procedures? He's saying that the bottleneck is the transfer of the data between the database and the client. Put everything in a stored procedure, cut off the bottleneck, and everything improves. Exactly. The first bottleneck in a relational database system is that when your application needs to get just one record, think about the OLTP framework, when it needs to get one single record, that record is not handy. It's not in the same address space, not on the same server, not on the same machine as your application. Instead, it needs to contact the database server. Remember the client-server architecture? The application runs on the client. Now it needs to contact the server. This is a TCP/IP connection. It's a round trip. In addition, there is all the overhead of spawning a thread on the server, or maybe a process, there are two ways to implement relational database systems, lots of overhead, in addition to the TCP/IP connection, before it gets that record back. Yes, a question here. Would this be a latency problem more than a straight throughput problem? It's definitely a latency problem, absolutely. As for throughput, I don't have numbers. I usually don't work at such low-level systems. But throughput is pretty good. But remember, in OLTP it's the latency that gets you, because that's the definition of this workload.
Every access to the data retrieves a small piece of data. So the latency really gets you. So this is an important idea to keep in mind. Because of the client-server architecture, relational database systems suffer a huge penalty. Now, we don't have to have a client-server architecture. Do you know of database systems that are not based on the client-server architecture? Yes? SQLite 3? SQLite. Yeah. How many people have worked with or heard about SQLite? About three here. And at Microsoft? Not many hands for SQLite. It's a fun system. What it does is that the query processor sits in the same address space as your application. It's essentially a library. It's great, no? It's much more efficient. Why doesn't SQL Server do this? Yes? It scales very badly. In what sense? Well, you don't have replication across many instances. It has to live with one instance of the running application. Absolutely. You can't have replication. But even if you think about our IMDB movie database, that's not replicated. It's a single database. How would that work if we were to use SQLite instead for the IMDB? Yes? Well, everyone would have to have a copy of the database on their local server. Everybody must have a copy of the database, which is fine sometimes. But what happens if you do updates? What happens if your database is about bank accounts? Now you have a whole new set of problems, right? Because now all these copies of the bank accounts need to be synchronized. It's a nightmare. So keeping the data in one place, on a single server, is a major design decision of the relational database systems. And it has to be that way for certain reasons. Yes? So in the case of IMDB, does that have to be synchronized? Well, in IMDB, sure, we could give each of you a copy because we never update it. Assume that we did. Assuming that we did, then it would be a nightmare. Because if you update your copy, then that update needs to be replicated to all the other copies. And what happens if there are conflicts? And what happens if in the meantime somebody reads the record that you tried to update and reads a stale version? It's a nightmare. And I'll get to that in a second. Yes? And for a lot of that, is the big reason security concerns? That actually struck me, because the main reason Mike Stonebraker gave for why database systems use this client-server architecture, he said, is security. Well, he has a lot of experience. I suppose maybe he has a point here. But concurrency control strikes me as another major reason. The best way in which you can handle concurrent access to the database is if you have a single copy of the data in one place. So we have to live with this big delay, with a round trip, with a slow TCP/IP connection, and with starting a thread whenever we contact the database. How can we do better? Did you have a question? How can we reduce the impact of this connection cost? We are still within this framework; you can't say, I'm going to do group by queries because they touch a lot of data and then the throughput will be good. Yes? Batch processing? Batch processing, if you're running in a client-server architecture, then you will still have the same problem. On the other hand, batch processing might not be an option for certain applications. Yes? Separate the DSS database from OLTP? We definitely separate the DSS, decision support, database from OLTP. That's a good option. But the question is, how do we speed up the OLTP application? Move the load over to DSS?
But that doesn't solve the latency problem. Yes? Caching? Caching. Caching. Caching. It's not that easy. I don't know how caching would work between the server and the client. The client is like your teller machine, and every time a customer comes, he needs to access a different account. So caching doesn't quite help. What did you want to say? Oh, no, I was just saying that, normally, OLTP and decision support, you never have them together in the same environment anyway. So that's good practice, yeah. It's a good design decision to keep the database that serves an OLTP workload separate from the database that serves as a decision support database. But then you wonder, how can these be synchronized? How are they synchronized, by the way? You use a replica, normally, for the decision support. Sorry? You use a replica, or you use load balancing. You keep one of the partitions, you keep a replica partition, just for decision support. Right. So they are replicated. Yes? Actually, I think that decision support systems are shaped very differently than OLTP as well. So in a lot of instances, we've used ETL processes to actually transform things out of OLTP. Right, but you need somewhere to read it from in the first place. That's right. You first need it to populate the data warehouse. OK. Totally agree. So just to give you at Microsoft an update: we are having a lively discussion here about how this is done. Normally, what is done is that the decision support data is usually stored in a data warehouse. I'm going to write here DW, data warehouse. It's often a different schema, a different organization, and often a different database system that runs the decision support workload. And the data is transferred from the OLTP database system to the data warehouse on a regular basis, usually at night or on a weekly basis. And it's never completely up to date. And that's OK. I don't know about you, but when I do my personal banking at midnight, often the system says, sorry, we are updating, just when I'm supposed to do my personal banking. But that's life. What you see there is not the OLTP data. It's not data from their production OLTP database; it's data from a copy. Good. But we still didn't answer the major question. So forget about decision support. Let me erase it. That's gone. We only have the OLTP database. And we have a client. And this client needs to connect, often, to the OLTP database with these point accesses. Give me this record, update that record, check if the balance is bigger than so and so, and if so, then do the transfer. Several connections just for a minor transfer. How can we improve this? Yes. Why would you have several connections? Usually, you'd maintain one connection. Fair enough. But every time you need to do one of these actions, you need to send a SQL query. Yes, but you keep the same connection. That's just basic. Right. So there are tricks you can do at the connection level. You can keep the connection open. This means a single TCP connection. But you're still paying a relatively high round trip overhead. And the solution is called stored procedures. I'll say very little about stored procedures. They are essentially programs that are written in some arcane language. Actually, I've never used it. I don't know it. But they are stored on the database system. They live in that address space. So the idea is that you move some of the application logic from the address space of your client to the OLTP system.
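To make this concrete, here is a rough sketch, in SQL Server's T-SQL, of what such a stored procedure might look like for the transfer example discussed next; the table, columns, and procedure name are all made up, and a real version would need proper error handling.
    CREATE PROCEDURE TransferFunds
        @fromAcct INT, @toAcct INT, @amount DECIMAL(10,2)
    AS
    BEGIN
        BEGIN TRANSACTION;
        -- check that the source account has enough money
        IF (SELECT balance FROM Accounts WHERE acctId = @fromAcct) >= @amount
        BEGIN
            UPDATE Accounts SET balance = balance - @amount WHERE acctId = @fromAcct;
            UPDATE Accounts SET balance = balance + @amount WHERE acctId = @toAcct;
            COMMIT TRANSACTION;
        END
        ELSE
            ROLLBACK TRANSACTION;
    END
The client then makes a single call, something like EXEC TransferFunds @fromAcct = 1, @toAcct = 2, @amount = 100, instead of several round trips.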
A typical example: think about a transfer from one account to another. You know the amount of money to be transferred. You know the account number of the first account and the account number of the second account. You know all this. So now you need to do some simple application logic. Read the first account. Check if the amount there is greater than what you need. Update it with the difference. Read the second account. Update it with the sum. That's it. Five accesses to the database, or five steps. Instead of having five JDBC or ODBC calls between the client and the server, you store this piece of code on the server, and then you call the stored procedure only once, and it will execute this application logic at the database server. And that is a significant saving. It's good programming practice, when you write applications, to move some of your application logic to the server. OK, but this only gets you so far. At some point, the database server will be swamped. So what does Mike Stonebraker suggest that you can do if the database gets swamped? Yes? Sharding. Split up the key space across multiple servers. Sharding, he mentioned this term. Sharding means horizontal partitioning, something we will talk about next time, partitioning. But the idea is very simple. Think about a table, a very, very big table. You partition it horizontally. A typical example is: customers in Seattle will sit in this partition, customers in Bellevue will sit in that partition, and customers in New York will be in a different partition, in a different table. Now, sharding goes beyond that. In addition to the partitioning, sharding also means you store these partitions on different servers. And you often partition other tables as well, in order to isolate as much as possible each access to just one single partition. That's the idea in sharding. Now, traditional database systems apparently don't support this idea of sharding very well. But modern ones like Greenplum, and he mentioned a couple of others, which I don't know, do support sharding pretty well. Any comments so far? Did anybody here in Seattle use sharding? On what system? Some stuff at Amazon in a previous job. Basically sharded based on data that the client already had. So instead of going by location, which you don't know, you go by something you already have, an ID number. OK, so he mentioned an application where he did sharding, but not by location, by an ID number, by a different criterion. OK, so that's one way in which you can improve your database performance. It won't get you too far. And there is a problem here, because traditional database systems like MySQL don't support sharding. You have to implement this in your own application. I don't know whether SQL Server supports sharding. With some work, but yes. With some work. OK, so depending on how much you want to invest. However, these new startups, and I know about Greenplum because it has some connections to Berkeley, they do support sharding natively. OK, but then he goes even deeper. And he says that if you look carefully at where the time goes in these OLTP systems, where does a server spend its time? After you do the sharding, after you do the stored procedure thing, why does such a highly optimized database server still perform worse than a key value store? Where does this time go? Logging. He mentions four things. And let me see how I can do this. I'm going to erase everything. The first thing is logging. Then he talks about locking as well. Locking.
The third is called latching. And the fourth is buffer management. Logging. What is logging? It's something we are going to study in detail two weeks from now. And yes? Writing everything that you're doing in the database so that if something gets corrupted or crashes, you can recover and replay. Exactly. That's a beautiful, beautiful definition. Writing something in a specialized file called a log, such that if the system crashes, you are able to recover the data when you restart the system. It's a major source of inefficiency. In order for the logging to work, you must write ahead. It's called the write-ahead log. You can't postpone that write. When the time comes to do your logging action, you must force it to disk. And this accounts for, according to some research paper that he cites, 25% of the time spent by the server. Quite insightful. And actually, let me backtrack right here. If you're using a key value store that doesn't do logging, bingo, your performance improves by 25% just because of that. Do key value stores use logging? As far as you know, does yours do logging? Because Cassandra does most of its initial stuff in memory, so it doesn't force writes to disk immediately. So if you lose the server that you inserted that into, you can lose some data. So somebody in this room has experience with Cassandra, a key value store. And he said that by default, it does not do logging. But of course, then you don't get the ACID properties if you don't have logging. Locking. What is locking? That's easier. Lock the record before updating it. Read lock or write lock? Read lock or write lock, whenever you're updating or even when you read. If you want to read several records, you need to lock them so you don't allow others to update them, because then you would get an inconsistent view. Same thing: if you don't care about concurrency control, then sure, you can get rid of locking. Now, latching is a more technical term. It has to do with how you lock indexes, and the distinction, why we have two different techniques, is pretty subtle, and I hope we'll get to this when we talk about the recovery manager. But it's in the same spirit as locking. It's to ensure concurrency control. These two together are almost indistinguishable. And the last thing is the buffer manager. What is a buffer? It's memory where you keep copies. It's a cache. The buffer is exactly a cache. It's a cache of pages from disk. And the reason why we need a buffer is because the database is on disk. If it were a main memory database, you wouldn't need a buffer. Yes? Well, you put more than just cached pages in there; they also do stuff like query plans and temp tables and stuff like that in a main memory database. We can't put the query plan anywhere else. But there is no reason to. So the purpose of the buffer is precisely to be a cache between the disk and the database system. If you were to implement one, I don't have examples of main memory databases, but you can imagine that they get rid of the buffer. But let's discuss exactly what we mean by the buffer. Where does the time go? What happens? How can 25% of the time go in managing the buffer? What exactly goes wrong here? The buffer we haven't discussed yet. We're going to discuss it two weeks from now. It consists of pages, of blocks. They are called blocks. And they have fixed lengths. And the data here is organized awkwardly, because it's organized as it has to be when you store it on disk.
So if you need to access a record, you need to do a little bit of parsing, essentially. You need to find that record and identify its attributes. The records are of variable lengths, and you need to do some arithmetic to identify where the record is and where its attributes are. That adds to the overhead. Now, if you have a main memory database system, this can go away. Now your records are standard Java objects, or C++ objects, and you don't need to do this parsing at runtime in order to access your objects. So I found this quite interesting, quite insightful. This is where your time goes. If you turn all these things off, then, yeah, you get to the same performance as a key value store. But in many applications, you need the functionality that justified one or all four of these features. One of the other things that he talks about is single record transactions. Now, did we discuss transactions last time? We have three lectures for transactions, so don't worry, we'll get to that point. But we mentioned briefly what a transaction is. It's a set of operations that are executed atomically. Now, if a transaction is single record, it means that you can only update a single record, and that's all the transaction can guarantee. What happens if you want to transfer money from one account to another? Can you do this with a single record transaction? That's a typical example where you need to make sure that both accounts are updated in a consistent way. You can't update one account and then say, well, now if the system crashes, I'm OK, and then update the second account and say, if the system crashes, now I'm OK. No, you must ensure that both accounts are updated in an atomic way. And the problem with these key value stores is they support only single record transactions, and therefore they will not support the more complicated application logic. Any other thoughts about this blog? Yes? He talked about sharding and talks about one database system that's going to be supporting it better. And we have a fellow here who suggests that it's painful to do on MS SQL. And so I'd just say that the NoSQL things seem to still have a very reasonable place for something where you need the parallelism, because you just can't do that with a mainstream database system now. So it's certainly not the same thing, and certainly you wouldn't apply the same workload. But I don't know, he seemed to be kind of dismissive of NoSQL in a way that I thought was a little too harsh, because those systems still support some of the stuff that you need to get the load down. Yeah, so the point that your colleague made is that currently there are no widely available database systems that support, out of the box, the sharding you need in order to scale up to large volumes of data. He does mention some startups that offer such database systems, but come on, they're not widely available. So your colleague has this point: there is a clear role for the NoSQL databases out there, to scale up to much larger collections of data than standard database systems. Well, this is Mike Stonebraker. He wants to be controversial. And maybe we can discuss others of his controversial statements. I missed a controversial tone in this blog, perhaps because I was expecting more controversial statements. But anyway, I suppose we all get the idea.
NoSQL databases offer very limited functionality. If you know for sure your application only needs that functionality, then you can go with them. If you need anything extra, then start with a relational database system. Any more comments? OK, good. So this took me much longer than I thought, but I think it was a great discussion. How do I go back? So for today, we will discuss two parts of the conceptual design activity: how to design entity relationship diagrams and how to convert them into relations. And then we will discuss traditional normalization theory, how to make sure that the design avoids anomalies. So let me go a little bit faster over the database design. The idea here is that when you first start a project that needs a database, we need to figure out what goes into that database. What are the objects? Another terminology for objects is entities. Objects is the programming-language word; the database world calls them entities. What are their attributes? And how do they relate to each other? Then you need to document this somehow before you create tables. You need to be able to document this. And this is done in some diagrams. They are called entity relationship diagrams. And that's what I'm going to show you now. These diagrams talk about three things. They talk about entities. These are objects. Collections of similar entities are called entity sets. Think of them as classes. So this is one thing, entity sets. Then we have attributes. And then we have relationships. So let me start with a simple example. Suppose you need to design a database that maintains information about products, about the companies that make these products, and about people. These people can be either employees at these companies or maybe they are customers of these products. They serve a double role. And these are the three types of objects. These are the three classes, or the three entity sets, that you need to store. The next thing you need to worry about is the attributes. What are their attributes? Well, you know how to get there. You just need to find out what attributes each object needs to have, and you add them. And now your design looks like this. You need to decide, for every entity set, which attribute is a key. In this formalism of entity relationship diagrams, every entity set must have a key, and you underline it. And if I sometimes forget a key, remember it actually has to be there. And finally, you need to design the relationships between the entity sets. For example, makes is a relationship that tells us which product was made by what company. So where are the tables here? How many tables would you have to create in SQL to implement this entity relationship diagram? Well, let's take it slowly. How would you implement product in SQL? A table with those three columns. Exactly. Create a table called product, and it will have three columns, which are these three attributes. Obviously, the same thing for company, right? You create a table called company with two attributes. But now makes is supposed to tell us which product is made by what company. Yes. So can a product be made by only one company then? Let's say no. Let's suppose the same product can be made by several companies; it's like a commodity product, flour, it's made by several companies. So how can you implement makes? A link table product-company, for instance, that says who makes which product. You have a link table.
A link table, called ProductCompany, for instance, and it's got one row linking a company and a product. Exactly. Another table. He called it a link table, but I would just call it a table. It's going to be a third table, for makes. Now, what are its attributes? How many attributes do we need in this table? Two. One is the product's ID, and the other is the company's name. You've seen this, right? If you did the homework, you definitely saw this. What was the name of that relationship? Casts. OK. So this is an entity relationship diagram. Now, let's look a little bit deeper. So I already mentioned this: every entity set must have a key. Now, what happens if we underline two attributes? What does it mean? It's a composite key. What happens if you want to have two separate keys, if you want to be able to identify an entity either by this attribute or by that attribute, so both can serve as a key? You can't do this in an entity relationship diagram, and in SQL as well, you can only have a single key per table. You can say another attribute is unique, or you can create an index, but if you create an index, it doesn't mean it's unique. OK. So these are the entity sets. Let's look a little bit deeper at relationships. Relationships are just mathematical relations. Now, a relation, if you remember from algebra or from set theory, is defined as follows. If you take two sets, A and B, then their Cartesian product, which is this, represents the set of pairs, one element from A and one element from B. That's the Cartesian product. A relation is just a subset of this Cartesian product. So this is R. It's any subset. And we often represent this with edges. So we put A on the left and we put B on the right, and then every time a pair is in the relation R, we add an edge, like here. This is a relation. Relationships in ER diagrams are exactly relations, exactly mathematical relations. For example, makes is just a set of product-company pairs. That's how you should think about this. Now, here is an interesting definition. A relation is called one-to-one if every element on the left connects to at most one element on the right, and every element on the right connects to at most one element on the left. It's called many-one if, what did I say, every element on the left connects to at most one element on the right. And it's called many-many if none of the above happens. And of course, we can also have one-many. In ER diagrams, we represent these with arrows. We add these arrows. It's very handy in expressing certain constraints on the data. Let me show you an example. Before the example, I should point out that people use different conventions to represent many-to-one or one-to-one relationships, and also different notations for the entity sets and for the relationships, the rectangles and the diamonds. It's interesting: they are standard in textbooks. Textbooks use rectangles and diamonds, but none of the commercial systems use rectangles and diamonds. They use different representations. So don't get too attached to these notations. They are standard in academia and in textbooks, but feel free to recognize ER diagrams in any other format. In particular, our book decided to move the arrow from here to here. I think it makes more sense the way the book does it. But most books just place the arrow here, and I'm going to use that notation on the slides. So let's see how we can use many-one relationships.
Now I made makes a many-to-one relationship. What does it mean? Yes, but... what you said is correct, but it's not enlightening. I don't get it from that, yes. You don't need the link table anymore. That's true, but I want something deeper. You can add a column to the product table. Yeah, that's what he said. But what does this mean? What am I saying about my products and my companies? One product can be made by only one company. Exactly. If you look at a product, that product can be made by only one company. That's what I'm saying. So no more flour and butter and milk. If you have milk, milk is probably made by several farms, by several companies. No more. Now we have a different database in which a product is made by one company. Think about Apple and the iPod and netbooks. Everything is made by a single company. You're not happy. Did you have a question? Don't we still need the makes table, because a company could make multiple products? Sorry? Can a company make more than one product? A company can... So, good question. Let's turn this into a question. In this design, can a company make more than one product? Yes. It's many to one. Yes, because it's many to one. So having a foreign key in the company... So having a foreign key in company pointing to product would not work. But what would work? A foreign key to company in product. If you put a foreign key in product pointing to company. I have more detailed slides about this. What about this arrow? What does it say? A person can be employed by only one company. A person can be employed by only one company. If we don't put that arrow, it means that a person may consult for multiple companies or work for multiple companies. But if we put the arrow, then we insist that a person works for only one company. What about... Can we store customers in this table that don't work for any company? In this design, I said table, but the correct term is design. Can you have a person that doesn't work for any company but you want to insert into the database because he's a customer? That's not a foreign key. But the question... Don't think about the implementation. Don't think about the create table statement yet. Think about the ER diagram. What do you think does the arrow mean? There should be a company for every person. So if a person is a customer who does not have a company, then he cannot be in Person. No, that's actually incorrect. It's the opposite. We can have a person that doesn't work for any company. The arrow means every person must work for zero or one companies. And if you get confused, then you can go back to this slide, which I'm going to make clearer. You can see this, for example, here. You don't see it on the left, but you see it on the right: B has no matching element on the other side. So one question. Are we missing one diagram here? We have many to one, but not one to many. Yeah. We are missing the symmetric one, one to many. Okay. So I think we now have a better intuition for the arrows. Multi-way relationships. What does purchase mean in this case? It associates a product, a store, a person, and it has a date attribute. What does it mean? A purchase consists of a triple of a product that was purchased, a store where it was sold, and a person who bought it. So if you were to implement purchase in SQL, you need to create a table with how many attributes and what kind of attributes? Four. Yes? Three keys and one date. Exactly. We need four attributes: the key from products, the key from persons, the key from store, and the date when that purchase happened. Okay.
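As a rough sketch, with types and names that are assumptions rather than whatever is on the slide, the table could look like this; the convention for the primary key of a relationship is discussed a little later.
    CREATE TABLE Purchase (
        prodName     VARCHAR(50) REFERENCES Product,
        storeName    VARCHAR(50) REFERENCES Store,
        ssn          CHAR(11)    REFERENCES Person,
        purchaseDate DATE,
        PRIMARY KEY (prodName, storeName, ssn)
    );
Each of the three REFERENCES clauses assumes the corresponding table declares a matching primary key.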
So this is a multi-way relationship. Now, we don't really need multi-way relationships. We can always reify them. That's a term that is sometimes used in graph theory. We can replace purchase with a new entity set, and then it looks like this. Now we have promoted purchase from a relationship: we have promoted it to an entity set. But now we need to record, for every purchase, the product, the store, and the person involved in that purchase. Do you follow this? Because now I have a subtle question. This design misses arrows. Whenever we convert from a multi-way relationship to several binary relationships, we need to add arrows. Which arrows do we need to add? Let's look at ProductOf. Should I put the arrow here, the red arrow, or should I put the arrow here, the blue arrow? So think about it this way. A purchase: is it the case that a purchase has a single product? Or is it the case that a product has a single corresponding purchase? Neither. But what is a purchase? How do you think about a purchase? We just discussed it. It's a single transaction. So I think the arrow goes from purchase, pointing at product. So which color? The blue one. Yeah, that's the answer. Oops. Let me put the blue back. Think about it this way. A purchase is a record. It's a tuple that has the product ID, the store ID, and the person's social security number. And it has a date. That is what a purchase is. So clearly, if you look at a purchase, there is a unique product. We can just read it off; it's right there. So a purchase corresponds to a unique product. OK. Some more fun thoughts about design. What's wrong with the first design? A person can buy only one product. That would increase the savings rate a lot. Right? So we probably don't want this arrow. What's wrong with the second design? You can be president of several countries while the country has several presidents. Yeah. It works if it's a board of directors or something. You can be on several boards, and the boards contain several people. But this is really countries. OK. So we have from this room a creative proposal for how to make sense out of this design. Instead of countries, think about companies, and instead of president, think about a CEO. So yeah, a person can be a president of several companies, and a company might have several presidents. But when it comes to countries, that's not true. How would you design this better? Where would you put arrows? Both ends. Both ends. That's what I put. A country can have only one president, and the president is the president of only one country. However, if you actually think deeper, it's not so clear. One of these arrows is not so clear. Across time, countries have multiple presidents? Exactly. If you want a historical database... If this database is a snapshot of today's presidents, then you need both arrows. But if you want a historical database, a country over time has multiple presidents. Yes. So a couple of slides back, we applied a one-to-one relationship between two records, where it allows for the nullness of one or the other of them. You can have a constraint that says not null for that foreign key. Is there any way to represent in the diagram that there has to be one of each of these? Right. So let me repeat that question for the people at Microsoft. So in SQL, you can always declare that the foreign key is not null, and so you can enforce that every entity from the left has exactly one matching entity on the right.
How can you do this in ER diagrams? You can do this by replacing the arrow with a sign that I show later on, a semicircle. That's a convention, and you might find different conventions in other places. OK, more fun with design. What's wrong here? It's just our standard purchase between store and product. And we just record, in addition to the date, also the name and address of the person who made the purchase. You don't store what he purchased? We store what he purchased right here. What's wrong with this? Yes? You've got to repeat yourself a lot, because a person can purchase many products, but you're going to re-encode that every time. Exactly. A person may purchase multiple products. If you use this design, then you need to repeat the information about that person over and over again, every time he purchases a product. Yes? It may actually make sense given the possible historical data. But that's maybe a different discussion. I find a lot of arguments in favor of storing people, persons, in a separate table, in a separate relation. Even if, as was pointed out, when a customer buys a product for the first time, you don't have him in that table. And then, of course, you don't want to turn him away, so you need to insert him on the fly, and maybe then you don't have all the information. So you need to make some decisions. But there is a lot of value in consolidating these customers in a common table. So it's worth going through the effort and creating just this: a Person entity set, and now you have a three-way relationship. And the last one. What's wrong with this? Here I really did the right thing. So is there anything wrong? The dates. The dates. Don't overdo it. The dates make no sense. They are just strings. It doesn't make sense to have a separate entity set, a separate relation, for dates. You look unhappy. You would store them in a separate table? I was like, why would you even do that? That's just evil. But that's the purpose of the slides, to make you object. Good. So this was basic ER diagram design. And we touched upon this question: how do we actually map the ER diagram into SQL tables? Now I'm going to go into a little bit more detail, but I'm not going to go too slowly, because we already got the main idea. So the first task is to convert every entity set into a relation, into a table. And here it is. An entity set called product is converted into a table called product with exactly the attributes that you see there. Really no surprises here. In SQL, the command for creating a table is this. It's create table, and after that you list the name of the table, and inside you list the attributes. In this case, there are three attributes. And look how we declare the key: after product ID, we say that's a primary key. Any questions about the create table statement? The next task is to represent relationships. Shipment. A shipment, we discussed this, will be implemented by a relation, by a table, that has the key from the left, the key from the right, and its own attributes. So here they are: the key from the left, the key from the right, and its own attribute. It's very, very straightforward. So let's see the create table statement. It's right here. This is the key from the right. This is the key from the left. And here is the date. Now, what does references mean? It's a foreign key. References says that name is a foreign key to the shipping company, to the table on the right.
But the other two also form a foreign key. How do we say that? We can't say it after the attribute, because there are two of them. So we list them at the end, right here. We say here that the pair product ID and customer ID together form a foreign key into orders. Nothing deep here. It's just syntax. Notice that the primary key consists only of the two keys. It does not include the date. And that is a convention in ER diagrams. A relationship is uniquely defined by the entities that it connects, and they form the key. You do not include the attribute as part of the key. That's a convention in ER diagrams. If you want the attribute as part of the key, then you make that an entity set and create binary relationships. Any questions about this slide? OK. How do we represent this purchase? I think we discussed this, right? So I'm going to skip it. More features in ER diagrams: classes and subclasses and inheritance. And the example I'm going to use to illustrate inheritance is that of an entity set called product, and two other entity sets called software products and educational products. And the idea is that a software product is going to be a subclass of products. It's going to have extra attributes. So in ER diagrams, the sign is this triangle where we write is-a. So we say that a software product is a product. So what does it exactly mean? If you look at a software product, how many attributes does it have? Four. It inherits the three attributes of product, and in addition, it has its own attribute called platforms. How do you implement this? Clearly, we create a table for product. But how do you implement software products? Yes? If each one only has one extra attribute, I'd put them all in one table and have a type column. I see. So you would hack it. What about a very clean design? Think about a database system that doesn't understand nulls, so it doesn't allow you to put in any nulls. It's an idealized relational database system that's really based on mathematical logic, no nulls. Yes? Just make another table for the software product. The key is a foreign key to product, and then it just adds platforms as another column. Yes. So we create a new table called software products. And the interesting question is what attributes it has. And you mentioned a foreign key into product. That would be the name. And what other information do we have? Platforms. Just the platform. That's all we need. Yes? You could also do something where you model the subclasses as basically additional attributes. And then the product has a specific class ID, which tells the database which attributes are allowed. And then you have an attribute table that lists attribute ID, attribute value pairs. Sure. You can do this. This is what he proposed. But then if you have a record that is of the wrong class, you need null values, I mean, for every record. It doesn't have null values. It just has lots of records for a given product. Oh, I see. Because you're essentially partitioning the records, partitioning the table vertically. Yeah. One row for every attribute. If I have a great many varieties of products, that might be a good idea. Yep. Let's actually discuss this. It's an interesting thought. Let me show you the design that I would like everyone to understand and use on the homework. It's like the baseline design. And then we can discuss variations. So the design that I would like you to remember is this.
There is one table for product, which contains exactly the attributes that you see here in the entity relationship diagram. And then there is another table for software product that has two interesting things: it has a foreign key, and it has this extra attribute for platforms. Let me write it on the screen: platforms. And then there is the educational product, which also has a foreign key, and it has its own special attribute called age group. Does it make sense? It's pretty straightforward. What your colleague proposed is that you can push this idea even further. You could create a separate table for every attribute. That's called vertical partitioning. And sometimes it makes sense, especially if the application accesses only one attribute at a time. It makes much less sense if you need to join, if you need to access multiple attributes at the same time. So it's a design choice. And it was a trend in database systems recently, when essentially Mike Stonebraker's group reinvented column-wise organization of databases. And that was another scandal, because he didn't invent this, but he essentially reinvented and popularized the idea. And the idea, again, is to take a table with many attributes and store it in separate tables, one for each attribute. And that speeds up certain query workloads. Yes? In the same vein, doesn't this design make it more difficult to do aggregate queries that compare software products and educational products, because you have to do joins? Absolutely. So this design is very elegant, I hope you see it's elegant. But whenever you have a query over software products that needs to touch both the platform and, let's see, the category, then you need to join. They do not sit in the same table. Yes? Even worse than that, I've been doing decision support kind of stuff that involves the various types of products, and it has limitations on platforms and age groups and products. It gets much more complicated if it's split out this way. OK, so I'll take this criticism. Did we discuss data independence? We did not discuss this. I'll come back to this. So take this design as the lecturer's recommendation for the design of this database. You can consider others as well. But for the purpose of the second homework, please use this design. It simplifies grading. And we'll come back to the data independence principle. I want to show you a cute application of inheritance. How do you model union types? A union type is essentially a type that consists of the union of two different types. You want to put them together in the same type. And the example here is that we have three entity sets: for instance, furniture pieces, persons, and companies. And every piece of furniture belongs either to a person or to a company. How do we say this? How do we represent in an ER diagram that every piece of furniture belongs to a person or to a company? You could make a superclass, owner, of both. Exactly. That's how we would like to do it. We would like to create an owner entity set that consists of all the persons and companies put together. Now they are owners. They are the same thing. This is a union type. But before we get there, there is a simpler choice. We can say, look, the piece of furniture is owned by a person. But that's not enough. We also need to allow it to be owned by a company. So this would be a second owned-by relationship. But this design is somewhat unsatisfactory. What is it missing? No, no, it's not that it's missing something; it allows us to represent things that we don't want to represent.
It allows multiple owners, one of each type. Exactly. You could say this chair belongs to this person and to that company. Well, that's a bad design. Yeah, you could hack it with a trigger that enforces that each piece of furniture has only one owner. Sure, you can hack things. But before we hack it, let's come up with the cleanest design, and the design that I want to show you uses the union type. So this is what you said: the union type defines a new entity set called Owner, and the owners are all the companies and all the persons. It makes sense: if you're using them in a unified way, as owners of furniture, why not create a new entity set that represents all the owners? Then every person is an owner, every company is an owner, and now you can simply say that a piece of furniture is owned by a single owner. This is the kind of design that allows us to express exactly what we wanted. Okay. How many slides do I have until we reach normalization theory? I think it's a good time to take a break. So let's take a five-minute break, not longer. We still have things to cover — I think we should be able to finish tonight with the Boyce-Codd normal form; it's not that hard. So a short five-minute break. Okay, good time to restart. Let's finish ER diagrams. I only have a few things to show you, namely constraints and then weak entity sets. Constraints are statements — essentially logical statements — that must hold at all times on a database instance. They are part of the database design, and we need to be aware of what they are. Many types of constraints can be represented in ER diagrams, in SQL, or in both, and we will discuss constraints both in ER diagrams — that takes us five minutes — and later in SQL. The constraints are of several types: key constraints — you know what they are, I'm not going to spend time on them; single-value constraints — this is the arrow; referential integrity constraints — this is whether we allow nulls or not in a foreign key; and then more complicated constraints. So let me go quickly. Keys: we have seen key constraints; I have nothing else to add, we discussed keys. Single-value constraints are the arrows in ER diagrams: if we put an arrow, it means that an object on one side can have only one value on the opposite side of the relationship. Referential integrity essentially means that the value must refer to an existing object. This is how we should enforce that a product is made by exactly one company, and the way to think about it operationally, at the implementation level, is that if you have a foreign key from Product into Company, that foreign key cannot be null. That is referential integrity: it must refer to an existing object. Finally, there are more complicated constraints that, depending on what formalism you adopt for ER diagrams, you might or might not be able to represent. What does this one say — this thing here, less than 100? What can it say? Fewer than 100 products are made by that company; fewer than 100 products are made by every company. It is much harder to enforce in SQL, but if this is what you want, there is a way to do it in SQL with a general assertion. The last thing — the book mentions them, so I need to mention them as well — is what are called weak entity sets.
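Before we get to those, here is a quick recap in code of the two designs we just settled on — the baseline inheritance design and the owner union type — written as SQL DDL executed from Python with SQLite. This is only a sketch: the table and column names are my own (I'm guessing name, price, and category as Product's three attributes, since those appear later in the lecture), and the surrogate ownerId is my own addition, not something from the slides.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    -- Baseline inheritance design: one table per entity set; each subclass
    -- holds a foreign key into the superclass plus its own attributes.
    CREATE TABLE Product (
        name     TEXT PRIMARY KEY,
        price    REAL,
        category TEXT
    );
    CREATE TABLE SoftwareProduct (
        name      TEXT PRIMARY KEY REFERENCES Product(name),
        platforms TEXT
    );
    CREATE TABLE EducationalProduct (
        name     TEXT PRIMARY KEY REFERENCES Product(name),
        ageGroup TEXT
    );

    -- Union type: Owner collects persons and companies, and every piece of
    -- furniture refers to exactly one owner.
    CREATE TABLE Owner     (ownerId INTEGER PRIMARY KEY);
    CREATE TABLE Person    (ownerId INTEGER PRIMARY KEY REFERENCES Owner(ownerId), name TEXT);
    CREATE TABLE Company   (ownerId INTEGER PRIMARY KEY REFERENCES Owner(ownerId), name TEXT);
    CREATE TABLE Furniture (
        furnitureId INTEGER PRIMARY KEY,
        ownerId     INTEGER NOT NULL REFERENCES Owner(ownerId)
    );
    """)
    print("tables:", [r[0] for r in con.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")])

The NOT NULL on Furniture.ownerId is what encodes "every piece of furniture is owned by a single owner"; for the homework you only need the schemas themselves, not the code.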
So, weak entity sets. The idea is that sometimes you don't have enough information to form a key for your entity set; instead, you need to borrow one from another entity set to which you have a many-to-one relationship. The example I'm using is one with universities and teams. Maybe your teams have a sport and a number: you have a squash team, and there is squash team number one, squash team number two, squash team number three. So the only way you can identify teams is by their sport and number. But if you look at a different university, they also have a squash team number one, squash team number two, and so on. So you need more information to have a key for the team, and that information comes from here, from the university. The definition of a weak entity set is that its key consists of its own key plus the keys of all the entities to which it refers through a weak relationship. So how would you represent this? Clearly we have a table for University, but how would we represent Team? How many attributes do we store in Team? It's three, right? We need the sport, we need the team number, but we also need the name of the university. So we need three attributes. Yes? Do you then want to use the rounded arrow on the university side instead of the regular arrow? Let me erase, because I put too many things here. You're asking why I used the regular arrow instead of the rounded one, the one that means exactly one referenced entity? You're right, it makes sense, but every book uses the regular arrow; it's probably hardwired into the semantics of the weak entity set. Notice these two lines? That is important information: it says this is the relationship you need to follow in order to get the key. Again, this depends on what conventions you adopt for ER diagrams, but the convention here is that whenever you have a weak entity set, the arrow on the supporting relationship effectively behaves like the rounded one. Okay, so I have a fun question for you. Are weak entity sets clear? Because if they are, I have this big question. Here are some weak entity sets, and my question is: you look at R, and you have to tell me what the keys of R are. So I'm going to list them here. A is part of the key. What else would you include in the key? C — you're faster than I am; you go this way and then you get to C, right? What else? D, because from here we go this way and we get to D. Sorry? H, because we also follow this weak relationship and we get to H. What else? E and F, because we go this way, we get to F, and then we also get here, so E and F. And because we can also go this way, we get K. What about L and G? No, we don't get L and G, because this is not a weak relationship. This is why the example is a little bit tricky: we only follow the weak relationships; we don't follow the regular relationships. Good. And this finishes my discussion of entity-relationship diagrams. Does that mean that all those weak entity sets in the previous slide will have the same keys? Will they have the same keys? Good question. What about S? What are the keys of S? K, E, F, H — is that it? Yes, and no A. No A because — let me erase — the arrow is also important; you only follow the arrow in one direction. So, what are the keys of S? Here they are: they are H, K, F, and E, and that's it. But normally, for a database like this, wouldn't we take L as well? No. If you want L, then you need to draw it like this.
You need to make this relationship a weak one. But isn't that also a many-to-one relationship? Yes, but it's drawn differently. So the question is: why don't we take L as part of the key? We don't take it because of this convention for how we draw relationships. If we draw them with two lines, then keys are borrowed from the other side; if it's a single line, then it is a standard relationship and no key is borrowed. Any more questions? Please look at this at home and let me know next time if you have questions. It's a relatively simple example; once you get the logic, it's obvious. Good. Now we get to the most mathematical — the most formal, if you want — part of the entire course, which is called design theory. It was developed in the late 70s and early 80s, and it is very widely used today in database design. It's how to make sure that a database design doesn't have so-called update anomalies. So let's go through this. The idea is that once you finish your basic design of a database schema, once you finish your ER diagram and translate it into tables, the resulting design may not be good; it may have some problems. In order to make it good, you need to transform it, and depending on how you transform it, you get to one of several so-called normal forms. These were introduced historically at different times, and they sometimes capture different things. The first normal form is completely different from the others; it says that all relations are flat. The second normal form — forget it; it's obsolete, nobody discusses it anymore. The normal form that we will study today is called the Boyce-Codd normal form. There is another one that's very popular, called the third normal form, which I'm not going to discuss in class because it's messier and more difficult, and for practical purposes it doesn't make such a big difference compared to the Boyce-Codd normal form. The book has a good treatment of the third normal form, so you can read it there. I'm going to go over the first normal form in one slide and then spend the rest of the lecture on the Boyce-Codd normal form. What does the first normal form say? It says you should not have nested tables. If you want a table like this, where for every student you want a nested table inside the student's record giving all of that student's courses — why can't you have it? Well, you don't need it. You can always represent this information by flattening it, by transforming it into first normal form, where the students are in one table, the courses are in another table, and in between we have a relationship — a many-to-many relationship called Takes. And that is the right design. Everybody follows the first normal form: all tables are flat. Now, in today's SQL systems you can also represent nested relations, and there was a lot of theoretical work, essentially at the end of the 80s and early 90s, on how to extend query languages to deal with nested relations. But as a matter of design, it's much better to start with first normal form and flat tables. Good. The Boyce-Codd normal form is much more involved. What it does is start from the relational schema that you obtained from your ER diagram, identify certain bad things that happen in the schema, and split the tables vertically — it does some vertical partitioning of the tables — and the result is a much better design.
And that's called the Boyce-Codd normal form design. So what can be wrong with a schema? Well, three things can go wrong, and they're actually related. They are called data anomalies, and they are well known among database practitioners: the redundancy anomaly, the update anomaly, and the delete anomaly. So let's see them, and we're going to go a little bit slower over this example, because it's an important concept that I'd like you to take away from today's lecture. This is a table where we store people: their name, their social security number, their phone number, and their city. And it's a great table — what's wrong with it? Of course, some people have two phones, like Fred, and that's okay: we store Fred twice in the table because he has two phone numbers. But this design exhibits all three anomalies. Actually, the three anomalies are not so distinct; they're facets of the same problem. The first is redundancy: data is repeated. Look at which data is repeated here — it's the city. This tells you where Fred lives, and it tells you twice that Fred lives in Seattle. Which means you can get it wrong: you can place Fred in two different cities just because he has two phone numbers. That's the redundancy anomaly. Update anomalies: as a consequence of this redundancy, when Fred moves to Bellevue, what do you need to do? Update both records. What happens if your application has a bug and only updates one of the records? Then you're in trouble, right? You assume that every person lives in a single city, but now, because of the bug in your application, Fred suddenly lives in two cities. That's the update anomaly. The deletion anomaly: Joe, who is ahead of his time, doesn't use a phone anymore; he uses only email. So how do you delete Joe's phone number? No nulls, please — you're not allowed to use nulls. How do you delete a phone number? You just remove that record. But now what happens? What is the anomaly? We don't have Joe anymore. That's the delete anomaly: you wanted to delete just a phone number, but you ended up losing more information. OK. You know how to do better. How would you design this table? SSN and city in one table. So please: SSN and city in one table, and? Name and SSN in one table. OK, so we also place the name there. So again: SSN and name in one table. Where do I place city? In the same table. So this is one table, the table T1, with SSN, name, and city. And then in the table T2 we place SSN and phone number. That's a much better design. In this way, since SSN is a key of T1, every person lives in one city, and if you want multiple phone numbers, or no phone number at all, you can have as many or as few as you wish. OK, that is a much better design. It's right here, and we get there by decomposing the table into two, into this T1 and this T2. Notice that in the first one SSN is a key, but in the second one SSN is not a key, because we emphatically wanted to allow a person to have multiple phone numbers. Here the key consists of both attributes, SSN and phone number.
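To make the decomposed design concrete, here is a minimal sketch in Python with SQLite. The table and column names, and the sample values, are my own; declaring SSN the primary key of T1 and the pair (SSN, phone) the primary key of T2 is just one way of stating the keys we arrived at.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    -- T1: one row per person; SSN is the key, so each person has exactly one city.
    CREATE TABLE T1 (
        SSN   TEXT PRIMARY KEY,
        name  TEXT,
        city  TEXT
    );
    -- T2: phone numbers; the key is (SSN, phone), so a person may have zero,
    -- one, or many phones without the city being repeated anywhere.
    CREATE TABLE T2 (
        SSN   TEXT REFERENCES T1(SSN),
        phone TEXT,
        PRIMARY KEY (SSN, phone)
    );
    """)
    con.execute("INSERT INTO T1 VALUES ('123-45-6789', 'Fred', 'Seattle')")
    con.executemany("INSERT INTO T2 VALUES (?, ?)",
                    [('123-45-6789', '206-555-1234'),
                     ('123-45-6789', '206-555-6543')])

    # Moving Fred to Bellevue is now a single update, and deleting one of his
    # phone numbers removes only a row of T2, never the person himself.
    con.execute("UPDATE T1 SET city = 'Bellevue' WHERE SSN = '123-45-6789'")
    con.execute("DELETE FROM T2 WHERE phone = '206-555-6543'")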
Just to make sure: all three anomalies have disappeared. There is no more repeated information about where Fred lives — Fred lives in Seattle, and that's all we record. If we want to move Fred to Bellevue, we just move him; it's a single update. And if we want to delete Joe's phone number, we delete one record from here, and we don't lose the other information about Joe. So the question we are going to address is: how do we get from the top design to the bottom design? What was wrong with the top design that forced us down to the bottom design? Essentially, what was wrong was that every person is supposed to live in a single city, and yet the city was repeated. That was the essence of the anomaly. And this is explained by a beautiful theory, the theory of functional dependencies, which is what I'm going to show you next. A functional dependency is a constraint — a statement about what must hold in the data. The precise definition is here, so let's look at it carefully. A functional dependency is written like this, and it says that some attributes functionally determine other attributes: the A attributes, in this case, functionally determine the B attributes. What does it mean to say that in a table the A attributes functionally determine the B attributes? If the A's form a key, then the functional dependency always holds; but we will look especially carefully at functional dependencies where the A's are not a key. Here is what it means: if you find two tuples that have the same values for the A's, then they must have the same values for the B's as well. Written mathematically: for any two tuples t and t' in the relation, if t.A1 = t'.A1, ..., t.Am = t'.Am, then t.B1 = t'.B1, ..., t.Bn = t'.Bn. That is the definition of a functional dependency. Now, going back to our example, can you give me a functional dependency? Yes? Name determines social security number? Or name determines social security number and city? Where do you put the arrow? Yes. Actually, there is an even simpler one — here is a functional dependency, I hope you're with me: the social security number functionally determines the city. Social security number is not a key — look, it occurs twice — but the two tuples that agree on it also agree on the city. And it is actually a statement we are making: every social security number — which essentially means every person — must be in a single city. It's a constraint that is part of our application, part of the information that we need to identify when we design the database schema. Good. So let me go back to my example. I have here other examples. If you look at this table, I claim that this functional dependency holds — let me start with the second one, the blue one: I claim that position determines phone. Do you agree? Well, we need to find pairs of tuples that have the same position. Can you find two such tuples? The second and the third, these two. Then we need to check that they also have the same phone. Do they? Yes, right here. So this functional dependency holds.
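Since the second homework asks you to reverse-engineer functional dependencies from about a thousand records, here is a small sketch of what checking an FD on a concrete table instance looks like in code. It follows the definition directly; the little employee table is made up for illustration (it is not the one on the slide), and, as the next exchange points out, a dependency holding on one instance does not mean it holds in general.

    def fd_holds(rows, lhs, rhs):
        """Check whether the FD lhs -> rhs holds on this table instance.

        rows: list of dicts mapping attribute name -> value.
        lhs, rhs: lists of attribute names.
        Returns False as soon as two tuples agree on lhs but disagree on rhs.
        """
        seen = {}
        for t in rows:
            key = tuple(t[a] for a in lhs)
            val = tuple(t[b] for b in rhs)
            if key in seen and seen[key] != val:
                return False
            seen[key] = val
        return True

    employees = [
        {"empID": 1, "name": "Alice", "position": "clerk",   "phone": "1234"},
        {"empID": 2, "name": "Bob",   "position": "manager", "phone": "1234"},
        {"empID": 3, "name": "Carol", "position": "manager", "phone": "1234"},
    ]
    print(fd_holds(employees, ["position"], ["phone"]))   # True: each position has one phone
    print(fd_holds(employees, ["phone"], ["position"]))   # False: phone 1234 is shared by two positions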
So just because you have that property on this instance, that doesn't mean it's always going to hold in the future, right? Like you wouldn't say that city determines social security number in the earlier example. This is a very important observation, and a common source of confusion. The fact that it happens to hold on this small table on the screen — that position determines phone — doesn't mean it will always hold. Examining the data is not the way you should discover functional dependencies — except on the homework, where you have exactly such a question: I give you 1,000 records, and you have to discover the functional dependencies. In practice, how do you find out about functional dependencies? You think about it: during your requirements analysis, you come to understand the application you need to design, and the logic of that application tells you which functional dependencies hold. For example, imagine yourself talking to your customer and he says: oh, in our company we save on phones, and every position has a single phone. It's a rule in our company; it has been there for fifty years. If this is what they tell you, what functional dependency do you write down as part of your constraints? This one: position determines phone. Because you know that for every position there is a single phone number. Does this make sense? That's how you collect the functional dependencies. Now let's get back to this particular example. I claim that the green one also holds: employee ID determines the name and the position. Why is that? Yes? Because it's a primary key. Right — if you think back to the definition, it's because you can't find two distinct tuples with the same employee ID, so the condition holds vacuously. But phone does not determine position. Can you give me a counterexample? The two tuples with phone 1234, yes: this tuple and this tuple have the same phone but different positions. This company saves so much that they actually share the same phone number between two different positions. Okay. So that is how we should think about functional dependencies. I'm going quickly through several details, but I invite you to think at home about the other two examples, this one and this one, and check which functional dependencies hold and which don't. But I want to show you something very interesting. I claim that if you have a table — any table — on which these three functional dependencies hold, then this fourth functional dependency will hold as well. In other words, once you list some functional dependencies, those are not all: others must hold by necessity; others follow from them. Why is that the case? Well, we need to reason according to the definition. Imagine the table, and imagine two tuples in this table. We need to check the blue functional dependency, right? So these two tuples agree on name and category — think of name and category somewhere here; they have the same values there. We need to prove that they also agree on the price, that name and category uniquely determine the price. How do we prove this? We look at the red functional dependencies; we know that all three of them hold. Which ones do we use? The first one and the last one.
So you use the first one and say: look, because they agree on the name, which is here, they must also agree on the color — picture the color here; they agree on the color as well. And now color and category — if you followed my little dots, these are the color and the category — they agree, and therefore, by the last functional dependency, they must also agree on the price. And bingo: the two tuples agree on the price, which is exactly the blue dependency. Now, the question to ask is the following. I made a claim, which I didn't state explicitly, but I'll make it now: by identifying the functional dependencies, this is how we will identify that a design is bad. The anomalies — the update anomalies — are tied to certain functional dependencies. But what we just discovered is that once we list three functional dependencies, that is not all: there are other functional dependencies implied by them, and therefore we must find them all. What is the algorithm, what is the method, that allows us to start from a set of functional dependencies and compute all the other functional dependencies that are logically implied by it? This is a piece of theory — a beautiful piece — developed by Armstrong in the 70s, and it resulted in three simple rules, called the Armstrong rules. You'll find them in different variations in the literature; the simplest expression is what we have on the slide right here. I don't know whether the book follows this or a different variation. I'm going to show you the Armstrong rules in their simplest instantiation. Again, keep in mind the claim: these are all the rules you need in order to discover all the functional dependencies implied by a given set. So let's see how complicated these rules are. The first is called splitting and combining, and it says that a functional dependency where some A's determine some B's is equivalent to several functional dependencies where you list the B's separately. If you have the rule at the top, you can replace it with the rules at the bottom, and vice versa: if you have the rules at the bottom, you can replace them with the rule at the top. Splitting and combining. Do you agree — does this rule hold? It holds trivially; it's a very simple rule. Now the second rule — now you expect to see a hard one — the second rule is actually even more trivial, and it is called the trivial rule. It says that a set of attributes always functionally determines any one of its members. Nothing to check here: if two tuples agree on all these attributes, then of course they agree on any one of them. So where is the beef? Which rule will do the heavy lifting? Yes? It sounds like you're just defining some algebraic axioms. I'm defining a set of axioms, if you want, that, when applied, allow us to derive all functional dependencies. It just sounds like the standard axioms you'd write when doing algebra. Well, the question is whether these are some standard axioms from elsewhere — they are not; they are exactly what I called them, the Armstrong rules. They apply only to functional dependencies.
Now, if you dig deep into theory, you can cast them in different formalisms, but they were essentially discovered in the context of functional dependencies, and they are quite distinct from other rules you might have encountered, like deduction rules in logic or algebraic rules for proving identities. They are rules in their own right that allow us to deduce new functional dependencies. So I think what you're trying to do is connect them to something you might have seen in algebra, and you won't quite be able to — that's my point. They are specific to this particular problem of reasoning about functional dependencies. Let me show you the last one, the single one that is somewhat less trivial, let me put it that way. It's called the transitivity rule, and it says that if the A's determine the B's and the B's determine the C's, then the A's determine the C's. Why is that? Because, look: if you have two tuples that agree on the A's, then by the first dependency they must agree on the B's, and by the second they must agree on the C's. Therefore any two tuples that agree on the A's also agree on the C's. Okay, they are very, very simple rules. So now let me show you how we put them into action and derive quite interesting consequences from a given set of functional dependencies. Starting with these three, the blue ones here, I want to derive the last one, the red one, this one here. I'm going to do it step by step, deriving some intermediate functional dependencies until I reach the last one. And my question to you is: which rule do I need to use? Look at number four. How can you derive that name and category imply name from the blue ones? Yes? The second rule — what's the name of that Armstrong rule? The trivial rule. Yes, it's the trivial rule, because name is right there on the left. Now I see why you said the second one. How do we derive number five, name and category determine color? The transitivity rule — using which dependencies? Transitivity needs to combine two of them. It's four plus one: four takes me from name and category to name — not to color — but then one, name, takes me to color. Combine them, and from name and category we get to color. Exactly what we need; it's a very mechanical application of the transitivity rule. How do we get six? The trivial rule. How do we get seven? Splitting and combining — actually just combining. Which ones do we combine to get seven? Five and six, because they have the same left-hand side; we combine them and get color and category on the right. And how do we get eight? So far we have name and category implies color and category — we are not there yet. How do we get eight? Transitivity between seven and three: seven takes us from name and category to color and category, three takes us from color and category to price; therefore, by transitivity, name and category take us to price. This is hard, right? You don't want to do this by hand, and you don't even want to write a program that blindly combines these rules — it's difficult. So I'm going to show you a different way to find all the functional dependencies implied by a given set, using a notion called the closure. Here it is. Think about a set of attributes A1, ..., An.
The closure of this set of attributes consists of all the attributes that are functionally determined by A1, ..., An — as much as you can determine by applying the functional dependencies repeatedly. For example, let's look at name-plus, the closure of name. Ask the question: what does name determine? Clearly it determines name itself. Anything else? Color. Anything else? No, I don't see anything else we can get by applying these rules. What about name and category? Well, name and category clearly determine name and category; what else? Color, because name determines color. What else? Department, because category determines department. And because color and category determine price, we also get price. So the closure of name and category is much bigger. What's the closure of color? Color is an orphan; it doesn't determine anything, so the closure of color is just color itself. The algorithm for computing the closure is actually quite simple — it's a standard closure operation, like a closure in algebra. You start with the set X. At each step you look for some functional dependency in your given set — you don't have to consider any others — whose left-hand side is included in X. If you find one, you add its right-hand side to X. You repeat this as long as you can increase X, and when you can't increase X anymore, you stop. The result is called the closure of X. Let me show you this on a different example. Any questions so far about the closure? So let's see the closure in action. What is AB-plus, the closure of A and B? Clearly it contains A and B. What else do we list here? C, because of this dependency. Then D, because B determines D. And now E, because we now have A and D, which determine E. And that's it. What about AF-plus? Everything, right? Because once you have A and F, you get B, and then all the rest follow, so C, D, E — all the attributes. Here's some intuition for how we can speed up this closure computation: the closure is monotone. Since B is in the closure of AF, the closure of AF contains everything in the closure of AB, so you can simply copy what you computed before. So now, how do we use closures to compute all the functional dependencies? Well, the functional dependencies we need to find are things of the form: X determines some attribute A — because by splitting and combining, we can always restrict the right-hand side to a single attribute. Once you have the closures, the question "does X functionally determine A?" becomes "does A belong to the closure of X?". In other words, the functional dependencies logically implied by the given set are exactly those of the form X implies A, for A in the closure of X. Let's go back, because you've seen this example. What interesting functional dependencies can you give me for AB? What does AB imply? What did we just discover by computing the closure? Everything else: C, D, E. That's the notion of closure. Should I also write A and B on the right — should I write AB implies A, B, C, D, E? We could, but AB always functionally determines AB; that's just the trivial rule. So we can spare ourselves the clutter by dropping those and just writing C, D, E. What can you tell me about AF? What does AF functionally determine? B, C, D, E.
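Here is the closure algorithm written out as a short Python sketch, using the product dependencies from a few slides back (name determines color, category determines department, color and category determine price). The function names are my own; this is just the fixpoint loop described above, plus the superkey test that comes up in a moment.

    def closure(X, fds):
        """Closure of attribute set X under a list of FDs given as (lhs, rhs) sets."""
        result = set(X)
        changed = True
        while changed:                       # repeat until nothing more can be added
            changed = False
            for lhs, rhs in fds:
                if lhs <= result and not rhs <= result:
                    result |= rhs            # left side is inside X+, so add the right side
                    changed = True
        return result

    def is_superkey(X, attrs, fds):
        """X is a superkey iff its closure contains every attribute of the relation."""
        return closure(X, fds) == set(attrs)

    fds = [
        ({"name"}, {"color"}),
        ({"category"}, {"department"}),
        ({"color", "category"}, {"price"}),
    ]
    attrs = {"name", "category", "color", "department", "price"}

    print(closure({"name"}, fds))                         # {'name', 'color'}
    print(closure({"name", "category"}, fds))             # all five attributes
    print(closure({"color"}, fds))                        # {'color'}
    print(is_superkey({"name", "category"}, attrs, fds))  # True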
So the closure is a shortcut for finding all the functional dependencies. It's also a more efficient way to encode them: you don't need to list them all; if you have the closures, you have all the information you need. Good. Now that we have this, we are making pretty good progress. I'm going to skip this example here, but please go over it at home. We need one more notion in order to understand how functional dependencies help us avoid the data anomalies, and that is the notion of a key. We haven't defined it in this theory yet, but obviously, if you have a set of attributes that functionally determines all the others, then that set of attributes acts as a key, right? Because you can't have two distinct tuples that agree on that set of attributes: those two tuples would have to agree on all the attributes, and then they would be duplicates, which a relation cannot contain. So that's the definition of a key in the theory of functional dependencies, and it corresponds exactly to the keys you know from SQL and ER diagrams. But there is a caveat: we don't actually call that a key; we call it a super key. A super key is a set of attributes that functionally determines all the others — it's right here — and a key is a super key that is minimal. Did we lose the connection? Rod, can you please help us? We seem to have a technical problem here. Can you hear me at Microsoft? Yes. Okay. Can you see me moving? Yes. Oh, I can't see myself moving — that sounds like a medical problem. Okay, so this is a key: a minimal super key. Let's see this in action. Here is a good example: what are the keys for this set of functional dependencies? Is color a key? No — color is not a key; it doesn't functionally determine anything, and you need to determine all the attributes to have a key. So what is a key? Name and category — because name and category determine the price, and category by itself determines the color. So that is a key. How would you quickly check whether a set of attributes is a key? If I give you a set of attributes, call it X, how do you check whether X is a key? You compute the closure: if X-plus is the set of all attributes, then X is at least a super key. And if, in addition, you can't make it smaller and still have a super key, then it's a key. Okay, let me see. I have more examples here, and I'm not going to go slowly over them; I think we have enough information. Please try this at home: find all the keys for this example, just to make sure you understand the notion of keys. Finally, we get back to anomalies. Remember what we discussed? There was a functional dependency that bothered us: the social security number by itself determines the city. We didn't like that; we wanted to take social security number and city and put them somewhere else. That's the idea: a functional dependency is okay as long as its left-hand side is a key, or a super key. But if the left-hand side is not a key — if it's just some set of attributes — then it's not okay, and we have to take it and put it aside. Actually, I have the example again. What is the key in this table, assuming that the only functional dependency we have is this one — the social security number determines the information about the person, but not his or her phone? What is the key in this table? Is it SSN and phone number?
It's SSN and phone number, exactly — that is the key. So then, is this functional dependency okay? It is not okay, because SSN by itself is not a key. That is the essence of the anomalies; this is why we have all of them. Whenever we update or delete, we have to account for this hidden functional dependency between some attributes that do not form a key and another attribute. That is what we want to avoid, and it is exactly what the Boyce-Codd normal form does. So let me skip this and give you the definition of the Boyce-Codd normal form. The definition is as follows: a relation is in Boyce-Codd normal form if every functional dependency, A's imply B's, is either trivial — we can't avoid the trivial ones — or has A's that form a super key. In essence, it says: no functional dependencies except those that come from keys. Now, if a table satisfies the Boyce-Codd normal form, then the anomalies disappear, and you will see this once we look at examples. But if a table does not satisfy the definition, then we need to normalize it: we need to split it into smaller tables until every table satisfies the Boyce-Codd normal form. How do we split it? Here is the main idea, but I will show you a better algorithm in a couple of slides. The algorithm you get directly from the definition is the following. You choose one of the functional dependencies that violates Boyce-Codd normal form. So the A's are not a key or a super key, which means that the A's and B's together are not all the attributes — there are some other attributes besides them. It also means that the B's are not all included in the A's, because otherwise the dependency would be trivial. To split the table, you create one table with the A's and B's together, and a second table with the A's and the other attributes. And if you follow carefully, all three regions of the Venn diagram are non-empty, so this decomposition makes some progress; it is not vacuous. That's because the functional dependency is a Boyce-Codd normal form violation, which means there are attributes outside the A's and B's, and because it is not trivial, which means there are B's that are not A's. But you can ignore that slide completely, because I'm going to show you a much better algorithm, which is right here — and I'm not sure the book contains it. I think it's a much better way to describe the Boyce-Codd normalization process, so please try to follow this carefully. It says: start with a table and pick a set of attributes X such that, when you compute its closure, the closure is not the set of all attributes. What does that tell you about X? If X-plus is not the set of all attributes, what is X, or rather, what is it not? It's not a key; it's not even a super key. And also we require that X be different from X-plus — we have seen examples where you compute the closure and you don't get anything new. What does it tell you if X is strictly smaller than X-plus? Yes? If X were equal to X-plus, then X would only give you trivial functional dependencies; since it is different from X-plus, there is some non-trivial functional dependency out of X. Okay. So you find such an X — that's your first step. Once you've found it, you simply split your table into the following two tables. One consists of X union Y.
This is just X-plus, because I denoted by Y the difference between X-plus and X — the new attributes. So one table contains all the attributes in X-plus, and the other table consists of X and the rest, the attributes that are not contained in X-plus. Let me erase what I wrote so you can read the algorithm more carefully. So again: we find a set X such that its closure is strictly bigger than X but not big enough to contain all the attributes — that by itself witnesses a violation of the Boyce-Codd normal form — and then we split the table into two sets of attributes. One set contains X-plus; maybe the better notation here is to just write X-plus. The other table contains X together with all the attributes that are not in X-plus. Okay, so let's see this in action, and I'm going to use this example here. Let's go through it together, slowly. Here is a table Person. We have a name, a social security number, an age, a hair color, and a phone number. And somebody tells us that the functional dependencies that hold are these two: the social security number determines the name and the age, and the age determines the hair color. What are the update anomalies? Hard to see at this level, right? It's much better if we check for violations of the Boyce-Codd normal form, and if we find some, then we normalize. So help me find a set X that witnesses a violation of the Boyce-Codd normal form. SSN — let's try it; if it's wrong, we try again. What is SSN-plus? It should be strictly bigger than SSN, but not too big. What is it? Starting from SSN, we get name, age, and hair color. So it's clearly bigger, but is it too big? Does it contain all the attributes? No — it's missing the phone number. So this is a violation of the Boyce-Codd normal form rule, and we split the table Person into two tables; let me call them R1 and R2. R1 contains which attributes? All of these: SSN, name, age, and hair color. And R2? SSN and phone number — the other attribute, the one that was not included. What are the keys? What is the key in R1? SSN — how do you know that so fast? What is its closure? The closure of SSN is everything in R1; that's how we built R1. That's an important observation: the way we build R1 ensures that SSN is its key. What is the key in R2? SSN and phone number — emphatically not SSN by itself, because SSN has nothing left to determine in R2; we kept nothing that SSN determines. So the key is both attributes. This is how we normalize into Boyce-Codd normal form. But actually, are we done? It's not clear yet: we need to keep checking whether the resulting relations are in Boyce-Codd normal form. Let's do this, starting with R2. Is R2 in Boyce-Codd normal form? Well, you need to check three sets of attributes: SSN, phone number, and both together. You check them, and it turns out that, yes, this one is in Boyce-Codd normal form, so here we are done. But let's look at R1. R1 has more attributes. Can you propose a set X of attributes in R1 that violates the Boyce-Codd normal form definition? Yes? Age — let's try age. Let's compute its closure: what is age-plus? The beauty of this process with closures is that to compute the functional dependencies on R1, we just look up the ones at the top and remember not to include the phone number — we simply skip it wherever it appears. So age-plus is age and hair color. Clearly it's bigger than X.
This is good, but isn't it too big? Does it contain all the attributes? No, it's missing name. So how do we further decompose R1 into R3 and R4? R3 will have age and hair color — our X-plus. And R4? Age and, yes, SSN and name. What are the keys? The key of R3 is age. The key of R4 is SSN, and that's essentially because SSN-plus is SSN, name, and age. Are we done now? We still need to check R4, but it turns out that R4 is already in Boyce-Codd normal form, so now we are done. So what is our final design? How many tables do we have in our final database design? Three tables, namely R2, R3, and R4. These are the three tables, and the keys are listed right here. If you step back now and look at what you've achieved, you have actually identified interesting semantics in the data. The fact that you ended up with these three tables tells you something about the semantics of this data, about hair color and ages. What does R4 contain? What information do we end up storing there? Maybe we can give it a better name. You know what, I would call it Person: it's the base information about a person — we store the social security number, the name, and his or her age. What can you tell me about R3? What is the semantics of R3? Every age comes with its own hair color; that's the rule in this application. So we don't store that in the Person table; we store it in its own table — call it Hair: for every age, there is a certain hair color. That's the semantics we discovered. And R2 is the funny one in which we allow multiple phone numbers for the same person; that's the Phone table. Is there a question there? I can't really see — I think my screen is frozen. Is there a question at Microsoft? Yes: what if the set of possible ages is not finite? We can represent only a finite set in a table, but if the set of possible ages is not finite — suppose it's a floating-point number — then we can't really put it in a table. So the question is: how do we think about the case where the association between a key and its value is infinite? Well, think about how we got there. We started from a finite table, which was not normalized, and we split it into smaller tables. Whatever we end up putting in those smaller tables can only come from the bigger table, which was finite. So you end up storing only a finite subset — if you want, a finite subset of the graph of the function. That's what you store in the table. Rod, can you hear me? Can you please fix the image? Because I can't see. I can't do it unless I reboot the whole system. I see — then they are going to have to speak up if they want to ask a question. OK. So I can't see you moving. Does this answer your question? OK. Any other questions from Microsoft? OK, we have a question here. You may tell us later. This is an iterative kind of algorithm, though, right? For a large number of attributes, it's very taxing. Right. So the question your colleague is asking is this: think about a huge table with n attributes, n being, say, 100, and you have these functional dependencies, and now you need to do the Boyce-Codd normalization. What will be the complexity of this algorithm? Will it run in n-squared or n-cubed time? What is your first impression?
What would be the running time of this normalization algorithm — how would you characterize it more precisely? NP-complete. It turns out that, yes, if I remember correctly, normalization is NP-complete. And the algorithm we described is exponential — it runs in exponential time. Now, factorial is a particular kind of exponential growth; it's still exponential. And the fact that normalization is an NP-hard problem tells us that, unless P equals NP, we will never find a significantly more efficient algorithm. So yes, it's computationally expensive. Does this matter in practice? Well, it probably matters if you write a program that you try to sell, and your customers come back to you with tables with 1,000 attributes on which your program doesn't work. But if you are a database designer, then you will probably use a combination of functional dependency theory and plain good intuition in order to design your schema. So it's not so critical: a correct design does not depend critically on applying the theory blindly; it's also based on good intuition about how to design the tables. But the observation is absolutely right: the algorithm we described is exponential, and moreover, the problem itself is inherently complex — it is NP-complete. So, this slide essentially summarizes what I just showed you. I have another example that I invite you to study at home, this one here, but it doesn't add anything beyond what I already showed you. What you should keep in mind is that the normalization process essentially results in a tree: at the top we start from a single relation, then we split the relation into two, then one of them is further split into smaller ones, and the final design consists of all the relations at the leaves of this tree. OK, now a question for you. Suppose I give you a relation that has not 100 but only two attributes, A and B. Can you have a violation of the Boyce-Codd normal form? What if A-plus is equal to A — is that a violation? No: to find a violation, you need a set X whose closure is what? Bigger than X, but not big enough to contain all the attributes. So that is not a violation. What about the closure of the empty set being A? If the closure of the empty set were A — now that's an interesting thought. We never discussed the empty set when we chose X; we tacitly assumed that X is never allowed to be empty. But what would it mean? It would mean that A has a constant value: any two tuples always agree on the empty set — there is nothing to check — so if you say that the empty set determines A, you are essentially saying that A is a constant. But you know what, I don't want to go there; forget this, this is theory. In the textbook presentation of functional dependencies, and that's how it was developed historically, you are not allowed to put the empty set on the left. So my question has a very simple answer: no. If you have only two attributes, there can't be any violation; such a relation is always in Boyce-Codd normal form.
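Here is the whole normalization procedure from the last few slides written out as a rough Python sketch, repeating the closure function from the earlier sketch so that it runs on its own. It follows the lecture's simplified convention of projecting functional dependencies onto a sub-relation by simply dropping the ones that mention attributes outside it, and it searches subsets exhaustively, so it is exponential, exactly as just discussed. Which violation it finds first depends on the search order; for the Person example it happens to produce the Person/Hair/Phone split we derived, but in general a BCNF decomposition is not unique.

    from itertools import combinations

    def closure(X, fds):
        result = set(X)
        changed = True
        while changed:
            changed = False
            for lhs, rhs in fds:
                if lhs <= result and not rhs <= result:
                    result |= rhs
                    changed = True
        return result

    def bcnf_decompose(attrs, fds):
        """Return one BCNF decomposition of `attrs` as a list of attribute sets."""
        attrs = set(attrs)
        # Simplified projection from the lecture: keep only the given FDs
        # whose attributes all lie inside this sub-relation.
        local = [(lhs, rhs) for lhs, rhs in fds if (lhs | rhs) <= attrs]
        # Look for a violation: a set X with  X strictly inside X+ strictly inside attrs.
        for size in range(1, len(attrs)):
            for X in combinations(sorted(attrs), size):
                X = set(X)
                Xplus = closure(X, local)
                if X < Xplus < attrs:
                    rest = (attrs - Xplus) | X
                    return bcnf_decompose(Xplus, fds) + bcnf_decompose(rest, fds)
        return [attrs]          # no violation found: already in BCNF

    person = {"SSN", "name", "age", "hairColor", "phone"}
    fds = [({"SSN"}, {"name", "age"}), ({"age"}, {"hairColor"})]
    for r in bcnf_decompose(person, fds):
        print(sorted(r))
    # ['age', 'hairColor']    -- the Hair table
    # ['SSN', 'age', 'name']  -- the Person table
    # ['SSN', 'phone']        -- the Phone table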
A relation with three attributes, on the other hand, can violate it. Can you give me an example of a set of functional dependencies for which a relation over A, B, C is not in Boyce-Codd normal form? Just A implies B, right? What would be the key when A implies B? A and C. And therefore A implies B is a violation. How would you decompose it? Into R1, consisting of A and B, where A is the key, and R2, consisting of A and C, where the key is both A and C. Thank you. OK, I have another question — let me use green for this. Is a key unique? In SQL, every table has one primary key that you designate. But in the theory of functional dependencies, a key is not something you decide to make a key; it's determined by the set of functional dependencies. Can it be the case that you have more than one key — that you have two keys? I would like you to think about the simplest example, where you have just two attributes. Can you have both A being a key and B being a key? Not a compound key — I want each of them to be a key separately. Yes? B implies A. Exactly. Here is a simple example: A implies B and B implies A. Then A is a key — and you can choose it as the primary key of your table — and B is a key. SQL forces you to choose one, but in the theory of functional dependencies there is no distinction between them: they are both keys. I've done that. You've done what? A relation with two keys. Yeah — I think this occurs quite often in practice, when you simply have two different identifiers for the same record, and then you need to choose one of them. Okay. One last thought about the decomposition itself. What did we do when we decomposed? We took a table with many attributes and split it into two tables, each with a subset of the attributes. That is called a decomposition. And the way you populate the two tables, given the data for the original table, is that you just project: you project on the A's and B's to populate R1, and on the A's and C's to populate R2. Now my question is: does this always work? For example, look at this table here. We had name, price, and category, and I decided to store them in two different tables: name and price, and name and category. In name and price we get Gizmo with its price — we store that price once; we never store it twice, because relations are sets and duplicate tuples are simply not present; you only have unique tuples. So how would you describe this decomposition? Is it correct or incorrect? In other words, your initial data was the table at the top, but I don't give you that; instead, I give you these two tables. Can you uniquely reconstruct the top table from the two tables at the bottom? Yes. In this case, what do you need to do? Join them on name. And after you join them, you get exactly the table at the top. Is this always the case? Can you always choose some attributes, split the table in two, join the pieces back, and get the original table? It depends. When you get back the same table, it's called a lossless decomposition. But sometimes you don't have a lossless decomposition. Look at this example here, where I split into name and category, and price and category. What happens if you join them back? What are you missing, or what goes wrong, when you join the two tables at the bottom to try to reconstruct the table at the top? Yes?
Both instances of camera. Yes — those two instances of camera, these two, will join, because category is what you join on, and you get all four combinations, right? But that's not what we had; we only wanted these two combinations. So that's a problem: this is a lossy decomposition. So sometimes it works and sometimes it doesn't. But when we apply this to the Boyce-Codd normal form, it turns out that if the functional dependency A implies B holds, then the decomposition into the A's and B's on one side, and the A's and the rest on the other, is always lossless. If you think about what happens behind the scenes, you will convince yourself — and I actually advise you to do this at home — that once you partition the table into these two tables, when you join them back, you reconstruct exactly the information you had in the original table. This is what makes Boyce-Codd normalization work: the resulting decomposition is always lossless. OK, and that's what I had to say about the Boyce-Codd normal form and normalization. Any questions? Yes: theoretically it makes a lot of sense, but, as was mentioned earlier, some of these relationships may not stay true over time. Are there tools that actually do this, and do people use them actively? That's a very good question, and one that actually puzzles me as well: how much do people use this in practice, and how much do they just rely on good design? Clearly it offers a good foundation. Whenever you are designing your initial database schema, it's good to keep this in mind, even if you end up relying more on intuition than on the exact normalization steps. If the database changes — if the requirements change and some of your functional dependencies disappear, which is the most likely scenario — then you're in trouble anyway. It doesn't matter how you arrived at your design: if some of your assumptions change, you need to change the schema, and that is a painful process that affects your applications; all your queries need to be rewritten over the redesigned schema. So it's something you'd like to postpone as much as possible. But it's a very popular piece of knowledge. People know about it — especially about the third normal form — they understand functional dependencies, and database practitioners usually know what normalization gives you, namely that it allows you to avoid anomalies. It's like knowing calculus: very few people apply calculus directly, but it's good to know, because then you understand better what you're doing. And if anyone has insights about what you actually do at your job — whether you use normalization theory or just plain old good intuition for a good design — let me know; I'm interested to hear it. In terms of tools, there are plenty of tools out there that do normalization for you, so yes, you can find them. Do people use them? I don't know; I don't have a good sense. I haven't heard of many people using the exact, rigorous steps of normalization in practice. Okay. Please take a look at the book for next time, at the third normal form. We will have a brief discussion at the beginning of the next lecture about the differences between the Boyce-Codd normal form and the third normal form — the third normal form is more confusing. Please take a look at the book. And if there are no more questions, then have a good night. Please remember to turn in the homework on Tuesday.
And you can already start working on the second homework.