 Hello, hello everyone. Welcome to the principles of database systems. This is 544. At Microsoft, can you hear me? Yes, yes. Okay, perfect. So, let me start with introductions. My name is Dan Tsuchou. I am a professor in the Computer Science Department. I've been here since 2010. My research interests are in data management, especially in the principles and the theory of data management. I'm especially interested in large scale uncertain data, probabilistic databases, it's also in security and privacy, these are related. And that's what I'm especially interested in. We have two TAs, Jessica Leung over there, and the people at Microsoft see her. She's pretty, if you don't see her, she's right there. People at Microsoft can't see you. And Param, who is there, and again, sorry for people who can't see him. And we have technical support is Fred Vidjon, who is in the back there, and Matt McLeanley from Microsoft, whom I can't see either. He's probably in the room at Microsoft. So, this is our stuff. Regarding office hours, I'm still debating with myself and I wanted to ask you, if it's better to have a fixed time when we have office hours, or to leave this by appointment. What is your, do people have any preference? How many people would prefer to have a fixed time for office hours? My options I think are only search days after noon. If I suppose you want to have this latency afternoon, would you prefer that way? Okay. So, then I can set up a regular office hours sometimes 30 afternoons. If you want to meet with the TAs, please send them an email. So, for Jessica and Param, office hours will be by appointment. Okay. Let me try an advanced here. So, we have a web page. I know many people have checked it quite often in the last few days, it was not ready. You checked it too quickly. Sorry about this. Jessica was very busy to set it up, and if you download it to the slides, you probably got the old version. We refresh the slides every time we offer this course. So, the new slides were posted this morning around noon or close to noon. So, if you download this, please download them again. The lecture notes will be posted here every day of the lecture around noon or so, and the homework will be posted here. In fact, the first homework is already posted, and it's due in two weeks from today. I will tell you more about the homework. Any announcements will be posted there. I don't anticipate anything, but just in case. We also have a mailing list. This is a major communication tool. Please sign up on the mailing list. We only use it for announcements, for corrections to the homework, discussions, stuff like that. So, please subscribe to the mailing list if you haven't done this yet. Okay, we have two textbooks, actually. Our main textbook is by Ramakrishnan and Gerke, Database Management Systems. I did not bring it with me, but I suppose people have it. Who does not have the textbook yet? Okay, about half, actually, close to half. You can survive this course without getting the book, but there are certain parts of the material, but it really helps if you have the book. One is on predicate calculus, something I'm going to touch upon today, and I'm going to recommend that you read from the book, but it's okay if you pick it up from somewhere else. And the other is when you reach transactions, I'm going to describe areas, and the recovery system areas, and that is difficult, and it's pretty well described in the book. Beyond that, most of the material that I'm covering in class should be enough. However, I still encourage you to try to get the book if that's not too hard. You had a question? Basically, it doesn't matter if we get this over, just any decent book. I think it does matter for the material that I just mentioned. For example, we have a second textbook by people from Stanford, Garcia, Molina, Orman, and Widom, and this one does not cover transaction processing and query processing as deeply as a book by Ramakrishnan. However, this is why it matters. I don't know about other popular books by Novati, another one is by Dilber Shatz, and to be honest, I don't know how well they cover different topics. So I guess you can just try. If you have a book that you would like to use, use it and if it doesn't cover the materials the way we present it in class, then you know it's not good. So the second book, this is something that you might want to use if you are really not familiar with SQL. It has a more gentle introduction to SQL than our main textbook, and it also has a more gentle introduction to transaction processing, which I will cover in detail in class. So half of the lectures on transaction processing will be based on the second book, on the complete book by Garcia Morina, Urman and Widom. What I cover in class is more than enough, but if you want to read about that material in the book, then you need to get this book. Good, the course format we meet, as you see on Wednesday, 6.30 to 9.20. We are going to take a break somewhere in the middle, which according to my calculation is around eight o'clock. There will be seven homework assignments, and I will tell you more about them, and there will be a take home final, and I'll tell you more about it in a second. So 70 percent of the grade consists of the homework assignment, and 30 percent of the take home final. So the homework assignments, let me walk over them. This also serves as a course overview. The first one is on SQL. It's a crash introduction in SQL. How many people here do not know SQL? Okay, most of you know SQL. So this should not come as a surprise to you. However, I'm going to show you a certain subtleties of SQL that some of you or most of you probably did not know yet. So this is the first homework. The second is conceptual design. I will ask you to draw some entity relationship diagrams, compute a voice code normal form, and translate this into tables. Do a logical design following the conceptual design. And also to discover some functional dependencies in a data set. It's a fun homework, no programming here. So no programming for the second homework. Actually, there is a little bit of programming to find the functional dependencies, but it's tiny. The third is almost redundant, but I do want you to do this. This will be a small application in Java where you have to connect to two databases and issue queries from Java to these two databases. Since I know most of you have a daytime job or had a job, you probably have done this in one way or another, but I just don't want to take any chances. Everyone who takes this class must demonstrate that they know how to connect to a database through Java. So for most of you, the third homework will be very easy. There is nothing for me to discuss about this in class. It's just a homework that you need to do. The third will be about transactions. Again, no programming. This will require a deep understanding of serializability and of values recovery mechanisms. Not very deep. The questions are not so subtle, but you would need to go through the material in order to do this homework correctly. No programming, just answering some questions. The fifth is on database tuning. It's a relatively simple homework, that tries to give you a glimpse at database tuning as much as I could put together in a small programming assignment. Six, I'm going to lecture about XML and text query and text path. I know that many people need this in their job. It's good if you have one exposure to this technology, to XML, in a database class to understand the connection to data in general. XML is an instance of semi-structured data. And text query is kind of the sequel for XML. That's the way to think about this. And we will have one homework on XML and text query. And the last one is like the future. This is about data analytics in the cloud. I'm going to tell you a little bit about a query language that was specifically designed for large-scale data analytics in the cloud. I have no clue why they called it PIG. But this is the name of the language. It's called PIG Latin. Hard to find in Google. You have to type in PIG Latin query language, because otherwise you get all sorts of results. And according to people at Yahoo, this was developed at Yahoo, most of the data analytics that's been now performed at Yahoo is no longer done with MapReduce, but is done in PIG. PIG is translated into a sequence of MapReduce jobs. So think about PIG as a high-level declarative front end to MapReduce. And it's becoming more and more popular. So you probably will see a trend to replace MapReduce, which is low-level, with PIG, which is a high-level declarative language. So these are the seven homeworks. They will be due the day before the lecture. I want you to focus on the lecture, when we have the lecture, not on the homework. So they're going to be due Tuesdays by midnight. They're with a dropbox. Yes, Jessica? A slight adjustment to that. Add two minutes, so Wednesdays at 12.01 AM. Just a restriction for the turn-in software. OK. So Wednesdays 12.01. Don't make a Tuesdays 12.01 by Wednesday. On the whole check. OK. Any questions about the homeworks? OK. The final will be a take-home final. This relieves the stress. And the way I'm envisioning it is I'm going to post it on Wednesday night, the day of the last lecture, on Wednesday, December 8. It's a final. You can print it. You can read any books, if you need. And you have to turn it in by the end of the next day. Should I post it in the morning so people can have a good night's sleep? There's something that we can discuss when it comes closer to the final. We've been using take-home finals in the graduate level database class for a while. I switched the last minute in the last offering of 544p to the take-home variant. And people liked it. So this is the second time we're offering it. And now I'm going to plan for it from the very beginning. OK. So for the seven homeworks, you will work with different software tools. Most of them will be available in the computer labs here in the building and the CSE labs. But my past experience tells me that most of you will not come to work as a lab. Let me ask you, how many people here at Microsoft are planning to come and work in the computer labs as opposed to from home? Nobody. OK. So then this slide is very useful. The first and the third homework, in terms of the second, for this you need SQL Server, which is a Microsoft database system. And we have a SQL Server installed. And I have this on a different slide. It's called iProj-SRB. And I have more information about this on a different slide. You may want to connect to it. In order to connect to this, you already have an account, which is your CS account or your UDUB account. And actually, this is not a slide that describes how to access a project server. I have a different slide. However, you also need a client, a client on your machine to connect to this server. That client is either available in the CS building on the machines here. But if you want to put it on your laptop, you can get it for free from MSDNAA. I don't know what it stands for, but you have access to the site where you can download free software. As a registered student, I've been told that you have an account that is active on the first day of instruction, which means today. So you can go here. Here you use your CS account as your username. And I don't know more than this. I didn't use it. But apparently, you should be able to download the SQL Server client, which you need to install on your desktop in order to connect to iProj server. I don't think you can download from here the server. If you want the server, that's something that you have to figure out yourself how to get it from. Now, once you have the client on your machine, you can connect to SQL Server. But if you are outside of this building, then you won't be able to connect because of security reasons. You can still connect if you use tunneling, which is just a pain in the neck to use, but this is what you have to use. And we have instructions on how to use tunneling on the website. So you can read the instructions on how to use tunneling from the website. They work. They just have to follow all those steps correctly. For the third and fourth homework, we will use Postgres in addition to SQL Server. So this is a free database management system that you just downloaded and install on your laptop or on your desktop. And it's fun to use. Now you have a database server sitting on your desktop. And the third homework, the JDBC homework, asks you to connect to both of them. So now you have a connection to two database servers. And it's fun to figure out how to do joints between two different database systems. For the Xquery homework, we will use Xquery. Now in the past, we used Galax, which is with a wonderful implementation done by some researchers whom I knew. But apparently, they don't support Galax anymore. The website is still up. I tried to install it, and I couldn't this Sunday. Maybe you have more luck than I. But I managed to install Sorba, which is an implementation done by, again, a person I know, Dana Floresco. She works at Oracle. She leads a small research group that essentially does XML technology. And they developed this Xquery processor called Sorba that I managed to install on my Mac. But I should warn you, it took me almost half a day to install it, to install all the packages that it requires. So if you want to be proactive, you might want to install Sorba early. There is also Saxon, which is a very popular Xquery implementation from Apache. I haven't tried it, but you might want to go with this one. There are small differences between them, but nothing that you can't handle. The homework will be very simple. So whatever differences there are, we can tolerate them. Finally, for the last homework, we will use Pig Latin, this query language that is compiled into MapReduceJobs. You can download it from here. And again, I managed to install it on a Linux machine, and it worked. It was painful for me, so I hope you will have less pain. We will use this in two modes. One is on your desktop or on your laptop, where you can experiment. In addition, due to the generosity of Amazon, for each of them, we will have a free account on Amazon Web Services. And Jessica will send you an email about this account. Each of you will have a credit of $100, and you will be required in the last homework to run a very, very simple Pig query on multiple machines on Amazon Web Services, kind of simulating a real MapReduce analytics. The funny part with Amazon Web Services is that you have this credit of $100, but you have to give your credit card just in case we exceed this credit. Normally, you don't exceed this credit, except that when you're done, you have to shut off the server. Otherwise, it continues to run, and your account is continued to be built. And two years ago, in one of our database classes, we had a student who forgot to turn off the server, and I think he was charged something like $500 or so. We managed to cancel his debt to Amazon, but please don't do this. When we get to the last homework, I will remind you to please remember to shut down the server. So you don't have to pay Amazon. I don't know if they don't do this. They don't send you an alert. They don't terminate your job. They're just happy to charge your credit card, for some reason. OK, so these are the software tools. So here is my plan for today. I'm going to start with a brief general discussion about database management systems, and then I'm going to start the first topic, which is SQL. I'm also planning to finish SQL. So today is the day when I'm going to lecture about SQL. Next Wednesday, we will move on to conceptual design and logical design. So database management system. What is a database? Who can give me example of databases? A text file. A text file is a database. Can call it this way. Let me call it a more, give me an example of a database that more people recognize as a database. Library? Wow, an entire library. But do you call that a database? It has physical, it has books, right? Do you call the books a database? Of course, but it's terrible. It's an interesting thought. Let's move on to our topic. Let's find more traditional examples of databases. Yellow pages? Wow, you're really giving me a hard time. What about accounting? Accounting is a database, payroll. Payroll is a database. The database of students in this cloud, the database of students at university, there is a database. So it's a collection of related information that is usually stored in files and managed somehow by programs. OK, so that's an attempt to a definition of a collection of files storing related data. And we have lots of examples. I'm still thinking, is a library a database? I need to tell you the story. So libraries in the history of human civilization, they were the keepers of human knowledge. They were responsible for maintaining human knowledge. And one of their duties was to exchange manuscripts and to make sure that everybody has copies of the same manuscript, just in case one of the libraries burns down, that the libraries of the world, the global community, they maintain this information. So now, as we move into the 21st century, most of the information related to human civilization is electronic. So librarians, there is a smoothing among librarians who have written something to say, to try to extend their role of keepers of information to digital information. So it's kind of the reverse of what you're proposing. So they want to take knowledge and store it in a database and worry about archiving this knowledge over centuries, as opposed to try to think about their books as databases. Good. So let's move on to the next question, which is, what is a database management system? Yes? SQL. Sorry? My SQL. That's a good example of a database management system. That goes right here. What other examples of database management systems do you know? Well, SQL server. We just mentioned it. SQL server. Oracle. Oracle. What does IBM do? DB2. DB2. System R. System R, which was a precursor of DB2, which had a major role in shaping relational database systems. This was the first proof of concept that databases as state code invisions in 1970 can be implemented efficiently. And I will probably talk more about this in one of the coming lectures. So these are examples of database management systems. But what are they? How would you define a database management system? It's not a database. Yes? Oh, it organizes the data into a collection of tables with relationships between them. It organizes the data into a collection of tables. I would say it manages the data. It's an application. It's a large program that manages this data. It organizes it. It allows you to query it. It offers security. It offers concurrency control. It does all sorts of things for this data to be able to be used. So that's a database management system. Actually, I have a funny definition, which I paraphrased from SQL for nerds by Philip Greenspoon. What's a database management system? It's a big C program written by someone else that allows us to manage efficiently a large database and allows it to persist over long periods of time. How many people have read this chapter 1 from Philip Greenspoon's book? OK, I hope you had a good time. It's a fun read, and it's a very gentle exposure to database systems. Good. So databases are serious business. This is the market share of the commercial database systems from 2006. As you can see, Oracle leads the market, followed by IBM and Microsoft. So you wonder why don't we use Oracle? Well, it's because it's much more difficult to use, much less user-friendly than SQL servers. Also, Oracle doesn't have a good optimizer. They don't care about optimization. SQL server and DB2, they have the best optimizers, much better than the open-source database management systems. And I think I like the Microsoft optimizer best. It's the last 10 years, I think, it has overcome DB2. So this is why we use SQL server, not for any other reason. Yes? Is that basically side-based with extra stuff? Sorry, side-based? Yeah, there is also side-based somewhere in this list. What's the name of this Microsoft using side-based or variant thereof? Say again? SQL server is side-based version and plus one. SQL server is side-based version and plus one. I don't know what the details. But yeah, they did start from a certain code base. But they added innovations that come from Microsoft Research. For example, auto-tuning. Auto-tuning was a project that started at Microsoft Research. It was led by Sergey Chaudhury. He actually got the 10 years best paper award in 2008 or 2009. I forgot. No, sorry, in 2007. He got the VLDB 10 years best paper award in 2007 for the work he did in 1997 when he initiated this project at Microsoft Research on auto-tuning. The idea is to automatically propose indexes and ways to organize your data to improve the performance of a query workload. It's what a database administrator is supposed to do. It's very difficult. And he started to do this automatically. It's now a SQL server. And since he started this project at Microsoft, all the other database vendors, they have something similar. They have a similar capability. So as far as I know, SQL server is an innovator in the area of database management systems for this and for other reasons. IBM also has wonderful innovations. Oracle is focusing on transaction managing. And I have a funny story to tell you when you reach transaction management about Oracle's approach. Good. So what I'm going to do next, I'm going to show you an example of a concrete database, which is the IMDB movie database. And this is what we are going to use in our first homework. I suppose how many people here have visited this website, the ever-visited IMDB movie database? Most in this room, nobody from Microsoft. Microsoft doesn't. Some people. So it's a rich collection of information about movies. It turns out you can actually download this data. It's hidden somewhere there. They don't make it very easy. But you can download it. And after some cleaning, you can store it in a database. And this is what we did. And we got about 800,000 actors, 400,000 movies, and some directors and a few tables that we now store on SQL Server. Your first homework will ask you to answer nine specific questions about movies and actors stored in this database. So let me show you a little bit how this data looks like. In any relational database, including in this example, data is organized in tables, in flat tables. For example, we have a table called actor that has four attributes or four fields. The actor ID, this is the key. The first name, the last name, and the gender. My R wrapped around here. For example, Tom Hanks, male. And this is his ID. It's very simple to read. Think about this table. It's huge. This is like 800k actors. Similarly, you have movies like Toy Story made in 1995. We have an extra attribute here called rank. And I don't know what it's supposed to do. Most of the entries are null. So we don't use rank in our homework. But you will see it in the database. This also is huge. It's about 400k. But now, how do we found out which actors played in both movies? Well, for this, we have another table that connects them. And this is a table cast with an S at the end. Can't use cast because it's a reverse serve keyword. So cast that simply tells us for every person ID, like the ID of Tom Hanks, in which movie it played. So in terms of size, how do you expect cast to be compared to actor and movie? Smaller or bigger? I just want to see that you're on the same page as I. Should it be smaller or bigger? Most bigger. Bigger. Much bigger. Every actor played in some movie. Every movie had some actor. Probably have many, many relationships here. And cast is bigger. So this is how the database is organized. This is a relational database. Let's see how we query it. We query it in SQL. Here is a simple example. We say select star. This means give me all the attributes from actor. And when we run this query in a couple of minutes, we will get all the actors. We can also count. So getting back here. In this example, we get about 800k answers, or the actors in the database. You probably don't want to look at all of them. Another way to query the database is to aggregate queries. Select count star gives us a number of actors. And I will show them to you in a second. Here is a slightly more complicated query. This finds the actor whose last name is Hanks. Probably you don't know its ID, unless you remember what I wrote on the previous slide. But this is how you can find the actor by his last name. And an even more complicated query. This is how we find all movies where Tom Hanks plays in 1995. Well, how do we do this? Well, we need to look in actor. We need to look in movie. And of course, we need to look in cast. Talk to that. Well, the last name should be Hanks. The year of the movie should be 1995. And in addition, this actor must have played in that movie. So we say that the actor ID is the same as a person ID and cast. And the same cast movie ID is the same as the movie's ID. I know I went a little bit quickly, but I suppose most of you have seen SQL and some for more than one another. We will discuss this to some greater depth. But right now, I want to show you a demo. So Fred, would you mind switching, please? OK, so what I will show you now is how to connect to the database system to SQL server and how to run simple queries on the IMDB movie database. So what you need to do, you need to start running, or you can see here with some delay, you need to run the SQL server management studio, which you need to find on the programs. Yes? Is that the client that we are supposed to download from MSDNA? This is a client that you're supposed to download. What I'm doing right now, I have a remote desktop connection to a Windows machine called ARIA, I think, that has this installed. I'm not using Windows, I'm using a Mac, so I have major difficulties to connect to the database system. I can only do this through a remote terminal. I hope, how many people here do not use Windows? This will be a problem. I can't see actually what happened to Microsoft. I suppose nobody raised their hand there. So what can you do? Well, obviously you have VMware. You can run Windows in emulation. The other possibility is for you to do exactly what I'm doing now. Remote desktop to ARIA or AQUA. You have access to one of them. I don't remember which one, ARIA or AQUA. Yes. AQUA? No, no, no. I was wondering, ever since I learned that the name of the server was ARIA, I was wondering whether it was related in some way to the series of the same name. ARIA? I'm wondering if it's actually based on offer terms. I have no clue how they came up with this name. This is supporting for the function. It has been there for many years. I don't watch the TV show, but I can tell you that ARIA has been around for at least five, six years. It means ARIA. Sorry? It means ARIA. It means? Yeah, yeah. ARIA means water. OK, so this is what you need to do then. One way or another, you need to run the Windows, and you need to have this client installed. So you launch it. And now we need to connect. Here, when you connect, you type the name. Can you see? You type exactly this name, I approach SRV. Or if you do a remote connection, if you do tunneling, then you need to follow the instructions from tunneling. I think here you type a fixed IP address if you do tunneling. You will choose not Windows authentication, but you will choose the other authentication, the SQL server authentication. Because this is how your account is configured. And then you will type in here, which I can't see. You will type in your e-mail account. And then below the password that I will show you in a second, not in a second, like a more like in 10 minutes. And then I would ask you to change your password immediately. Once you type this in, then you connect. And that everything goes well. This is what you see. There is a delay on the screen compared to my tablet PC. So you see on the left this hierarchy of objects. The top one is of interest to us. These are the databases. So if you expand it, you will see IMDB. This is the IMDB database that we are using. Further expand it. And you can see tables. And if you further expand tables, you can see the six tables that we have in this database. I think we use all of them in the homework. Good. So now in order to run queries, you open a new query on the top left. You open a new query window. And now you can start typing queries, which I'm going to do right now. So start by saying use IMDB. And then F5 actually executes this. So now it knows that IMDB is the database to be used by default. I'm going to erase this. So let's start. When you first talk to a database, the first thing you would like to see is what's in there. You can also find out what tables are there. In the case of SQL Server, it's easy because these are the tables on the left. If you're running Postgres, there is a command I think it's called backslash dt that shows you all the tables. Once you see the tables, you would like to see their schema, their attributes. For example, how do we find out the schema of actors? We say select star from actor. And we run the query. And it runs. We got the first answers pretty quickly. But if you see it's the bottom there, it's still executing. And it's still executing. And it's still executing because there are about 800,000 actors that need to be displayed. And that's what takes a long. It finished. And we have, is this correct? Can you reach here at the bottom? Can't parse it. It says 1,800,000. OK, I was expecting 800,000 rows. I don't know what. Do we have right access to the database or only read access? You have only read access to the database. If you're, I have a spare copy. But please don't try to modify it. If you lose it, then it's going to take a few days to recover it. It's not trivial. OK, so a simpler way to find out how many actors there are is to replace star with count star. And again, you can execute by pushing the execute button here or F5. So let's push the execute button. Oh, I was wrong. There are not 800,000, but there are indeed 1,865,000 actors, so more than 800,000. Good, so let's see something more interesting then. Let's see the query that we just discussed. Actually, I see how many movies there are. This is movies or movie? 1,500,000. Let's see how many casts there are. This is 11 million, right? If I parse it correctly. The fonts are very small. Can you see on the screen? 11 million. So a pretty large collection of items. Now let's see the difficult one, the difficult query. So I want to find all movies made by Tom Hanks in 1995. So I need to go to actor, cast, movie, and then say this. Where? The last name is Hanks and the year is 1995. And. So now we need to do the join. We need to say the actor PID is equal to, it's not PID. Actually ID is equal to, actually here I don't need to type it's PID. And MIT, which I know is the attribute from cast, is equal to movie ID. So how do I know all these attributes? How do you know the attributes? Well, you simply type in select star from cast, and then you find the attribute in cast. And you do select star from movie, and you find the attribute in movie. Once you find them, then you can write the query. And we run it. And you know what? It's done. It didn't really take as a 12 seconds, it took to return the 1,800,000 actors. It took more like, oops, 0 seconds. It's below, is there an error? Let's run it again. F5, look, it's instantaneous. So what's the trick? What does SQL server do to go over 1,800,000 actors? And like 11 million casts, and so many million movies, and find the right matches. What is your best guess? What is the technology that would help us with it? Yes? Indexes on tables? Indexes? Is it cash? Maybe cash, cash. Would the cash help? On the second one. It would help it on the second one. But not the first one. And the first one was, well, we did run, we did read all the actors. So that might have helped. Those might have been in the cache. But trust me, this query takes under 1 second, even with a cold cache. So this is quite amazing. So the technology, the database is used to answer query so fast. Our indexes, query plans, they select the right way to answer the query based on the structure of the query. And database statistics. So let me show you a little bit. Let me show you the query plan first. To see the query plan, and this is something that we will discuss at some length, probably towards the middle of this quarter. So to see the query plan, you click on this item, on this button. And now you can see it's actually very hard to read on a small screen. You can see here what we call a query plan. This tells us the order in which this query was executed. So it starts here with an index seek. It actually says an index seek. And then it does a nested loop join. We're going to discuss later what this means. And then it's going to do a key lookup, again on an index, and so on. More joins, more index lookups. So the optimizer inside the database engine needs to choose this plan. It needs to decide in which order, in what order to execute these operations on the tables in order to answer the query and do this fast. So let me show you a more simple query so you can really appreciate how difficult this is. And I will show you a very, very simple query that finds all movies made in 1905. But I have difficulties with the return here. It's autocompletes, sometimes useful, sometimes it's not. So let me run a five. And we get all the movies made in 1905, not a major surprise here. Now let's look at the query plan. How do you expect SQL Server to have answered this query? But it's not a big mystery. It's going to use an index on the year. So let's see that index. And here it is. If you can read this, it says index seek non-clustered on movie year index. That's what it does. It does a look up on the index. And this is how we can find quickly all the movies made in 1905. So this was expected. But now let me show you something really surprising. I'm going to also look up the movies made in 1995. Let's run it. We got the results. Now let's look at the query plan. Well, it says here index, but this is not the index on year. This is not really an index look up. This is a regular table scan. So wait a minute. It's the same query. One looks for all the movies in 1905. And SQL Server decided to use an index. The other looks for movies made in 1995. And SQL Server decided it's not going to use the index. Why? And how could it make a difference? Yes? In large, if you can't live late, 90% of your average, you don't use the index. All right. But it's an answer that's too expert for what I was looking for. Was it because previously we already looked up on the movies that had the years 1995 in a read cache somewhere that just read users? It's not about caching. It's about the number of answers. So let me show you again the let me run this. Look at how many answers we get. Can you read at the bottom? 20,000. There are 20,000 movies made in 1995. But how many movies are made in 1905? 437. That's the difference. SQL Server decided that it's fast to access the index 437 times in order to answer this query. But it would have been a waste of disk accesses to accesses 20,000 times to find all the movies made in 1995. So an important component to the database management system is the data-based statistics. And the third component of a database optimizer, in addition to the search space and the optimization algorithm. And we will discuss this towards the end of this class. OK, any more questions about SQL Server before I switch back to the slides? You have a question over there? OK, Fred, can we please switch back? Good. So I want to give you a glimpse of the query plans. And for that, I want to ask you, what are the possible ways to answer the query about Tom Hanks on the three tables that you see here on the slide? So again, the query is find or movies made in 1995 by Hanks. And in addition, the ID from here must join with the PID from cast. And the MID must join with the ID from movie. How would you answer this query? With indexes, with everything you have? You're looping over the records. You can loop over the records. Let's start from where do you want to start looping? You see, there are two choices. We can loop over actor first, or we can loop over movie. Let's suppose we loop over actors first. So we look at all the actors, and we can skip all of them, except those that are called Hanks. Once we find a match, what do we do? So now, we are looking at this record. What do we do now? Join on the cast. We jump. Now, we need to be a little bit more low level. What exactly do we do? Sure, we want to join. But how are we going to do this join? We're looping from action movie. Perfect. So now, we loop over cast. We look for a matching ID. Once we find this matching ID, so I'm looking at this record, and I'm looking at maybe this record, and they match. Now, what do we do? MID, then match against the movie. So now, we iterate over the movie, and find the matching or the matching IDs here. And when we find them, then we can return that up, and continue the nice loop. How can you speed this up? Do one pass through actors, collect up all of the IDs for the actors with hands. Do one pass through movies, collect up all the IDs for movies with 1995, and then do a pass through cast, first matching against the set of actor IDs and movie IDs. That's a great idea. So first, collect all the actor IDs that match, then collect all the movie IDs that match. I'm going to think it's a database system to do this. I'm sure there is a way to express it in a plan. And then, find and cast all the pairs that match. Actually, no, they don't do this. And I can tell you why. Because how many matches of Hanks do we have? We actually run this query. Do you remember? 400. So there are about 400. How many matches of movies in 1995? 1,000. Sorry? 1,000 to 400. Movies in 1995, we had 20,000. We had 400 in 1905. So if we were to collect these two sets, which are much smaller than the actors and movies, then we need to compute that Cartesian product and then check for every pair if it's in cast. And probably if you do the math, it's not beneficial because the numbers are high. If they were lower, then that would be a good plan. So what I want to show you now, I just wanted to initiate a discussion on the choices that the database system has in order to answer the query. We represent these choices much better than what's here. This symbol is wrong. It should be a join, and it is a bow tie. That's a join. So this is how we represent plans. And of course, SQL Server has a more professional graphics display of the plans. But in textbooks and on the slides, we represent plans this way. What this means is you start from actor. You find all actors whose last name is Hanks. Then you join with cast. At this level, we think about join, which could be a nested loop join, which means we will iterate over cast and find the matches. Or it could be something more clever, like using an index lookup. Maybe we have an index on cast PID. And then you will use an index lookup. Then this would be an index join. And then the results. We join them with the movies that are made in 1995. Or we can iterate over movie first and find the cast. And then the results will be joined with actor. Two different plans that the database optimizer might consider. How would you represent the other plan? I'm kind of jumping here like six lectures. But somebody in this room proposed another plan, which was very interesting. Namely, iterate over all the movies, collect all the movie IDs made in 1995, iterate over all the actors, and then take the Cartesian product and look up if they appear in cast. How would you represent that? What diagram would be cast and find? So we start. There is actually a technology to do this nicely. So we start at actors. And we do our selection of last name equals hanks. And then we look at movies. Do our selection of year equals 1995. And then what do we do with them? We join them, which essentially means Cartesian product. We join them on nothing. We just do a Cartesian product. And then the result, we do join with casts. Some database optimizers, they do not look for such plans that have Cartesian product. But sometimes, actually, they are beneficial. Sometimes this Cartesian product is smaller, very small. Very small compared to the table is going to be joined against. And some database systems do search. Actually, I don't know if SQL Server considers Cartesian product or not. Usually, the evolution is to do consider them. So that was my overview of database optimizers. So they use indexes. They use query optimization. They use statistics. And they use different physical implementations for the various operators. This is one thing. The database management systems do well. Here is another thing that they do very well. That's something you might remember from Philip Greenspun's book, SQL for Nerds. They hand the recovery. And actually, this is an example that's quite similar to what Greenspun had in his chapter. Imagine writing an application that moves money from one account to the other. And you can write it in Java or in Python or in C++ or in C Sharp. But it essentially looks like this. You read the amount in the first account. Then you subtract 100 here. And then you write this back into the first account. This is the right. And then you go to the second account. And here you add 100 and write it back. And it's a perfectly correct program because the sum of the two accounts is constant after you finish running the program. But something very bad can go wrong here. What can go wrong? Yes? Crash in the middle. Crash in the middle. Right here. If the system crashes, then we're in big trouble. Actually, why are we in big trouble? The files are still there. You can simply read the files. You have the files with the accounts. So why not? What happens? How do those files look like? And why do we say that we are in big trouble? Those two are not atomic. Only one of the unpart of the transactions occurred. Yeah, it's not atomic. That's correct. But think more naively. Think about examining the pile of accounts. And you see account one. There is an amount. You see account two. There is an amount. And you don't know where it crashed. You don't know if the transfer happened or if it didn't happen. And you're in big trouble. I mean, you would need to contact the customer. But of course, in a database system, you don't want to do that. You want to ensure that the entire transaction, the entire sequence of instructions, is either completely executed or not at all. So that was recovery. Here is another functionality that the database systems offer. And that's illustrated by this example. Here is an application that allows you to withdraw money from your account. $100 at the time. It's a very simple application. The only interesting thing here is that it needs to check that you have $100. And if you don't have it, then it's not going to dispense money. Otherwise, it's going to dispense money and you can take some money and go away. And the same application, of course, needs to be run on every ATM that dispenses money. So now here is how you can overdraft your account. You run on two ATMs. You try to withdraw money at the same time. And you don't have $100. So what can happen? You can go past the statement. And then at that time, both of the amounts are above $100. However, if you have $100 and you're inside each statement, then the ATMs will dispose $100 each. So we have $200 disposed. Exactly. And I actually misspoke. So you corrected me well. So I assume that you do have $100, but you don't have $200. You only have $100 or $101. So then both programs can get to this point. Both say, yeah, there is enough money. Once you pass this point, then both dispense money and you're going away with $200 while you did have $200. Yeah, very bad. This is something that database systems can prevent. And they do this through concurrency control, which is something we will study in great detail in a few lectures. So transactions are a concept that were invented in the framework of databases. And it's a wonderful abstraction. A transaction is just a sequence of instructions that are supposed to be executed either or completely. So all their actions are reflected in the database or none is executed at all. And this property should be invariant no matter what happens to the system. If it crashes, if multiple copies of that transactions are executed on the same database and so on. At a more detailed level, transactions are said to have the asset properties. And assets stands for atomicity. Automaticity means either it's executed completely or not at all. Isolation means that the effect of this transaction on the database is as if this were the only thing that executed at that time. Even if you execute multiple transactions concurrently, the effect of each of them is as if that were the only thing executing on the database system at that time. Consistency is actually no longer considered a property that needs to be enforced separately. It follows from the previous two. Consistency says this, if at the beginning of the transaction your database satisfies certain constraints, like in every account the balance should be positive. You don't allow negative balances. Or the total sum of all accounts in this bank should be over 1 million or whatever. In addition, when you program, when you write the transaction, you are careful to maintain that constraint. Then at the end of executing the transaction, the constraint should still be satisfied. But how can it not be satisfied if we are careful to enforce it? It can fail to be satisfied in two ways. Either the system crashes at the wrong moment, or there are multiple copies of that transaction that execute and they affect interferes, and then they no longer satisfy the constraint. In other words, if your transaction is atomic and is isolated, then it will also satisfy the consistency property. Endurability has actually two interpretations. And in the literature, the one interpretation that I saw is just says that all the updates must be written to disk. But another interpretation is that it's part of recovery, in the sense that once you declare the transaction that's being terminated, then the system, even if it doesn't write them to disk, it should at least write them in the log, that's what it says. So that you can later, when you recover, you should be able to get to a state in which that transaction left the database. OK, so that was just an overview. We will discuss transactions in more depth in a few lectures. A final thought about the architecture of database management systems. From the 70s on, database management systems have a client-server architecture. That means that all the files that form the database, they are managed by this C program written by someone else that restricts all the access to those files. It only allows you to access the files using SQL. It enforces security. It guarantees concurrency control, recovery, and so on. So this server is usually run on a big server, or even on several servers. It can be a parallel database system. But it can also run on your own machine. It can be a Postgres server running on your machine or MySQL. Now in addition, there are many applications or clients that run some code. It could be Java, it could be C-sharp. And that from time to time, connect to the database to the server and ask for data, either ask for data or require the database to be updated. This is a client-server architecture. And usually, the protocol between the client and the server is either ODBC or JDBC. And we will not study them. They are low-level protocols about what goes up and down the wire, and they are not of interest to us. For the second homework, for the third homework, we will use JDBC, because we will use Java. Good. Now there are two ways, two major types of usages of a database system. One is OLTP. This stands for online transaction processing. In this usage mode, the queries are usually simple point queries, and there are many updates. Think about a bank. What does a bank need to do? It gets a customer's ID, and it reads its account. It's one access to one table or maybe to two tables. And it does an update. It allows the customer to deposit money or to withdraw money. It makes a point update. OLTP stands for online transaction processing. And the idea is that there are many applications that issue many simple queries, often with updates. And the issue is how do you handle efficiently concurrency control and recovery in such a system? The other usage mode is decision support. This is when we want to find out how the bank is doing. What was the profit last quarter compared to the profit a quarter before? What is the profit that we get from customers whose income is between such and such in this year compared to the profit we got from customers with the same income last year? These are queries that often do many joints that touch a large fraction of the database, and they have a different usage pattern of the database system. It's a good practice to have two different databases for these two different users. You could use the same one, but the performance requirements are quite different, and the physical tuning is quite different. So therefore, applications often keep them separate. The database that is designed and tuned for decision support is called the data warehouse. And I don't think we will have time to discuss data warehouses in any depth, but I will mention during query optimization some specific tricks that you need to do for decision support queries. Now, the trend that you hear often today is this movement, the NoSQL movement. And they have this sign as a SQL cross. And the belief is that something is wrong with a relational database management system that they are too slow. Usually, the reason why NoSQL databases are faster is that because they give up some of their functionality. They either don't handle concurrency control because the database system is embedded in the application, so only one application handles it. Or their transactions are one record at a time transaction, so they do handle concurrency control. But they don't allow you to move money from one account to the other. You have to do this in two separate transactions, and then it's up to you what happens if it crashes in the middle. And I'd like to have a short discussion next time. But for that, I would ask you to read this blog. It's two pages long. We posted it on the course website. It is by Mike Stonebreaker. You might have heard this name. Mike Stonebreaker was a professor at Berkeley since the 70s. And he was the author of Ingress that later became Postgres. He's now a young professor at MIT. He's extremely outspoken in terms of database technology. And his papers and blogs are usually a pleasure to read. This is a very short blog where he describes this trend, this NoSQL trend, and explains very clearly what is meant by that and what you're getting if you go that way. I'd like to have a short discussion about this next time in class. And the final thought, data management is a much larger area than database management systems. If you look at our major conferences, which are SIGMOD, BLDB, PODS, ICDE, very few papers, maybe one fourth or maybe one tenth, are about research related to the database management system, to the optimizer, or to query processing, or to transaction management. Most research is about other large-scale data management problems. And just to give you a flavor, I have a little quiz for you, which comes from stream data processing. And here is a quiz. Imagine two servers, two computers. Alice calls them Alice and Bob, like in security. And Alice sends Bob numbers. The problem is it sends Bob lots of numbers. It's not like 1 gigabyte worth of numbers or 1 terabyte worth of numbers. It's like 1 zillion numbers. But Alice is careful. She's going to send only distinct numbers. She's actually going to send all these numbers between 1 and 1 zillion, except one. One number is missing. And your task is to write a program that finds out the missing number. You don't have to write a program. You just have to figure out how to do this. OK? It's actually quite simple. Think about this during the break. Once you solve it, then make it a little bit more challenging. Now you're missing 10 numbers. So how can you find out at the end of this process which are the 10 numbers that are missing? How can Bob find out which 10 numbers are missing? OK, it's a data. It's a kind of a problem that data management worries about. Does Bob have a list of all the numbers that he received? That's a good idea. But what prevents us from doing this? So can Bob store the list of all the numbers that it receives? So my computers have about two or four gigabytes of main memory. So let's say two gigabytes. How many numbers can you store in two gigabytes of main memory? 10 to the power? Give me a power here. Mega is 6, giga is 9, right? So 10 to the 9. 13. 13? 12? So mega is 6, no? Mega is 6, giga is 9. 10 to the 9. What do I do wrong? 10 to the 9. Two bytes numbers, OK? Maybe you have a more beefy computer than is 10 to 12. Maybe you have 1 terabyte of disk, then you have 10 to the 12. But here, I hope I put more zeros. If not, then just add them. So that's a challenge. There is not enough space either on disk or on tape or on anything to store all the numbers. You still have to find the missing numbers. Can you store their sum? Well, you're giving out the idea. Then you do the next challenge. Do this one. That's a solution, yes. Of course, you compute that sum. And then you know what the sum should be. And then you find out which one is missing. Now think about 10. OK, this is not going to be our topic in this course. Our topic is going to be SQL. But I'm going to give you a break before we start SQL. But before that, here come my instructions on how to access SQL server. I should have put them earlier. So again, the name of the server is this. I approach SRB. We need to use SQL server authentication, not the Windows authentication. And the password is CSEP544, the exclamation mark. Once you log in, please change our password. And in order to change your password, you have to go to security, I think. You will prompt your password log. We will prompt you to change your password. Good. OK, so this finishes my overview. I think it's a good time to take a break. And in the second half, we will go. We essentially will recover the data manipulation language in SQL, except for updates. Any questions so far? OK, then let's take like five minutes. Are five minutes enough? Five minutes break? At Microsoft, five minutes is OK? OK, short five minutes break then. OK, so let's start then. We still have a lot of material to cover. But my plan is to skip over the simpler part of SQL so we can actually reach the more difficult part. And if you want me to go slower, please interrupt. Please ask questions. So here is what we will cover about SQL basics. So this is where I will skip lots of stuff. Then I'm going to move to the hardest part, which are subqueries. And this is hard because the right way to think about, let me restart this, this is hard because the queries that we express with subqueries, they are more naturally expressed in first order logic for what is called the tuple calculus. I will show you examples of the tuple calculus. These are expressions in first order logic. And I strongly advise you to read in chapter 4.3 because I don't have time to cover the tuple calculus in any detail in class. Then we will discuss aggregates. They are simpler, but you need to know aggregates pretty well coming out of this course. And then we will discuss nodes and outer joints. SQL. It's an old language. It comes from the 70s. The syntax is a little bit archaic. It's a little bit old. But the basic principle behind SQL is extremely important today, which is this separation between what we want to get and how to get it. It's essentially the notion of a declarative language. And we see this extremely well now in this debate about map reduce and how well suited is map reduce for large-scale data analytics. Well, it's not so suited. Bigger is much more suited. But bigger is still not there yet. Again, at the level of large-scale parallel data analytics, we are going through the same steps as database systems went in the 70s and early 80s from a procedural to a declarative language. SQL has two parts. There is a data definition language which allows you to create tables to update and to modify the database. We will not discuss this in class. So please read. You need this in homework 3. There is no reason to cover it. You just read the documentation or read from any book how to create tables and how to update. The difficult part is data. It's not the difficult. The interesting part is the data manipulation language, which is what you saw already, how to get data from the database. And that's what we will discuss next. SQL has an extremely simple data structure. Only flat tables. There is no nesting, no pointers, no trees, no lists, no graphs, no nothing. Just flat tables. Here is a quick and we will discuss next lecture why all the tables are flat and why we don't have pointers, no lists, no nesting. A quick terminology, we call these things either tables or relations, same thing, table or relation. These are called either attributes or columns. Again, it's the same thing. Every single entry here is called the tuple or a row, a row in the table. And every table in SQL must have a key. What is a key? We say that p name here is a key. And what do we mean by when we say that an attribute like p name is a key? It uniquely identifies a row. It uniquely identifies a row. Another way to say this is that if you look at the values in this column, all these values must be distinct. By contrast, if you look at the different column here, you don't have distinct values. You have some repeated values. So this is not a key. OK? Does it? But you can find a key as multiple attributes, yes. So you could have a key which is all of the other table. Right, sometimes no attribute is a key. But if you combine two attributes together, this is when you get a key. It's absolutely correct. As a good practice, you should usually design a single attribute that is a key in a table. But it's not required. SQL accepts multi-attribute keys as well. Good, data types in SQL. So what can these attributes be? Well, there is a rich list of data types that every SQL implementation supports. They are often incompatible between implementations. And these data types, again, they are inherited from the 70, so they are not aligned with the more modern, the better designed data types in Java or C-sharp or F-sharp. But we have to live with them. And whenever we need them, you just need to read the documentation. So we will not discuss data types. Simple SQL queries. Any reason to discuss a query here? What it does? It goes to product and finds, it reads only those products whose category is gadgets, and then returns all the attributes. So if you start with this product, with this set of products, then you get this set of answers. Same attributes, but only a subset, namely only those that are in the gadgets category. But this is called a selection. We select from this table, we select a set of rows that we want to return. Sometimes, we don't want all the attributes. We only want a subset of the attributes, and then we list those attributes in the select clause. And as a witness of the arcane syntax of SQL, this is actually called a projection. It has been called a projection from early on. I don't know why they put a select keyword here. It's bad. It's not the right choice of a keyword, but that's what SQL has. It's called select. It should have been called project. So in this query, we only return products whose price is over 100. So look at them. They have price over 100. But we only return three attributes, namely the product named the price manufacturer, not the big deal. More details. It's case-intensive, a legacy of the 70s. However, remember, the strings are case-intensive. Another legacy from the 70s. We use single quotation mark, not double quotation mark. Good. SQL is peculiar in yet another way. The tables are sets. I forgot to mention this, but probably you know this. They are not ordered. Collections in SQL are not ordered. The order is implementation-dependent. You should not rely on the order. However, the tables are sets. There are no duplicate elements in the table. But when SQL executes, the results might contain duplicates. And that's what you get in your application. You might get duplicates. If you want duplicates eliminated, you use a distinct keyword. So without distinct, you get, if you look for all the categories, then you get gadgets twice. If you use a distinct keyword, then you get gadgets only once. And I said that SQL is not ordered. I mean, the tables in SQL are not ordered. However, when you retrieve them, you have the option of asking them to be returned in a particular order. And that's what you list in the order by clause. So this means you order by price first. And at equal prices, you order by the product name. And order can be increasing or decreasing. And there is a keyword that allows you to control whether it is increasing or decreasing. Any questions so far? Please read this slide at home. It shows you the interesting interaction between distinct and order by. And try to figure out what these queries return. If you can't figure out a speak up, tell me next time or send an email. They have the problem is that this distinct interacts with the order by. And you have to think about what this means at home. Now let's move towards joins. And for that, I need to tell you a concept that we will study in more detail in the next two lectures, which are keys and foreign keys. We saw what a key is. A key is an attribute that uniquely identifies a record. Or a row. A foreign key is an attribute whose values refer to a key. So when we say that an attribute is a foreign key, we essentially make a promise that every value we store in that attribute will be present as a key in the table where it refers to. Sometimes people also call this a semantic pointer. It's like saying every product has a pointer to the manufacturer of that product. But this pointer is not a machine pointer, but it is a name of the company who manufactured that product. So think about this key small work as being a pointer right here. OK, that's a key foreign key. Now how do we query, how do we express in SQL a query that looks for that case is such a pointer. That follows a pointer. And we do this through a join. So I'm going to go actually a little bit faster through this. I think you've seen this at some point. So this query finds all products under $200. So the price has to be under $200. Manufactured in Japan. So the country and the different and the other table should be Japan. And here manufacturer is a foreign key into the company name. So the query does exactly this. It iterates over product and over company and introduces the right selections. Country is Japan and price is under $200. And then it has this join between the foreign key and the key. And here is a picture to help you conceptualize what's going on there. It connects. It does this key foreign key join between manufacturer and company. And in addition, it selects only those that are in Japan and only those whose prices is over. It actually should be here with the US red. Whose prices over under $200. And returns the result. Should we spend more time on this or can we move on? Good. A matter of syntax, you can often refer directly to the attributes of the tables. But sometimes if you have ambiguities, then it's good to introduce two double variables. Let's look at the last query here. So we join person with company. But we give person a name. It's called X. And we give company a name. It's called Y. And we say that. And then when you refer to attributes, you can specify whether it's X or Y. That's called the tupper variable. And as a shortcut, we can use the name of the relation itself as a tupper variable. Actually, I do want to go over this example here. It says this, find all Chinese companies that manufacture products in both a toy and both, something is wrong. I think that this slide is the right slide. So find all Chinese companies that manufacture products both in the electronic and the toy category. OK, let's work over this together. So I want the category to be, well, electronic and toy. And I want the country to be China. So let's write this query. Clearly, I want to go over product. Let's call it X because it's easier to refer to it. As you can see, we have an option of putting as, the keyword as, or not including it. And I'm going to go over company Y. And obviously, here I'm going to insist that X dot manufacturer is equal to Y dot CNAME. And I'm going to write N like this, it's shorter. And the country is equal to China. And we're stuck. Now we are stuck. Because now, we need to check the category. Well, we want all Chinese companies that manufacture products in the toy category and in the electronic category. So what am I going to test now? And categories. OK, so X dot category, let me actually use, let me use right here because this is where it gets hot. And X dot category equals or in? In. You want in. In is permitted. In and then you can figure out a collection. So it's either this is in toy, comma, electronic. OK, so now let's discuss. Is this a query that we want? Yes. Because it'll store Chinese companies that only make toys or they only make electronics. Exactly. This will return Chinese companies that only make toy products. Yes. Either toy or electronic, it's not. It must be both electronic and toy. So it has to be both electronic and toy. Yes. I'm bringing another copy of product. Right. So the current from clause, we cannot answer this query. Because you have a single product, we are currently looking at a single product X. A single product X made by this company cannot be both in electronic and toys. We need a second product. Let's call it Z. And this one of the products will be one of the products X category is going to be electronic. And the Z category is going to be toy. And of course, we also need to join. So Z dot manufacturer equals Y dot C name. So if you are not an expert at SQL yet, you will be soon. But this is one query to look at. Sometimes you need to iterate with two or even more two variables over the same table. And you need to recognize when that's the case. In database theory, we call this a self-join. It's a self-join because product eventually joins with product. You have to join product with itself. And it makes the theory harder in certain cases. It makes it much harder when you query probabilistic data. These self-joins are a nightmare without self-joins as much easier. Good. So now I can give you the former semantics of SQL. It's simple, but it's important that we go over this. So if we have a SQL query like this, this one here, which says, select from tables R1, R2, Rn, select all those entries that satisfy some conditions, and return the attributes a1, ak. What this query means is the following. It's like iterating over all the x's in R1. For each such x iterate over all the x's in R2. And for each of them, so on, iterate over all the x's in Rn. And now that you have a handle on a tuple from every single relation, check the conditions. And if these conditions are satisfied, then you add your current tuple to the set of answers. And if not, then you ignore this combination of tuples. And you continue. This is the semantics. It doesn't mean that this is how SQL has to be implemented, but this is what every implementation must return. Something that is equivalent to this semantics. Sounds very simple. And this is why I prepared a slide to show you that it's not that simple, but my animation doesn't work. And this is my standard. So let me go slower a little bit. So tell me this. What does this query return? Can you speak up? OK, let me ask you distinct. My intention was that A is a single attribute of the relations R and S. Think about A as being the key of R. So I actually shouldn't have distinct. What does this compute? It itself was total. Intersection of R and S. This is what it computes. R intersected with S. It finds all entries that are present in both R and S. Now let's look at this one. R, S and T, where R dot A is equal to S or T. Or R dot A equals to T dot A. This R intersection, what? R intersection S union T. Correct? No, this is wrong. Doesn't return this. It should return this. I mean, this is what it seems to return. But there is a major subtlety here. And you can discover it if you think about the former semantics. It'll return duplicate A's if they're both S and R. It will do return duplicate, you're right. But that was not my point. But thank you for reminding me. I should insert here distinct. So now it doesn't return to duplicate anymore. But it still doesn't return this result. What happens if T is empty? Well, what is this expression R intersected S union T when T is empty? It's just R intersected with S when T is empty. If, however, think about the nested loop semantics of SQL. What happens to that semantics if one of the queries is empty? It's right here. Then everything is empty. If R2 is empty, it doesn't matter what happens to the other. Then the result is empty. And that is exactly what happens in our case. If both are non-empty, if both S and T are non-empty, then this query returns this expression. R intersected S union T. But if one of them is empty, then it returns the empty set. And the only way to understand this is if you remember this boring formal semantics on the previous slide. OK, so these are the simplest queries as election projections joins. Joints introduce duplicates, but we are not going to skip these slides. Now let's look at more complicated SQL queries, namely SQL queries with sub-queries. SQL initially was a very restricted language, but over the years it has evolved. And now all major implementations, and my SQL is one of the last ones, support sub-queries. And the sub-query may occur in a select clause, in the from clause, or in the where clause. And the interesting usage of sub-queries are here in the where clause. So let me show you queries I'm going to skip over these and show you queries in the where clause. And I'm going to start with this simple example. So we have products manufactured by companies. There is this foreign key. Every company and every company attribute and product refers to a key and company. And I want to find all the cities, or these cities here, where there is a company that makes some products under $100. This is an existential quantifier. I want to find the cities where there exists a company for which there exists a product that is under $100. And it is very easy to write. We can actually write this with a nested sub-query, which I wrote here. So let's read it. We go over company, we return the city. And the condition that we check is that there exists a product that is made by that company. This is the key foreign key joint here. And the price is under $100. Notice the use of the exists keyword. This is a predicate, which I'm going to apply to a sub-query. Simply checks that that sub-query is non-empty. If that sub-query returns any products, then the condition is true, otherwise it's false. Yes? So how does this map onto the previous slide of the formal semantics of SQL queries? It's a good question. So I only showed you the formal semantics of flat SQL queries without nesting. How would you describe the semantics of a query like this where we have nested SQL queries? Yes? Multiple evaluations of that same thing where the sub-query is turned into conditions in the outer evaluations? Yes. So do the same nested loop. We apply the same. In this case, it's a single loop over company. And when it comes time to evaluate the condition, this is when you apply recursively the same semantics for the sub-query. Yes? So does Exist allow it to shortcut and stop looking for data after the first result is returned? And is that great? This is an optimization that the optimizer might choose to do. And I suppose most of them do it. It seems to be a simple optimization. I'm trying to think if there are any catches here, but it seems obvious that that's what they should do. OK. Now here is a way to think about this. This is not a Java program. When you write a Java program, you don't care so much about counting the number of loops. You care about the complexity of the program. But in SQL, this needs to be optimized. Optimizing nested queries, optimizing sub-queries, is much more difficult than optimizing straightforward select-from-ware. Even straightforward select-from-ware queries, they are difficult to optimize. You need to worry about join orders, about physical operators, about statistics. Once you throw in sub-queries, it's even harder. Many database systems don't do a good job at all at optimizing sub-queries. In fact, MySQL doesn't optimize them at all. I suppose a towel. I don't think Oracle optimizes them in any meaningful way. Only DB2 and SQL Server do a really good job at optimizing sub-queries. Postgres does a decent job, not a fantastic job. Bottom line, if you can, don't use a sub-query. Can you write this query here without a sub-query? Can you write this query that finds all cities where there is a company making products under $100 without any sub-query? Use a join. Use a join. And I hope I have it right here. No, I don't have it here. I need to write it. So let me write it. Select, I need to remember, so it's called city. Let's put a distinct, because otherwise we are going to get duplicates from product and company where price is under $100. And the join happens, and the join. OK, nevertheless, I'm going to show you the value space in which you can write this query using sub-queries using different syntactic constructions. What we see on this slide is the use of the exists keyword. Here is another keyword, which is in. In, it's like set membership. So here we check that the company name, which is just a value, is in the set of all companies manufacturing products whose price is less than $100. Exactly the same thing. And let me see, do I have another one? No, this is, oh, yeah. And another keyword, so this is a third way to write a sub-query, is using any. Any allows us to compare one value with a set of values. And it says check if $100 is greater than any of the prices of products made by our company. Right on my emphasis is on the syntax of SQL. Here are three different constructs to express the same thing. So one is exists, then we have any, and we have in. Sorry? Can we use how? Did you lose a connection? Something is wrong. Fred, we seem to have a connection problem. Oh, good. I think I switched too fast between the slides. OK, my point is that the next example I'm going to show you is going to use a universal quantifier, and that's much harder. So to warm you up, I want to express the same query using the relational, using the tuple calculus or first-order logic. But I want to take a question. Is there any difference in what you can express with any and exists because it seems like there are always more conditioning to exist and make it the condition of any? Any and exists. The question was, is there any difference? Logically, they represent the same concept, which is that of an existential quantifier. Syntactically, there are different constructions. Exists applies to a subquery, and it checks if it's empty or not. Any is a comparison, and checks if a value is greater than any of the values returned by a subquery. So syntactically, they are different. Conceptually, they are not different. They are both different ways to express an existential quantifier. So talking about existential quantifiers, how many people here know the syntax of first-order logic? Oh, we are not doing well. I was hoping most of you would raise your hands. First-order logic. This is, say, the fundamental of mass and massics. And it's extremely simple language. Let me show you two. You must see first-order logic at some point. First-order logic. This is a building block of mass and massics. And it's done using the following simple constructions. You can have an atom. Actually, let me raise this. So this is first-order logic. So you can have a relation symbol. And here you have terms, which can be variables or constants. This is one formula. Or you can have the conjunction of two formulas. Or you can have the disjunction of two formulas. Or you can have an existential quantifier. Or you can have a universal quantifier. This is all. This is first-order logic. And it underlines the fundamentals of mass and massics. There is something called second-order logic, which also allows you to quantify over sets. But the most important object of study in logic, in mathematical logic, is first-order logic. And there is. By the way, there is a wonderful book called Logic Comics, which I strongly recommend that you buy and read. Logic Comics. It's a comics book, so it's in the strips, about the history of logic. And it's written by Christos Papadimitriou. He is a professor at Berkeley. He's one of the biggest theoreticians, the computer science theoreticians that are out here. And this is a wonderful book about the struggle of mathematicians to define this fundamental of the foundations of mass and massics, which is first-order logic. It's a wonderful half-a-day read, absolutely charming. So getting back to our query, here is how you can write the same query in first-order logic. Did it freeze again? Fred, I think it froze again. Yeah, I'm working on it. I think it's just a local machine issue, but I'm in contact with Matt. He doesn't see any big problems, I don't think. OK, but that image froze. Yeah, I think it's just our machine that's not explained fluidly. I'm working on it. OK. So at Microsoft, can you see the slides at Microsoft? Yes. OK. So now we're looking at how to express the same query in first-order logic. But you see here is a formula where there is a single free variable, which is a city. We're looking for all cities where there is a company that manufactures some product under $100. So why is free? And sometimes we list it separately just to emphasize that this is a free variable. And here is how we express it. We say, check if there exists an x that's a company name, such that x, y belongs to company. And there exists a product named z and the price p, such that z, p, x, same x belongs to product. And p is less than 100. My point here is there are lots of existential quantifiers in this expression. And it expresses exactly the same thing as SQL. This is not a frivolous observation. The connection between logic and relational databases was a starting point of relational databases. That code in 1970 proposed to use first-order logic as a fundamental for storing data and databases. And that was the starting point of the relational data model. And this observation, that queries correspond to formulas. Now back to our discussion. We're going to move on to the more complicated query. So we saw this unnested using a universal quantifier. Find all cities where there is some company that manufactures a product, then manufactures all your products under $100. Yes? Go back to the slides, I was wondering, why are we binding the C variable? If you go back to the first-order logic slide. Yeah, I'm trying to go there. I'm going to be smart, but here it is. Why do we need to bind Z? It's just a matter of syntax. The question is, why do we bind Z? The point is that in this variant of the logic, which is actually the only variant used in mathematics, you must use three variables for product. Because product has three attributes. So you don't have a choice. You can't leave this field unoccupied. Some people, including myself, like to write this sometimes as is, like putting here an anonymous variable. But you can't leave that field unoccupied. So have you been seeking being a free variable? If Z is a free variable, if I drop this existential quantifier, then I'm going to return Z. What do I get? What is the meaning of this query? If I do this, yes? What is? Key name is primary key on product. But it returns something. So it's a new query that returns something. Can you follow it, Microsoft? The question is, what does this query return? And let me erase my superfluous here. So I removed the existential quantifier, Z. That was a proposal from the room. And I'm going to return now Y comma Z. Who can read in English what the new query does? Yes? It returns unnamed products for all product names that are less than $100 and the cities that they're made in. Exactly. It returns all the products under $100 and the cities where they are made. That's what it returns, which is not what we want, because the same city might be listed now many times, once for every product made in that city under $100. So now let's move on to the universal quantifier. And the query that I want to study here is find all cities with companies that make only products under $100. And I would like to first express this in first logic, because here it's very easy to express. Here we have quantifiers. The point is that SQL only has existential quantifiers by default, and it's harder to enforce universal quantifiers. Here's how we write it. Find all cities Y for which there exists a company X, X is a company in the city Y. And now we say for every product, for every way in which you can fill out the product name and the price, if you can find a product with this price manufactured by this company, then it must be the case that the product, that the price is less than $100. So now we would like to express this query in SQL. And for that, the trick to write these queries in SQL is to use negation, because the negation of a universal quantifier is an existential quantifier. Let me show you the rules that dictate this. They're right here. So the rules that we are going to use to apply negations are the following. The first two, you should recognize the Morgan's laws. It says that if you compute the negation of a and b, this is not a or not b. And if you want to compute the negation of a or b, this is not a or not b. And not b. Thank you. The slide is correct. I misspoke. Now let me ask you a question. The universal quantifier behaves like one of these two connectors, either or or end. Like which one? The universal quantifier, which is this, behaves like which connector, like an end or like an or? End. Like an end. This is like a universal quantifier says for every value of x, check that this predicate is true. And this is true. And this is true. And this is true for every single value of x. And the existential quantifier behaves like an or. And therefore, when you apply the Morgan's laws to the universal and existential quantifier, you get this. The negation of for all x, b of x, is the statements that exist x, such that not b of x. And similarly for the negation of the existential quantifier. So now let's look at our example here. So this is a query we started from right here. So I'm going to write this piece, which is the problematic universal quantifier using the negation. So I'm going to negate. And of course, I'm going to push the negation inside. And the negation of an arrow, I wrote it here. What is a, r, o, b? Do you remember this from logic? From vision. It's the same as not a or b. How can a, r, o, b be false? But obviously, if a is true, then b has to be true. But what happens if a is false? Then it's true no matter what b is. If we say whenever it rains, I take my umbrella. But if it doesn't rain, I can still take my umbrella. Doesn't contradict the implication. How can you make a, r, o, b false? When is it false? It's false only when a is true and b is false. So it is false every time when either not a is true or b is true. And when you negate it, you get n. You get a, actually, this is what we just said. In order to make it false, you have to make a true and b false. This is what we wrote here. So getting back to our example, when you push the negations through this sub-expression, universal quantifiers become existential quantifiers, and arrow becomes n. And of course, the price reverses the sign. So this says, find all the companies such that there is no product made by that company whose price is over $100. Now it makes sense. There is no product over $100. All the products have to be less than $100. So now we can express this using difference. It's like finding all the companies first, or possible, actually, or possible cities, and then subtracting those cities where there is a company that manufactures some product over $100. In the letter, of course, we know how to write in sequence because now we only have existential quantifiers. Definitely something to study for you at home. Let me show you quickly how we write this in sequence. Here is one possibility. Start with the negation. Start by finding all the other companies, all the companies that manufacture some product over $100. And we know how to do this. Here is one possibility. Return all cities where the company is in the set of companies making products over $100 and close parentheses. And then take the negation. So instead of in, you write not in. You might ask, why don't you write not in parentheses in the whole expression? And again, it's a legacy of a syntax developed in the 70s. And that's what they came up with. SQL does not have the most elegant syntax. OK, here is another way to write the same query. There is no for all quantifier in SQL. There is only exist. But we can use not exist. And it looks like this. Now it should be clear. If you drop not, then this finds all companies or, sorry, it finds all cities where there is a company that manufactures some product with prices over $100. When you put not, you get your universal quantifier or the cities where all the companies make only products over $100 or less than $100. And finally, we had any. Now we do have a universal quantifier that's called all. And here we, the way to read this is this checks that the value 100 is greater than all the prices that you obtain by running this query, this subquery of products manufactured by that company. Yes, a question. So you talked about all three of them kind of reducing out of the same logical statement. But this one seems to imply a lot more work. It implies that it's actually looking at every single price. Is that true? Are we influencing how the execution plan is done by the way that we phrase it? This is a very deep question. So does it matter how we write the query in terms of execution? And the principle of declarative programming says that it should matter. That's the job of the optimizer to figure out the most efficient plan. But again, keep in mind that the subqueries are kind of the toughest thing for the database optimizer. Probably, if you try this on SQL server on DB2, it might not matter. They might be able to produce the same plan for all three. Try this on MySQL. And I guarantee you it will matter a lot. And then we can discuss which one is more efficient. Yeah, this one is probably not that efficient. The last one I wrote. However, there is another subtlety that we need to pay attention to. And not all these queries are necessarily equivalent. And let me just quickly take a look and see if it makes a difference. No, all these queries are equivalent. But we have later examples when there might be subtle differences. When a subset is empty, that's what you have to watch for. Yes. But in this case, all three queries are equivalent. A good optimizer won't care which one of them you type. Less performant optimizer will care a lot. And some of them will be much less efficient than others. Good. But now, of course, we want to unest the universal quantifier. Because if you can unest it, then you're in good shape. And my question to you is, how can you write a previous query? This one here. How can you write it without nested sub queries? Like a simple select from where SQL queries. Think SQL query. Can you do this? Can you use count? Can't do it. Can you compare the counts? You can use count. That's correct. Once you introduce aggregates, then there are ways around it. But without aggregates, because they also have their own perks. It's a simple join. Joints and selections. Anyone venture to make a bold statement. Can we do this? Any guesses? Is it possible? He's a habit in class. It's pretty easy, but I think you're not looking for that. No, because that again uses aggregates. No, the answer is no. It is not possible to express this query using only standard select from where query. And let me explain you why, because the reason why you can't do this is interesting. And for that, I'm going to define a monotone query. So all the queries that we discussed so far and all the queries that we will ever consider, they are functions. They take some input tables, the tables in the database, and they return a new table, which is the table of answers that it returns. I'm going to call this query monotone if it has a following property. Whenever you increase one of the tables in the input, the answer cannot decrease. The answer must continue to have all the old tuples, maybe some more tuples, but none of the old tuples should disappear from the answer. Then it is called monotone. Now here is one simple fact, this one here. All the select from where queries, without aggregates and without sub queries, all select from where queries are monotone. Do you agree with the statement? Here it is. Here is my semantics. Well, think about the formal semantics. Suppose you increase R2, which is underlined. If you add more tuples to R2, can you ever lose an existing tuple from the answer? No, because it will continue to iterate over all the old tuples. It will treat them the same way it was treating them before. And it might add more tuples to the answer, but it will never drop tuples when you add a new tuple to R2. So all the queries, so all the unlisted queries are monotone. Now let's think about this universal quantifier, this query here. Is this monotone or is it not monotone? Find all cities that make some product under $100. No, but they see some. I'm on the wrong slide, sorry. It's not some. This is wrong. It should be only. So on this slide, some is wrong. Are you confused? This is a query I want to look at, and I just wanted the English sentence for it. So find all cities with companies that make only products under $100. Is this query monotone or not? Why not? Because you can always add a product to the city because there are 100 and then the universal quantifier will vary. Yes, so you can always add a product with a price over $100. And then you will lose a city in the process. So bingo, here is our statement. Therefore, that query cannot be expressed as a single select from where query. You must use something else, either a subquery or an aggregate or something else. And that concludes my discussion about nested subqueries. The rule of thumb is whenever you can, try to avoid subqueries. However, if the query is not monotone, then rest assured that you can't avoid nested subqueries. Or you can avoid them by using some fancy aggregation, some fancy way of using aggregates. OK, so we still have two more topics. This is an exercise for home. Try to read these queries and try to express them in SQL. We are a famous example from one of the earliest and one of the best textbooks on database systems written by Jeff Orman in the 80s. It's an example with drinkers, bars, and beers. And you have to answer things like this. Find all drinkers that frequent only bars, that serve only beer, that they like. Look at these queries at home, and you will enjoy them. And try to express them in SQL, and you will hate them. And that's a learning process. Good. The next topic for today is aggregation. It's easier, but I do want to show you. I want to draw your attention on using group eyes as opposed to subqueries. That's my main goal. Let's start with simple aggregations. Count star. This says go over products. Select only those made after 1995. Count them. Return just a single number. So unlike the previous queries that return a table of answers, here we return a single answer, which is just the number. Account. We also have there are five standard aggregate operators from the early versions of SQL. Now all more than implementations, they support many more. But the only five that we are going to use are sum. Sum, count, min, max, and average. Another example, notice that if you count an attribute, it's like counting everything. If you count the number of categories, it's like count star. If you really want the number of distinct categories, you need to add distinct. I want to check that you're not confused. Which number is bigger? This or this? The first one or the second one? Which one is bigger? The first one. More examples, you can also aggregate over any expression. You can multiply the price and quantity and then take the sum of all these price times quantity. Nothing, no surprises here. So let's go to the interesting examples, which are with group i. So in practice, we don't want a single aggregate. But we want many aggregates, and every single aggregate operates over a subset of the tuppers. This is done with a group i construction, and I suppose many of you have seen it. Let's read this query. It says, go to purchase, select only those whose price is greater than 1, and then create different groups by the value of the product. For every product, there will be a different group. And in each group, compute the sum of quantities. So the result now is not just a single number, but it's a table. It's a table that has the first columnist product, and the second column is the sum. It's actually called total sales. Let's see how it works. It works like this. First, the system computes the from and the where clause. So it eliminates. It selects only those tuppers that satisfy the where condition. Then it groups them according to the attributes in the group pi clause. And then for each group, there will be a single output tupper. And that tupper might have an aggregate value that's specified in the select clause. So in our example, the condition was that price is greater than 1. So this tupper is removed. And then we have two groups, one group for bagel and the second group for banana. And then we need to return, if I remember correctly, the price times quantity or it was the sum of quantities. So for the first group, we return 40. And for the second group, we return 10. So here it is 20. So here they are. The important thing to remember here is that every group, no matter how big the group is, will produce a single tupper in the output. So these are SQL queries with aggregates. Notice that there is a wrong way to write them. I wouldn't say necessarily wrong. There is an alternative way to write SQL queries with aggregates that I strongly advise you to avoid, which is using nested subqueries. Why avoid this for the same reason? Because optimizers are much better at recognizing and optimizing group i's than they are at optimizing nested queries. So here is another way to think about that query. It's a rate over purchase. We treat only those whose price is greater than 1. And kind of return the product name. But now for that product name, you need to add up all the quantities, which is something you can do in any language like Java, ML, Scheme, X-Sharp, you can do this by using a nested loop right here. So how do you do this? You go over purchase again and find those products. I'm involved all entries that correspond to our current product. So we look at all other sales of the same product. Add them and return their sum. And this is what we stick in next to the X dot product. Notice the redundancy here. We need to use price twice, right? What happens if we don't include this price? Then products where all the prices are less than 1 will be included in the zero quantity, which actually highlights an important observation. Namely, every group returned by the group i is non-empty. That means that if you're summing positive values or if you're counting, you will always get a non-zero value, which sometimes is what you want. Sometimes it's not what you want. And then you have to work around that using either nested sub queries or not, or outer joints. I will show you how to do this. So that is the observation I just mentioned. Every group is non-empty. You can't have empty groups. You can't take, if you're returning count, you can never get a zero. In order to get that zero, if you want it, then you need to use a left outer join, which I still hope to be able to cover today. Finally, the last piece of syntax that I'm going to describe about SQL is the having clause. This is like the where clause, but there is a difference. The having clause allows you to check the value of an aggregate. So for example here, the where clause only checks one top at a time. The having clause applies to an entire group. And as a consequence, it can check and aggregate value that applies to the entire group. So this is the entire SQL language that we would use in the class. So let's revise the syntax. The SQL query consists of select, from, where, group by, and having. It might also have an order by clause, but I didn't include that. How is it evaluated? First of all, let me actually go in the next slide and show you how it's evaluated. First of all, the from where part is evaluated using the formal semantics I gave you earlier with the nested loop semantics. Then the system forms groups by the attributes a1, ak, these ones here. So now there are groups in this table. Then for each group, condition c2 in the having clause is checked. And if that condition is false, then the entire group is put away. Are you with me? And finally, for each group, we compute s. We compute the aggregate and possibly some non-aggregated attribute, and that's what we return. I'm going to go back one slide to ask you this question. There is a rule, a syntactic rule, which says every attribute that you use in s, let me erase here, every attribute that you use in s must occur in the group by clause. Why? Every attribute that you use non-aggregated in s in the select clause must be one of your group by attributes. Because otherwise, you don't know which value to pick for the group. Exactly. Because of the rule that the entire group will produce a single tupper in the result. So if you want an attribute for that group, you better ensure that that attribute has the same value for all the tuppers in the group. And the way to ensure that is to have that attribute be listed in the group by clause. And then you know that the entire group will have the same values for that attribute. Good. So you will practice all these kind of queries on the IMDB movie database. I want to show you two subtle constructs in SQL before we discuss nulls. That's the last topic that I want to cover. One is how to un-nest aggregates, and the second is how to find witnesses. So un-nesting aggregates is in the spirit of what we discussed so far. What does this query do here? It says here it is. Find the number of companies in each city. So the right way to write it is like here. You iterate over company. You group by city. And then you just return the count. But here is another way to write it. You iterate over company. And you return the distinct cities. And then for each city, you try a nested subquery. You discount how many other companies are there in the same city. That's exactly what we discussed earlier. Now let's look at this query here. Now it becomes more interesting. This finds the number of products made in each city. And the right way to write it is like this. We join company and product, group by city. And we count for each city the number of products. Now the alternative is to use a nested subquery. And I claim that this time, they are not equivalent. And I would like to ask you, why are they not equivalent? What is the difference? Let me erase so you can read better. There is a city with no products. Can you please pick up? If there is a city with no products, then it won't be shown on the second one. Exactly. If a company does not manufacture any product, it's just it's here. It's a consulting company, OK? It doesn't have any products. What happens to that company? Will it show up here? No, it will not show up here because the group for that company is empty. It's a city. I should have admire the state. The city is such that all the companies in that city are consulting companies. They don't make any products. So therefore, there is no group without city. And that city is not listed. However, in the top query, that city first of all is included in the outermost query. It's one of the cities that is found in the company table. And then when you actually do that join and look for matching products, how many matching products will you find? Zero. There are no matching products. So in that case, the count is zero. But here, it's not included at all. That city is not included at all. That is a distinction. If you want the products, if you want to include the cities where there are zero products manufactured, the right way to do this is to use outer joints. So let me move on so we can discuss. We still have time for outer joints. OK, please, please, just slide it at home. It shows you how to unnest. If you want to write a query with nest subqueries, well, that's not ideal. If you write it correctly, then it looks like this. And I invited to read this at home. I need to show you this important technique for writing SQL queries that you need in the homework, which is finding witnesses. And the query here is for each store, find the most expensive product in that store. So we need to list a store name. And then we look at the price. And we need to find the store ID as a foreign key. We need to find the product that is the most expensive product manufactured at that company. Let me show you the easy part. It's easy to find the highest price because that's just a max aggregate. So we iterate over, we join store and product. We group by the store ID. And we compute the maximum price. But this is not what we want. What did we want? Products. We want the product that is the most expensive. By the way, actually, I don't want the SID. I want the store name. And the store name is called this name. A quick question to you. What happens if here I write store name? Because that's what we want. We want the store name. We'll throw an error, but isn't in the group I close? Then it's in the group I. Correct. But now let me do this. Let's suppose I list both. Because actually, you do want both. Now it makes sense because SID is a key, right? Therefore, in every group, the name will be the same because it's just the same company. Will that work? No. In group, I know it. It still doesn't work. Because for reasons I actually don't understand why, SQL, at least SQL server where I try this, does not make this inference. Even though SID is a key, it will still not allow you to list other attributes unless you also list them in the group I. So the right way to do it is to write here store.name. So getting back to witnesses, how do you find the matching product? And the answer is there is no easy way. You have to do it the hard way. The hard way means maybe use a nested query like here. You find the maximum price for every store. Then you reuse this query to iterate again over all the stores in the product. And now you look for only those products where the price is this maximum price that you have computed in the subquery. Yes? So can we use the all quantifier to have a simpler query? Yes, the question is can we use the all quantifier? And the answer is yes, I hope you digested this slide. Here is another one with all. Very good intuition because that is the property that distinguishes the products that you want. It's a product whose price is bigger than all the prices of the other products made by that company. And you can write it like this. And I think there is a third way. No, I didn't have it. But you can find another way to write it. Couldn't we use an exists here because I think you can negate this all and get an exist and have an unnested query or it's not possible? No, you cannot unnest this because this again is a non monotone query because, well, if you add another product that's more expensive, then it will modify. It will kick out the previous product. But it didn't change the number of tools at all. It doesn't change the number. But the definition of a monotone query is subtle. You have to read it carefully. It says that all the tuples that were returned must continue to be returned after you have increased one of the tables, which means that all aggregate queries are non-monotone. Once you have aggregates, no matter what you write, once because you're changing the value of the aggregate function, that makes them non-monotone. Good, so I wonder what to do now. We still have about four minutes, but I don't know how much energy you have left to go over nuts. Or would you like to go over them by yourself? And then next time I can take questions about nuts. No opinion? Let's stop here because I see some tired faces and I don't want you to get there from the first lecture. Please read these at home. And we will only discuss them if I get questions from you next time. And otherwise, we will move on to logical design, to conceptual design and logical design. OK, have a nice day. See you next week.