So hello, everyone. Welcome to the last lecture. We still have one minute, but I think I'm going to start, and start slowly, allowing people to still come in. I'm not going to discuss the final. Don't ask me about the final. A few people are still working on it, and they're probably not here. I want to mention just a few words about homework 7. I apologize for the snafu. My intention was to not allow any teaming on this homework, but we forgot to remove that from the homework description, and someone pointed this out to me. So if you teamed up, OK, just submit it twice. If you're still working and you teamed up, please try to continue as much as possible without partnering. We want a separate submission of this homework from each of you. And somebody asked me for an extension, and that's fine. We can have an extension until tomorrow morning. Don't spend your night on this homework; I don't think it's worth it. But if you need some more time, Jessica will open it up. We'll make sure that we accept the homework until tomorrow morning at 7 o'clock. I think that's the latest deadline we can manage. So that's about the last homework. Other than that, today we have a lame-duck lecture, but I want to show you two very cute topics. I'm going to try my best to make it worthwhile for you to come in here. These are topics in data management, and they are not about performance. So far, most of our discussion was about data independence and the idea that we can still achieve performance while keeping this independence from the physical layout decisions. But data management is actually a much richer area, a much richer topic. What I want to briefly show you today are two pretty hot topics in data management. I'm sure you've heard about the second one, data privacy. That's very hot, but it's also unsolvable in some sense. But the other one is getting very, very important.
The abstractions to think about for data provenance became pretty clear in the last few years in the research community. They are not available yet in any book or, as far as I know, in any product. So this is what I'm going to spend most of my time on today: data provenance. It is also called data lineage, or data pedigree. The slides that I have come from Val Tannen's keynote at a conference last spring, EDBT, the conference on Extending Database Technology. So these are essentially his slides; a subset of his slides. So what is provenance? Here is a dictionary definition: the fact of coming from some particular source or quarter; origin, or derivation. That is the dictionary definition. What we want to do, essentially, is keep track of where the data comes from. The main motivation for the researchers comes from scientific data management. The scientists start from some raw experiments. Then they process that somehow. Then they run some tools on the processed data. They integrate it with other data. Then they run more processing, more aggregation. And now they want to keep track of things like: they change one of their basic experiments, so which of the derived data is affected? Or conversely, you don't trust one of the results in your derived data, so which of the input data contributed to it? That is the purpose of data provenance. All we want is some notation, some kind of formula, that allows us to express how the data got there, how it was derived. So, two slides about the terminology. Should we call it data provenance, or data lineage, or data pedigree? You'll hear all three terms used today for data provenance. But the cute thing to keep in mind is that pedigree is used for dogs, lineage is used for kings and royalty, and provenance is for art. So in data, let's be artsy and call it provenance.
But I should warn you that I often call it lineage. I rarely hear the term pedigree; I think it's used in industry. But research is between lineage and provenance. We are with the arts and the kings, not with the dogs. So what kind of data transformations would it be nice to carry the provenance with? In general, it would be nice if any data transformation could automatically carry with it the provenance of the underlying data. But right now, we only have the technology for queries and views, for these two. For the rest, it's a black art; it's a matter of trying to keep track of some information about how the data was processed, because you won't be able to store the entire tool that processed your data. In the case of queries, things are more under control, and this is the case that we need to understand first. When we process data with transformations — we integrate it, we query it, we join it, we select it, we project it — and now we have a tuple in the result, we should be able to answer the question: how was this tuple derived? Where did it come from? OK, so let's do the beautiful math that comes with it. The math that you need to keep track of provenance is called semirings. So how many people have heard of semirings? Not that many, I guess. How many people have heard of rings? What is a ring? Almost everyone. It's a mathematical definition, and a semiring is half of it. I'll show you what it is. So let's start with a simple example. Suppose all I do is create a view where I join a table R with a table S on some join condition. And now this tuple here, the tuple abc, joins with the tuple dbe, because the join condition decided to join them. Here is my output. And I want to keep track, in my provenance data, of the fact that this output tuple came from these two tuples, whose annotations are p and r. So I'm going to use this notation, p · r, which just says that to derive this output, the tuple abcdbe, we made joint use of both p and r.
Think of it like a record that contains a pointer to p and another pointer to r, but with side information that says these two were jointly used when we derived the tuple. Let's see something else: a union. Suppose we union R and S, and we do this with duplicate elimination. So it's a union — not a UNION ALL, but a union with DISTINCT. Now both R and S contain the tuple abc. Its annotation is p in R and r in S, but there's only one copy in the answer. So what is the provenance of this tuple? Well, we need to keep track of the fact that it could come either from R or from S. Again, we have two pointers, one to p and one to r, but now the side information tells us this is an alternative use of the data: we can have either p or r, and either of them is enough to produce the tuple abc. So far, we have seen two operations: joint use and alternative use of the data. Let's see another one. Suppose we do a duplicate elimination, which is like a standard projection. Now we only retain the AB attributes, and as a consequence, all three of these tuples get combined into a single tuple ab in the output. This is another instance of alternative use: now we say that this tuple has provenance p, alternatively r, alternatively s. So that's the notation that we need. OK, so here is a complicated example. What does it do? Well, it projects R on AC, then projects it on BC, then joins them; then it projects on AB and on BC and joins these; it unions the two results, then projects the result on AC, and finally does a selection. We don't have to go through the details — and actually, they're missing from the slides, but you can work them out. I know they work correctly because I read the paper that has this example.
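The three basic annotations seen so far — joint use for a join, alternative use for union and for projection with duplicate elimination — can be sketched as symbolic expressions. This is my own minimal illustration, not code from the lecture; the class and variable names are made up:

```python
# A symbolic provenance annotation. "*" builds joint use (the dot),
# "+" builds alternative use, exactly the two operations from the slides.

class Prov:
    def __init__(self, expr):
        self.expr = expr

    def __mul__(self, other):            # joint use: both inputs were needed
        return Prov(f"({self.expr} . {other.expr})")

    def __add__(self, other):            # alternative use: either input suffices
        return Prov(f"({self.expr} + {other.expr})")

    def __repr__(self):
        return self.expr

p, r, s = Prov("p"), Prov("r"), Prov("s")

# Join: the output tuple abcdbe jointly uses p and r.
join_annotation = p * r
# Union with DISTINCT: abc can come from p (in R) or from r (in S).
union_annotation = p + r
# Projection with duplicate elimination collapsing three tuples into one.
proj_annotation = p + r + s
```

Evaluating these gives the expressions from the slides: `join_annotation` prints as `(p . r)` and `union_annotation` as `(p + r)`.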
And what you get in the result is a table with attributes A and C. Here are my A and C, here are the tuples that participate, and here are the provenance expressions that annotate this table. OK, so let me erase everything so we can go over them slowly. Obviously, you can see dots here, which mean joint use — here we made joint use of p and p; here is another one, where we made joint use of r and p. You can see instances of plus, which mean alternative use of tuples. You can also see instances of 0. What does a 0 mean? Where does a 0 come from? Any hints? We didn't discuss 0 before. Look at the last operation in this complicated plan. You see it? What is the last operator? It's a selection on the attribute c = e. Now, the first tuple — does it satisfy the selection? No. So because of that, we put a 0. The next tuple satisfies the selection, so because of that, we put a 1. That's how we get the 0s and 1s; this is for the final selection. OK, so let's see what we have seen so far. We have this mysterious space of annotations; these are going to be our provenance annotations. We have relations where every tuple is annotated with something in this space K of provenance annotations. Now, K is like a mathematical object. It has some structure — some interesting operations. There is a dot, which means joint use, which intuitively says: I need both things in order to get my tuple. And there is a plus, which means alternative use. It's like for union: I can have either the left thing or the right thing, and either of them is fine for my tuple. In addition, we assume that K contains two special annotations, the 0 and the 1. Let me see if we have better explanations. Yeah — 0 means go away. It's the annotation of a tuple that's not there, the provenance of something that you shouldn't have. It's a throwaway provenance. And 1 is the opposite. It's like saying: yeah, I'm happy with this.
I have it; there is nothing to block me from having this tuple. So now we have a mathematical object. This is why it's so elegant: we started from writing down some very natural provenance expressions, and it turns out that the space we need is essentially a mathematical object. It is a set K with two operations, plus and times, and with two special elements called 0 and 1. Do you remember from mathematics any algebraic structures that have a plus, a times, a 0 and a 1? Sorry? Right — arithmetic, the natural numbers, absolutely. But let me think: what about modular arithmetic? Same thing. If you take all the numbers modulo 7, there are only seven of them — 0, 1, 2, up to 6 — and you also have a plus, a times, a 0 and a 1. In general, if you have any such set with a plus, a times, a 0 and a 1 satisfying certain properties, then it's called a ring, or a field if it has some additional properties. These are algebraic structures that have been defined in mathematics, and I'll show you in a bit how we get there. Now, here is why we need algebraic structure on the provenance expressions. Remember that the queries we write satisfy certain algebraic laws. We can't forget those laws; we can't ignore them. We must be prepared to equate provenance expressions coming from plans that represent the same query, because those plans are equivalent under these algebraic laws. And the reason we need to do this is that all the optimizers feel free to use these algebraic laws to optimize. So what are these laws? Well, one is that union is associative and commutative — and we know that; you had to study the fact that union is associative and commutative. Joins are associative and commutative, and they distribute over union; that was on one of the slides we had. And more laws hold. But interestingly, there are laws that optimizers rarely use — actually never use — although they hold.
And why don't they use these laws? It is written on the slide, right? Because these laws hold under set semantics, but they don't hold under bag semantics. For the optimizer, the algebra is under bag semantics, so it doesn't use laws that only hold for set semantics. OK, so now here is a mental experiment that you need to go through. Imagine two plans that are equivalent under these laws. Let me write the typical thing: (R join S) join T is equivalent to R join (S join T). Now, if you compute the provenance of a tuple on the left, you'll get an expression of this kind, (a · b) · c: you must use both a and b, and then you must use c. The same tuple on the right will have a provenance that looks like a · (b · c). And of course, we want these two provenance expressions to be the same. We insist that they be equal, because we don't make any distinction between these two plans, so we'd better not make any distinction between these two provenance expressions. It turns out that if you add to this algebraic structure exactly the laws you need in order to ensure that queries with the same semantics also have the same provenance expressions, then what you get is the commutative semirings. And I'll show you on the next slide — I hope it's on the next slide — what a commutative semiring is. As a consequence, the relations that you need to manipulate in order to keep track of provenance are relations annotated from a commutative semiring. Very interesting. So let me show you what a commutative semiring is; this is something you might remember from algebra. K is, of course, the set of all annotations that we are willing to accept — think about these expressions that build up as we compute provenance. Plus is an operation on these annotations — the alternative use. It must be associative and commutative, and 0 is its identity. Remember, 0 meant no provenance at all:
I don't have this tuple; you can't have this tuple. And 0 has to be the identity for plus. Times must be associative and must have 1 as identity. Moreover, times must distribute over plus — this is because join distributes over union — and you must have this law, that anything times 0 is 0. So this is called a semiring, and it is not a ring. In a ring, plus has an inverse element — it always has an inverse — while times does not. If times also has an inverse, then it's not called a ring; it's called a field. And this is real mathematics: in algebra, they're very concerned about the distinction between a ring, where you don't necessarily have an inverse for times, and a field, where you do. They have completely different properties. A semiring is one in which not even plus is required to have an inverse. In addition, we want times to be commutative, because joining R and S is the same as joining S and R, and all optimizers will happily switch the order; therefore, we want R join T to be the same as T join R. So that's a commutative semiring, and this is semiring provenance. If we have such a semiring — and I'll give you examples of very interesting semirings — then you can write provenance expressions, and depending on what semiring you choose, you can keep track of provenance at various levels of granularity. OK, so now let me show you what's really great about these provenance expressions. Look at these expressions here; I want to simplify them. For one thing, how would you simplify this expression here? In a semiring, everything times 0 is 0, so this goes away; it's equal to 0. And if it is 0, you can actually drop that tuple from the relation — there is no provenance. How would you simplify this expression? In any semiring, if you multiply by 1, you get the same element back. So the 1 simply disappears, and this one is 0, of course. Now, this one here is more interesting.
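The semiring laws just listed can be spot-checked mechanically. This is a sketch of my own, not from the lecture; the `Semiring` container and the two example instances (natural numbers and Booleans, both mentioned later) are my naming choices:

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable

# A commutative semiring packaged as its two distinguished elements
# and two operations.
@dataclass
class Semiring:
    zero: object
    one: object
    plus: Callable
    times: Callable

def satisfies_laws(K, elems):
    """Spot-check the commutative-semiring laws on a finite sample of elements."""
    for a, b, c in product(elems, repeat=3):
        assert K.plus(a, b) == K.plus(b, a)                        # + commutative
        assert K.plus(K.plus(a, b), c) == K.plus(a, K.plus(b, c))  # + associative
        assert K.plus(a, K.zero) == a                              # 0 identity for +
        assert K.times(a, b) == K.times(b, a)                      # . commutative
        assert K.times(K.times(a, b), c) == K.times(a, K.times(b, c))  # . associative
        assert K.times(a, K.one) == a                              # 1 identity for .
        assert K.times(a, K.zero) == K.zero                        # 0 annihilates
        assert K.times(a, K.plus(b, c)) == K.plus(K.times(a, b), K.times(a, c))  # distributivity
    return True

naturals = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
booleans = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
```

Running `satisfies_laws(naturals, [0, 1, 2, 3])` checks the laws on a small sample; note that neither instance has inverses for plus or times, which is exactly why they are semirings and not rings.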
r times r. We can't simplify it, but we can write it in a more clever way. How can you write it? r squared — yeah, that's the standard notation for multiplying something with itself. But here we get r squared again. So what's the standard notation when you have the same value added twice? How do you usually write it? You write it like this: 2r². It doesn't necessarily mean that 2 is an element of your algebraic structure; it's just how you write it, as an abbreviation. So these provenance expressions end up looking like this. Let's backtrack and see where we are. We had a complicated query that took just this table R and produced an output. It was a very complicated query, but if we carefully compute the provenance of every tuple in the output, this is what we get: these three expressions. They look like polynomials — like something you studied in school. But I'm going to show you now that they have a very, very interesting semantics in terms of provenance. Let me see where this comes from — yeah, it comes right here. So let me erase and ask you how to read this. How would you read this? How would you describe it in English? The provenance of the tuple ae is p · r. How was ae derived? From p and r, and both had to be present in order for ae to be in the output. OK, this one is much more interesting. How was this one derived? How do we say this in English? How many choices are there? It kind of depends how you count. Actually, there are three choices, three different ways to derive it, because the polynomial is r² + r² + rs. One of the three choices used which tuples? r and s, right? So we used r and s and produced the tuple. And what did each of the other two choices do? It used r, but it used it twice, in two different places in the query. It used r twice, and that is how we derived it. Very interesting. So this is how you read the provenance semiring expressions: they tell you a lot of detail about how the tuple was derived.
And you can use this detail for further processing. Let me show you some examples. This slide says in English what we discussed. So at Penn, where this research was done, the semiring expressions were implemented in a research prototype called Orchestra. And one of the things they did was deletion propagation — update propagation. Here is the thing: you have computed Q and stored it, and in Orchestra it is stored on a different server; it's a distributed system. Now the original source says: I want to delete this tuple. What should I do to the output? I don't want to recompute the query. How can I quickly, cleverly update my output? What should I do? Exactly — delete the affected tuples. But actually, in a slightly different way: you set r to 0. Remember the 0 in our semiring? That's allowed; 0 is always a valid element of the semiring. Set r to 0 and see what happens. What happens to the first one? 0. The second one? 0. And the third one? 2s². So the first two disappear, and the third one stays around — and you also know its new expression, so if anything happens to s, then it disappears too. And I think the next slide shows this. So this is what they say: set r to 0, this is what we get, and this is what we consolidate. OK, let me show you something else that's cool. What we had so far was essentially set semantics: every tuple in R occurs only once, and every tuple in Q occurs only once. But now I'm going to change my mind and give R a bag semantics. The first tuple will occur three times, the second tuple occurs twice, and the third tuple occurs four times. As a consequence, the tuples in Q will also be duplicated. How many times do I get ae? Six times, because p is 3 and r is 2, and 3 times 2 is 6. And how many times do I get de? 2r² is 2 times 4, which is 8, plus rs, which is 2 times 4, 8 again. So 16 times.
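Both tricks — deletion propagation by setting an annotation to 0, and bag counting by plugging in multiplicities — amount to evaluating the provenance polynomials in a semiring. A small sketch of my own, using the two polynomials from the example (p · r for ae and 2r² + rs for de); the function names are made up:

```python
# The lecture's provenance polynomials, evaluated over the natural numbers.

def prov_ae(p, r, s):
    return p * r                 # ae needed both p and r jointly

def prov_de(p, r, s):
    return 2 * r * r + r * s     # de: r used twice (two ways) or r with s

# Deletion propagation: delete tuple r by setting its annotation to 0.
after_delete = {"ae": prov_ae(3, 0, 4), "de": prov_de(3, 0, 4)}

# Bag semantics: p occurs 3 times, r twice, s four times.
bag_counts = {"ae": prov_ae(3, 2, 4), "de": prov_de(3, 2, 4)}
```

With r set to 0 both tuples vanish, and with multiplicities (3, 2, 4) the counts come out to 6 for ae and 16 for de, matching the numbers worked out above.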
So you get the bag semantics for free from these annotations: they carry enough information to recover the bag semantics. I found this quite cool, right? So imagine annotating these tuples as they move around. Then you can use these annotations in lots of interesting ways: for update propagation, for switching from set semantics to bag semantics, and I will show you other usages in a bit. So what interesting semirings are there? So far, we just used these abstract annotations, these formal symbols. But there are concrete semirings that you might recognize — somebody said arithmetic, and arithmetic is an instance of a semiring — and they give us different information about the provenance. What happens if we go back here and imagine replacing all these symbols with numbers, as we did before? Instead of an expression, we get a number, right? Then the semiring is this: the natural numbers, where plus means just addition — when you have an alternative, it's like counting; you just add them up — and times is multiplication: if you have joint use, and the left one is used twice and the right one comes with three copies, then the joint use has six different combinations. So one particular choice is to use the natural numbers as the semiring, and then we just get the bag semantics. Of course, if you replace those expressions with numbers, then we lose something: we no longer trace the provenance as we intended to. But the point is that the same abstraction, namely that of a semiring, explains to us both the provenance and the bag semantics. It also explains the set semantics. If we just want plain vanilla set semantics — well, we didn't need semirings to start with, but if you insist on using a semiring, the semiring to use is the Booleans. It has two values, true and false, which are 1 and 0; sometimes they are denoted top and bottom. Yeah, these are switched on the slide, I now realize.
What is plus in this semiring? What should it be? It's or. So this is my plus — they are switched on the slide — and times is and. OK, there are other interesting ones. This is the most interesting one. But maybe we'll come back — no, let's discuss it right now; actually, I think I have some animation for this. Here is an interesting application: think about access control levels. Now you're designing a database for the military, and every single record has an annotation, which can be public, classified, secret, or top secret. That's what those values represent: public and top secret at the ends, and in the middle, classified and secret. As you process your queries, you would like to keep track of the secrecy of the data that you're processing. For example, what happens if you have to compute a join, and one of the tuples is public and the other one is secret? You join a public tuple with a secret tuple — you get a tuple which is what? Which is secret. So what happens? Therefore, our times is what operation on this ordered set? If I take two elements, like public and secret, and I do the joint use, what is the result? The maximum. Very interesting — the max. Now what about alternative use? The user asks a query that is a union of two tables, and I get the same tuple in both tables, but in one table it's secret and in the other table it's public. How should the answer be classified? Clearly public, right? Because if you found it public in one table, then, well, it means it's available. So the plus — if times was max — the plus is min. That means that if you annotate your tables with this semiring, the access control semiring, then the results of your queries give you for free the annotations of the classification level, the secrecy, of the output. And it's the same abstraction: you don't need to learn something new; it's the same semiring abstraction. It's very, very nice. So what are 0 and 1 in this semiring?
That's very interesting. What is 0? Yeah, it's so secret that nobody can see it — it's no answer at all. This means that if you join something with 0, you get 0: no such thing, right? What is 1? If you join something that's 1 with something else, then the 1 doesn't matter; the result is just the something else. So 1 is public. You can play the same game in this other semiring, which takes numbers as levels of confidence or trust — here, bigger would mean you trust the information more. And then the operation would be min — wait a minute, what's going on here? Yeah: min is the plus, and arithmetic plus is the times. So no, sorry, it's not trust; it's cost. This is the cost of the data. Imagine a setting in which every item in the database costs some money. You want an answer from the database, and you have to pay for all the items that your query touches. Now, if you do a join, what is the cost of the joined tuple? It's the sum. But if you do a union — one tuple costs so much, the other tuple costs so much — what's the cost? Well, I wouldn't pay more than the cheaper one; we pay the minimum. So this is why arithmetic plus represents joint use, and min represents alternative use, either for duplicate elimination or for union. Very interesting. And there are other examples that people have used. This is very exciting for researchers, but maybe we should skip those examples. So I wonder how deep I should go into this, but maybe it's worthwhile to consider. Let me tell you why this used to be so confusing. Research into data provenance started in the early 90s; it was quite active in the late 90s and early 2000s, and people were very, very confused. This is a very highly cited piece of work by Widom, who is a professor at Stanford. They wanted to understand provenance in data warehouses. They didn't come up with the semiring idea — that is a very recent development.
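The two ordered semirings just described — access control (times = max, plus = min) and cost (times = arithmetic plus, plus = min) — can be sketched concretely. This is my own illustration; the level encoding, the extra "nobody" level playing the role of 0, and the function names are assumptions, not from the slides:

```python
# Access control semiring: public < classified < secret < top_secret,
# with a level above everything ("nobody") playing the role of 0.
LEVELS = ["public", "classified", "secret", "top_secret", "nobody"]
rank = {level: i for i, level in enumerate(LEVELS)}

def ac_times(a, b):
    """Joint use: a join is as secret as its most secret input (max)."""
    return a if rank[a] >= rank[b] else b

def ac_plus(a, b):
    """Alternative use: a union is as public as its most public copy (min)."""
    return a if rank[a] <= rank[b] else b

# Cost semiring: joint use sums the prices, alternative use picks
# the cheaper derivation; infinity plays the role of 0, cost 0 is the 1.
INF = float("inf")

def cost_times(a, b):
    return a + b

def cost_plus(a, b):
    return min(a, b)
```

So joining a public tuple with a secret one yields a secret result, while a union containing a public copy stays public; and a join of items costing 3 and 4 costs 7, while the cheaper of two alternative derivations wins.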
But they said: look, if you compute that complicated query and get the answer de, then annotate it with the set of tuples you ever used to get that result. Remember, de had a complicated expression that involved r and s. So in this early model of lineage — they called it lineage — the lineage was just the set {r, s}. So what does join mean here? This tuple was obtained by touching several tuples, and that tuple was obtained by touching several tuples. Now, when I join them, which tuples did I touch? The union. So join is just union. What about true union, the alternative? I could use either this tuple or that tuple — so which of the base tuples did I touch? Also the union. I don't know why there is a star on the slide. Is there a wrinkle here? No, this is a semiring. In this semiring, the plus and the times are the same operation. They lose the distinction between joint use and alternative use, and the neutral elements are the same. OK, so this was one model. Then, actually not long after that, Peter Buneman and others said: let's keep track of more detailed information. They said: look, for this tuple de, don't tell me just that you touched r and s; tell me, what are the sets of witnesses that produced de? And if you remember, you could produce it by just taking {r}, or by taking {r, s}. So for them — and this was called why-provenance — the why-provenance was a set of sets. Now, in this set of sets, what is joint use? Say for one tuple I can use {r} or {r, s}, and I want to join it with another tuple whose witnesses are {r, t} or {s, u}. Then what are the alternatives for the joined tuple? What sets of witnesses can contribute to it? Well, for the tuple on the left, you could start, for instance, from {r}; for the tuple on the right, from {r, t}.
You union these two, and you get {r, t}. When you move on to {r} and {s, u}, that's another combination; you union them and get {r, s, u}, and so on. You take all four unions of sets. This operation doesn't have a good name, but this symbol is good enough for it. It takes two sets of sets and returns the set of all pairwise unions — the union is pushed inside. The neutral element for this join is the set consisting of the empty set, and the neutral element for the provenance union is the empty set. Different things. OK, so let me move on. Then — actually in the same paper — they also looked at a refinement of this provenance. They said: look, if for this tuple you had the witness {r}, and then a separate witness {r, s}, don't consider {r, s}, because {r} by itself is sufficient to construct this tuple. So this was the set of minimal witnesses, which corresponds to another semiring — we shouldn't go over it; it's actually the semiring of positive Boolean expressions. And then there were others: in Trio, the group of Jennifer Widom had yet another refinement of a semiring. And if you're confused by now, I'll show you a nice picture where everything fits together. And the nice picture is right here. These are all the semirings that make sense, let me put it this way. At the top are the polynomial expressions: N[X] means polynomials with natural-number coefficients over some variables, and X means not just one variable but a set of variables. So what kind of annotations are you allowed to have if this is your semiring? Well, you can have stuff like 2r²s + 3rs³, in which r and s are the variables — they represent pointers to tuples in the input — and you can have arbitrary annotations built as polynomials over those variables. OK, so let's see what the other pieces of information are. Actually, this is much better. So here they are.
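The why-provenance operations from a moment ago — all pairwise unions for joint use, plain union of the collections for alternative use — can be sketched directly on Python sets. This is my own illustration with made-up function names, using the witness sets from the example:

```python
# Why-provenance annotations are sets of witness sets (frozensets,
# so they can live inside an outer set).

def why_times(A, B):
    """Joint use: the set of all pairwise unions of witness sets."""
    return {a | b for a in A for b in B}

def why_plus(A, B):
    """Alternative use: plain union of the two collections."""
    return A | B

# The example: the left tuple has witnesses {r} or {r,s};
# the right tuple has witnesses {r,t} or {s,u}.
left = {frozenset({"r"}), frozenset({"r", "s"})}
right = {frozenset({"r", "t"}), frozenset({"s", "u"})}
```

The four pairwise unions are {r,t}, {r,s,u}, {r,s,t}, and {r,s,u} again, which collapses to three distinct witness sets; and the neutral element of `why_times` is indeed the set containing the empty set.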
So at the top, we have just the polynomials. What you can do with these polynomials — and this is really cute — is drop the coefficients. A coefficient says: I can derive this using x²y twice, in two different ways. You can forget that, and this gives you something else: polynomials with Boolean coefficients. Or you can drop the exponents. You can say: why should I care that I used y twice? You retain just the fact that y is used, but keep track of how many alternatives there are. This is another semiring. You can drop both, and then you get the why-provenance — but this still keeps non-minimal sets of witnesses. If you collapse to minimal sets of witnesses — I don't know where that went on the slide — then you get the positive Boolean expressions. And if you collapse everything together, then you get Widom's lineage: just sets. OK, so that's the picture that I wanted to show you about data provenance. I have this slide, which I don't want to blow you away with. One of Tannen's students, TJ Green, studied query containment. We discussed query containment for SQL queries, and I hope you enjoyed that — it's kind of insider knowledge; not many people understand query containment well. It turns out that query containment is a very interesting theoretical problem once you change the setting. For queries with set semantics, it's what we discussed: it's equivalent to the existence of a homomorphism. But once you change the semantics — once you move to bag semantics — people know how to check equivalence; it's actually trivial: two queries are equivalent if and only if they are the same, they are isomorphic. But nobody knows how to check containment. Very bizarre. It's a major open problem: nobody knows how to check containment under bag semantics. But what if, instead of bag semantics, you use the semiring annotations? Well, then containment starts to differ depending on the semiring that you use.
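The hierarchy just described — polynomials at the top, then dropping coefficients, dropping exponents, dropping both for why-provenance, and flattening everything for lineage — can be sketched as mappings on a small polynomial representation. This is my own encoding, not from the slides: a polynomial is a dict from monomials (frozensets of variable–exponent pairs) to coefficients.

```python
from collections import Counter

# The example polynomial 2r²s + 3rs³ in this encoding.
poly = {
    frozenset({("r", 2), ("s", 1)}): 2,
    frozenset({("r", 1), ("s", 3)}): 3,
}

def drop_coefficients(p):
    """Forget how many alternative derivations each monomial has."""
    return {m: 1 for m in p}

def drop_exponents(p):
    """Forget how many times each tuple is used within one derivation."""
    out = Counter()
    for m, c in p.items():
        out[frozenset(v for v, _ in m)] += c
    return dict(out)

def why_provenance(p):
    """Drop both: just the sets of input tuples that witness the output."""
    return {frozenset(v for v, _ in m) for m in p}

def lineage(p):
    """Flatten everything into one set of touched tuples (Widom-style lineage)."""
    return set().union(*why_provenance(p)) if p else set()
```

Applied to 2r²s + 3rs³: dropping the exponents merges both monomials into rs with coefficient 5, why-provenance keeps only the witness set {r, s}, and the lineage is the flat set {r, s}.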
And what TJ Green did, while a student, was study this containment problem for all sorts of semirings. Just to give you a flavor of the amount of effort that goes into such a research project: he found that these containments are interrelated in a complicated way, depending on which semiring you pick, and this slide is a snapshot of the results in that paper. OK, so that's all I wanted to show you about data provenance. A simple, crisp abstraction: if you know what a ring is, you drop half of it and you get a semiring, and that turns out to be what's needed to keep track of where the data comes from. I think it's very nice. Any questions about provenance? Are there any systems out there that keep data provenance? Not in a generic fashion, as far as I know — though I'm pretty sure that many applications have their own implementations of provenance. You need to know where the data comes from, so probably you have a field: if data in table R is copied from data in table S, then you might have a field there which tells you where it was copied from. But these are ad hoc solutions; they don't extend to complicated queries. If you want a general solution, then you need semiring expressions. OK, so I'm actually going quite fast. For the rest of this lecture, I actually have even less information, but it's about a very hard problem, and I don't want to show you details that I know don't work. This has been studied much more intensively than provenance — it's a very old problem. It's about data security, and I will tell you what people are doing today about data security. The definition of data security is that you want to protect the data from unauthorized disclosure and modification. So in the strict definition, it's about both things: disclosure and modification. So what does it mean to protect the data from disclosure?
So for example, on Facebook, you want to protect your pictures from people you didn't authorize to see them. That's an example of protection. Or if you work for a company that has customers, and you have access to the database of the customers, and now you want to give it to a partner company for something, then you might want to protect it: there is certain information about your customers that you don't want to give out to your partner company. So that's the kind of protection that we care about. Modification is more subtle, and it's actually less well studied. The modification issue goes like this. Suppose one day you go to Facebook and you check your address, and instead of being in Seattle, now it is in Honolulu. Did you modify it? No. So the question is, can Facebook prove to you that you were the one who modified it? That's called integrity. It's actually much more interesting if you hand out the data to somebody else. So suppose you prepare an important data set, and now you distribute it. You give it to your friends, they give it to their friends, and eventually it reaches me. And I trust you, but I don't trust all these intermediate people who got their hands on it. Now I would like to check that the data indeed comes from you and hasn't been modified by the intermediaries. So this is integrity. It's much less studied. People have studied confidentiality much more: how do you hide data? So let me skip this. I want to show you an attack. It is famous in this research community; it's not as famous in the real world. I think the AOL attack is much more famous. Did you hear about the AOL attack? The anonymization one? Yeah, maybe I should mention that one too. But let me go through this attack first to give you a sense of it. This was discovered by Latanya Sweeney when she was a graduate student at MIT. Now she is a professor at CMU.
So in Massachusetts, where MIT is, they have, for all state employees, this Group Insurance Commission, GIC, which buys health insurance for the state employees. But it's a public institution, so they have to publish anonymized data about all these employees. This was published back in 1997, 1998; I think it was already on the web. It was anonymized because the names, social security numbers, and addresses were removed from the data. All that was kept was zip, date of birth, sex, and then the medical data that had to be published by law. So this is private, because all the names had been removed. Now, what Latanya Sweeney did: back then the voter registration information was not yet online, because it belonged to a small county. But she went and bought, for $20, a floppy disk with all the voter registration information for Cambridge, Massachusetts. Are there people from the east coast, from around Boston, here? Anybody from Boston? Not many, but you probably know: Cambridge, Massachusetts is a very rich county. OK, so now she had voter information. And look, you can get, even today, voter information with a name, party, and address. This is public data; it has to be public. So guess what she did? She joined them. Now, is this an exact join like we teach in databases? No, it's like shooting in the dark. But you don't need a perfect join; you just need to find some interesting information. And this is what she found. Back then, the governor of the state of Massachusetts was William Weld. He lived in this wealthy county, and he was a voter, right there: William Weld. Out of the people in that county, six had the same date of birth as William Weld, which is not very surprising, right? There are about 365 days in a year; times six, that's around two thousand.
So probably there were about 2,000 voters in that database. Half of them, not surprisingly, were men; statistics at work. And how many do you think lived in that zip code as well? Only one. He was the only person in that zip code. So for governor William Weld, the combination of date of birth, sex, and zip uniquely identified him, which means that all the entries in the GIC database that had this combination of date of birth, sex, and zip uniquely identified governor Weld, and she could get at all his records. And I need to tell you, unfortunately, she did not give us any spicy details in the research paper where she published this. She only said that she could get access to his health records. So I find this a very surprising result, right? Because if you think about system security, it's all about viruses and attacks and denial of service. The systems are supposed to work perfectly, but they always have these bugs, and it's a never-ending cycle of patching and finding new flaws and repatching; it seems to be a never-ending story. And we all blame Microsoft and Apple that they didn't anticipate these flaws. But here, there were no flaws. Everything worked as designed. Just two pieces of data, and they were both supposed to be published. So that's exactly the conundrum in data privacy: how do we protect against such leakages? The other famous example is more recent, and I need to discuss it. This was the AOL query logs. All the search engines collect all the queries. Every click is collected and stored in their database; they have these huge databases of logs. And at AOL, they also have a user ID. I'm not sure how Google and Bing do it, but I'm pretty sure they have ways to identify users by IP address and all sorts of sophisticated techniques to track users across multiple log entries. But for AOL, it was trivial, because everybody was logged in. So AOL had these entries where there was a user ID and a search term or search terms.
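The attack described above is easy to sketch in code. Here is a toy version in Python with entirely fabricated records (the names, zips, birth dates, and diagnoses are placeholders I invented, not the real data): join a "de-identified" table with a public table on the quasi-identifiers zip, date of birth, and sex, and keep the matches that are unique.

```python
# Fabricated toy data; no real people or records.
medical = [  # published "anonymized": names stripped, quasi-identifiers kept
    {"zip": "02138", "dob": "1950-01-01", "sex": "M", "diagnosis": "flu"},
    {"zip": "02139", "dob": "1962-05-15", "sex": "F", "diagnosis": "ulcer"},
]
voters = [  # public voter roll: names attached to the same quasi-identifiers
    {"name": "A. Example", "zip": "02138", "dob": "1950-01-01", "sex": "M"},
    {"name": "B. Example", "zip": "02139", "dob": "1980-03-02", "sex": "F"},
]

def qid(r):
    """The quasi-identifier: the attributes both tables share."""
    return (r["zip"], r["dob"], r["sex"])

names_by_qid = {}
for v in voters:
    names_by_qid.setdefault(qid(v), []).append(v["name"])

# Re-identify every medical record whose quasi-identifier matches exactly
# one voter. No flaw anywhere: just two legitimately published data sets.
reidentified = [
    (names_by_qid[qid(m)][0], m["diagnosis"])
    for m in medical
    if len(names_by_qid.get(qid(m), [])) == 1
]
```

Here the first medical record matches exactly one voter, so that person's diagnosis is exposed; the second record has no unique match and stays hidden.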
And apparently, it's very funny: when you look at these logs, people search for all sorts of things. Many people search for their social security number; they want to see if their social security number is somewhere on the web. It's actually a pretty clever search. But if you think about how someone can use such a log: they know exactly who asked that question, because it's recorded who asked it. Sometimes people ask very private questions about diseases, or about particular friends who are identified by their names or addresses. They look up addresses. OK, so what AOL did was replace the user IDs with random numbers. But they kept the linkage between the entries of the same user. So if you were user 55, AOL would randomize this; now you're user 99. But all your log entries are identified by 99. That's a big mistake, because now an attacker can take these 99 records, and if in one of them you made a mistake and identified yourself, then all the others are traceable back to you. So these are query logs which AOL made public for research purposes, a very noble gesture: making data available for researchers to do interesting research. But they thought it sufficed to anonymize the user IDs. And this was attacked. It was broken, not by researchers, but by two New York Times reporters. They spent time, they looked at these query logs, and they found some searches that uniquely identified an old lady living, I don't remember, somewhere in the south, Georgia. Does anyone know where that lady lived? She had some problems with her cats; her cat misbehaved. She searched for some friends, some neighbors on the web. So by correlating five or six such queries, the New York Times reporters were able to uniquely identify that woman who was doing AOL searches. It was a huge scandal. Apparently one person in the leadership of AOL lost their job; I think it was the CEO or the chief information officer.
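The AOL flaw, that consistent pseudonyms keep all of a user's entries linkable, can be sketched in a few lines. This is a toy reconstruction with invented queries and IDs, not the actual log format:

```python
import random

log = [  # (user_id, query): invented examples
    (55, "best cat food"),
    (55, "plumber 123 elm street springfield"),  # effectively self-identifying
    (72, "weather seattle"),
]

# "Anonymize": replace each user ID with a random pseudonym, but consistently,
# so all entries of one user share the same pseudonym.
rng = random.Random(0)
pseudonym = {}
published = []
for uid, query in log:
    if uid not in pseudonym:
        pseudonym[uid] = rng.randrange(10**6)
    published.append((pseudonym[uid], query))

# The attack: one revealing query deanonymizes the pseudonym, and with it
# every other query that user ever made.
target = next(p for p, q in published if "elm street" in q)
leaked = [q for p, q in published if p == target]
```

One self-identifying search is enough to pull in the whole search history behind that pseudonym; that is exactly what the reporters exploited.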
A very top person at AOL had to resign over this scandal. And this is now cited everywhere as a major privacy breach. So what do we do with this? We want to publish data. Even if you don't want to publish it, it might still be available somehow. So how do we deal with this? I'm going to show you two approaches that people are considering today. One is k-anonymity. This was introduced by Latanya Sweeney as a solution for the attack that she discovered. It's kind of useful, and it's easy to see it work in practice, but it's not private. The other one, which gained a lot of traction recently, is a theoretical definition that is elegant and simple, and I'm going to spend some time trying to describe that definition to you, although I didn't prepare very good slides. It was introduced by Cynthia Dwork, a very strong complexity theoretician; she works at Microsoft Research in Silicon Valley. It very definitely is private, it guarantees privacy, but it's not useful in practice, for reasons I'll describe. So let me start with k-anonymity. Here is an example of a database that is not private, though it's not a great choice for my example. Think about the zip, date of birth, and gender combination that we had. Those were three attributes. And the problem with the GIC data was that these three uniquely identified the record of governor Weld. GIC made the mistake of releasing a piece of data in which these three attributes, which somebody could get from a different place, uniquely identified a person, an entry, I should say. That is the issue I'm addressing here. But I use first and last name, which is kind of silly, because normally you would strip those off. Still, let's just imagine four attributes. And the problem is that all these tuples are unique. So in k-anonymization, what you do is suppress some values, or generalize them, such that every tuple becomes equal to k minus 1 other tuples.
Now what is k? k is a parameter that you need to choose; nobody tells you what it should be. People who took k-anonymity very seriously recommend something like k equals 10, for example. So imagine trying to hide every tuple in a set of 10 tuples, so that nobody can distinguish between these 10 tuples. Here's how it would work. I'm going to drop the first names of, what were they, Harry and Beatrice, and replace them with a star; I'll tell you what letter the name starts with, but not the rest, and I'm not going to tell you the last name. I'm going to generalize the age into an interval. And I'm also going to generalize the race. So now this is 2-anonymous, k equals 2 here, because every tuple occurs twice. So if an attacker uses some other data set where he knows the owners of the data, he knows the people there, and tries to link it to this k-anonymous data, then the attacker will be confused up to a degree of k, because every tuple is equal to k minus 1 other tuples. OK, so in the database community, where most people are engineers, they quickly embraced this idea, and they worked a lot on implementing it very efficiently. But the problem is that it is not private. There are so many attacks, so many ways in which you can exploit the information that is still there to infer very private information that's hidden in this data. I don't have an attack here on this example, but I suppose you can see the issue. The attacker can do a lot; the attacker can infer a lot even after a k-anonymization. And part of the reason is that there is no mathematical guarantee of what exactly k-anonymization hides. It's just a syntactic criterion: you count the number of tuples, and if they come in groups of k identical ones, you declare victory. But nothing tells you what information can leak to the attacker. OK, so here comes differential privacy, introduced by Cynthia Dwork. This is a mathematically rigorous definition, and I'm going to spend a little bit of time on it.
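Because the criterion is purely syntactic, it is easy to check mechanically. A minimal sketch (the helper function and the toy rows are mine, not a standard API):

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """True iff every combination of quasi-identifier values occurs >= k times."""
    counts = Counter(tuple(r[a] for a in quasi_ids) for r in rows)
    return all(c >= k for c in counts.values())

# Toy table after suppression/generalization: names starred out,
# ages generalized to intervals, zips to prefixes.
rows = [
    {"first": "*", "age": "[50-60]", "zip": "021**", "disease": "flu"},
    {"first": "*", "age": "[50-60]", "zip": "021**", "disease": "ulcer"},
    {"first": "*", "age": "[20-30]", "zip": "981**", "disease": "cold"},
    {"first": "*", "age": "[20-30]", "zip": "981**", "disease": "flu"},
]
# Each (age, zip) combination occurs twice, so this table is 2-anonymous on
# those attributes. Yet it still reveals, for example, that both [50-60]/021**
# people have some disease, which is why passing the check is no guarantee.
```

Note that the check says nothing about what an attacker can infer from the sensitive column; it only counts tuples, which is exactly the weakness discussed above.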
I'm going to finish this lecture around 7:55 or so, in about 15 minutes. But I do want to give you a sense of differential privacy, because it's a very good definition, and I think it's a good opportunity for you to see it. Differential privacy says this. I have my database, and I'm going to create an algorithm. The algorithm takes the database and returns an answer. Think of this answer as being a real number, maybe several real numbers. What does that mean? Maybe the algorithm computes some aggregate with a group by; if it's an aggregate with a group by, then I get multiple real numbers, one per group. They're actually integers. Maybe it's just a count of some records, and then it's just one number. But the problem is, even if I return only this aggregate value, there could still be privacy breaches. And Cynthia Dwork defined, in a mathematically rigorous way, what property this algorithm must have in order to be private. Here it is. The algorithm is, first of all, not deterministic; it's randomized. You're going to add some noise to its output. So instead of returning a single number, it returns a probability distribution over numbers. Think about this: you run the query, and you might get 5, you might get 5.5, you might get 7, you might get 13, with different probabilities. If the real answer is 7, then you get more probability mass around 7. Here is differential privacy. It says: suppose you add one more tuple to the database. So you're co-opting another user; you're telling that other user, here is what will happen if you allow me to use your data. Then you run the same algorithm. The user is not getting a single number out of this algorithm; it's a probability distribution. The algorithm is differentially private.
If, for any two such databases that differ in one tuple, the probability distributions differ only a tiny bit; they differ by an amount controlled by epsilon. So let me write this rigorously, because I didn't write it on the slide. The algorithm A is epsilon-differentially private if, for every output value x, the ratio of the probability that A on D returns x to the probability that A on D', which is D with one tuple inserted, returns x is almost one: e^(-epsilon) <= Pr[A(D) = x] / Pr[A(D') = x] <= e^(epsilon). So the ratio has to be between e to the minus epsilon and e to the epsilon, and this has to hold for every x. It's hard to digest, but it actually says something quite simple. Suppose I have a database of patients, and I want to allow some users, some statisticians, to ask queries over this database of patients. Counting queries: they could count how many patients have the flu and live in a particular zip code, how many patients have a stomachache and live in that zip code, or are male, or whatever. They run these statistical queries. So I'm not going to give out the entire data. No, I'm just going to accept queries, compute them, add some noise (how much noise is not clear yet), and then return answers. And now I'm going to go to a new patient and say, please allow me to use your data in this collection over which some users are asking these aggregate queries. Here is the promise I'm giving you: if the query returns the value 17 with some probability today, then after you give me your data, it will return the value 17 with almost the same probability. Now, the same query on the same data could also return 18 with some probability, because it's not a deterministic algorithm; it's a randomized algorithm. But that's OK: if you give me your data, the value 18 will be returned with almost the same probability. And the difference between these probabilities is actually not a difference but a ratio.
It's again controlled by a parameter, by epsilon, and it is between e to the minus epsilon and e to the epsilon. So what happens if I choose epsilon equal to 0? What is a differentially private algorithm for epsilon equal to 0? Those statisticians who are served by the algorithm, what will they observe about the answers returned on the data set? What are these bounds on the left and right? They're both 1. So what does this algorithm do? It might return 17 with some probability on this database. Now if you add a tuple, just a single tuple, it will return 17 with what probability? Exactly the same. Now if I add 1,000 more tuples, what will it do? Exactly the same. So you can get this perfect privacy only if your algorithm doesn't even look at the data; it completely ignores it. So this gives you a sense of what differential privacy tries to achieve. You take epsilon greater than 0, but preferably close to 0. It says that as you add or remove one tuple at a time, the users of this database will not observe too much change in the statistics. But at the same time, it can be very useful for statisticians. If they are looking for an epidemic, a flu, and they ask how many patients have the flu per zip code, group by zip code, then they can narrow it down. I mean, they will still get numbers with some noise. But if these numbers are in the range of thousands, then a tiny amount of noise will not affect the results too much, and it's still differentially private. So that is the appeal. The major difficulty with differential privacy, and instead of showing you the slide, let me explain it right here, is in the setting: it only works for one query. Now, if instead of a single number you return, say, 200 numbers, because you do a group by zip code, it still applies, but then it counts as 200 queries. What you cannot allow is for users to ask queries forever.
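A standard way to realize this for counting queries, which the lecture only alludes to with "add some noise," is the Laplace mechanism: a count has sensitivity 1 (adding one tuple changes it by at most 1), and adding Laplace noise with scale 1/epsilon makes the count epsilon-differentially private. A self-contained sketch, with function names and toy data of my own choosing:

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) by inverse-CDF from a uniform in (-0.5, 0.5)."""
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def private_count(rows, predicate, epsilon, rng):
    """epsilon-DP count: true count plus Laplace noise of scale 1/epsilon."""
    true_count = sum(1 for r in rows if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Toy patient table: 1000 flu cases, 200 others (all fabricated).
patients = [{"flu": True}] * 1000 + [{"flu": False}] * 200
noisy = private_count(patients, lambda r: r["flu"], epsilon=0.1,
                      rng=random.Random(0))
# The true answer is 1000; typical noise is on the order of 1/epsilon = 10,
# negligible at this scale, which is exactly the point made above.
```

This matches the intuition in the text: for counts in the thousands, noise on the order of 1/epsilon barely disturbs the statistician's answer, yet it is enough to mask the presence or absence of any single patient.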
And there is actually a theorem which says that if you allow users to ask queries forever, then it's one or the other: either an attacker will eventually get to some private information in your database, or you must inject so much noise that nobody can use these queries for anything useful. So this is a major shortcoming of differential privacy: it needs to impose a limitation on the number of queries. So they figured out how to do this in a smooth fashion. You get a privacy budget. If you're a user, you get a privacy budget, and then you can use it. You can say, I want a better answer to this query, and then you use more of your budget; then, I want a more approximate answer to the next query, and you use a tiny bit of the remaining budget. But once your budget expires, they don't know what to do. Then you're at a dead end, because the theory says that from then on the data could be compromised, and they don't know what to do about that. OK, so just one last thought about privacy, because it's something you hear a lot about. Let me tell you that most people, including myself and many of my colleagues in the community, confuse privacy with confidentiality. What I showed you so far was confidentiality: the desire to hide private information. But in reality, privacy is something much more difficult to capture and to manage, which is the ability of users to control how their private data is being used. So it's not about hiding; it's about user control over their data. And currently, people don't know how to handle this in a generic way, how to address privacy. OK, any questions? I know I went very quickly over this, but I wanted to give you a flavor of privacy. Any questions about privacy or confidentiality? Because, well, I got to my very last slide. So what can I tell you at the end of a nice quarter? First of all, I really enjoyed talking in front of you, both live and through the wonders of technology.
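The budget mechanism described above can be sketched as a small bookkeeping class. This is entirely my own toy design, not any real system's API: each query spends part of a total epsilon, and once the total is gone the server must refuse to answer.

```python
class PrivacyBudget:
    """Track the total epsilon a user is allowed to spend across queries."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        """Reserve epsilon for one query; refuse if the budget is exhausted."""
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon
        return epsilon  # the caller runs its noisy query at this epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.5)  # an accurate answer: large epsilon, little noise
budget.spend(0.4)  # a rougher answer: smaller epsilon, more noise
# A further spend(0.2) would raise: only about 0.1 of the budget remains,
# and after that the theory offers no safe way to keep answering.
```

The refusal at the end is the "dead end" the lecture describes: the accounting is simple, but once the budget is spent there is nothing principled left to do.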
So I hope that you got lots of lessons out of this course. Data management is so rich; it has a variety of topics. There is the old, traditional topic of data modeling and conceptual design that we need to know, and the ACID properties. Then there is a lot of emphasis on performance: how do we take the logical design and map it into an efficient query plan? And today, performance can mean performance in parallel systems, performance on multiprocessors, performance in many settings. But it's not just that. There are all sorts of other aspects of data management that are not related to performance, like provenance. People want to keep track of their data, and instead of hacking some ad hoc provenance approach, there is this beautiful theory of semirings that allows you to think about provenance. And data privacy. The problem with data privacy is that, well, now we have a nice definition, but we don't have a good solution to data privacy in general. OK, and on this note, I think I'll stop here. I know some of you are busy with homework 7, which is due in a few hours. And after that, you're waiting for me for the final grade, which I promised myself I'm going to send to you as soon as I can grade the last remaining finals, so probably on Friday. I hope you'll get your final grade on Friday. Good. So it was a pleasure teaching you this quarter. Have a great holiday, and a good experience in the rest of the program.