Hi. My name is Prashant, and I am going to tell you about some of our recent work formalizing data deletion in general systems. This is joint work with Sanjam Garg and Shafi Goldwasser.

The broad context for this work is the ubiquity of data collection in today's world. Various agencies collect various kinds of data about each of us all the time, and this data ends up affecting our lives in numerous ways. Given this growing importance of data, there is a growing need for better regulation of how all this data is used. Indeed, this need has been recognized and addressed, to different extents, by legislation across the world. In the US, for instance, data protection laws like HIPAA for medical data and FERPA for educational records have been around for a while. Many of these laws, however, deal with specific kinds of data that are considered more sensitive. More recently, there has been regulation that addresses the need to govern data usage more broadly. Good examples of this are the GDPR in the European Union and the CCPA, which came into effect in California just a few months ago. These laws start with the premise that protection of an individual's data is a fundamental right, and consequently they are much broader in the scope of what kinds of data they cover. Both of these laws offer individuals a number of rights to control how and whether their data is used, and by whom.

The motivation behind our work comes from one of these rights, which has come to be known as the right to be forgotten. The right to be forgotten is granted to individuals, in one form or another, by both the GDPR and the CCPA. Very roughly, it says that if some company holds some data about you, then you can request that they simply delete or erase this data, and with a few exceptions, the company is required to honor this request. Our interest is in what this last statement means. That is, what does it mean for a company to delete your data? It seems straightforward at first glance: they have your data, they delete it. But I'd like to show you a couple of simple scenarios where things get more complicated.

So let's see. Suppose this is your situation: you run a research agency that happens to collect data about people's heights for some reason. You have your computer where you are going to store this data, and this is what your storage looks like. Whenever someone tells you their height, you write it down against their name. Say Alice tells you her height; you write it down, and then a couple of other conveniently named people tell you their heights too. Once you have collected everything, you store it in an array, let's say, and for ease of access, you sort the array by the name of the person.

Now, after all this, Alice comes along and says: I want you to delete my data, delete whatever you have about me, delete my height. And you say, okay, fine. You go to your storage, overwrite whatever part of memory you had stored her data in with zeroes or X's or something, and tell her: yes, I have processed your request, I have deleted your data. Is this acceptable? It is true that the agency no longer knows Alice's height, but the data it has left still reveals something: that at some point, someone whose name starts with A-L-I-C used to be in the database, and this person is no longer there.
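Just to make this concrete, here is a tiny sketch of what the agency is doing; the names and numbers are made up for illustration. In the picture from the talk the leftover trace was part of Alice's name; in this sketch the trace is the blanked-out slot itself, sitting where it does in a name-sorted array.

```python
# Toy sketch of "deletion by crossing out" (illustrative names/numbers).
records = [("alice", 65), ("bob", 70), ("charlie", 68)]  # sorted by name

def cross_out(records, name):
    """'Delete' by overwriting the entry in place, as the agency did."""
    for i, (n, _) in enumerate(records):
        if n == name:
            records[i] = ("XXXXX", 0)  # the slot itself remains

cross_out(records, "alice")
print(records)
# [('XXXXX', 0), ('bob', 70), ('charlie', 68)]
# Alice's height is gone, but the blanked slot before 'bob' in a
# name-sorted array still reveals that someone used to be there.
```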
So, is this behavior acceptable? Well, the thing is, it may or may not be. It depends on the situation, on what the law says in that situation, on the applicable norms, on a lot of things. What I am more interested in is what we mean by this behavior. Because irrespective of whether this behavior is allowed or not, we need to be precise about what we are talking about. More specifically: is there some property we can precisely state such that any system with that property does not exhibit this behavior, for whatever intuitive idea of "this behavior" we have here?

Let me show you another example. Consider a slightly more complicated situation, where the research agency employs a third-party data-processing service: whatever data the agency receives, it forwards to the service for processing. And when the agency receives a deletion request, it deletes the copy of the data that it has, and that is all it does. Now, is this behavior acceptable? Again, it depends on the norms and the laws applicable in the specific situation. And again, I am more interested in what we mean by this behavior: the fact that Alice's data still exists somewhere in the world because the agency has revealed it to someone else. Can we capture this somehow? Once again, irrespective of whether this is allowed or not, when we ask whether it is allowed, we need to be precise about what we mean. So you ask the question: is there some precisely stated property that captures such behavior?

This is what we are concerned with in our work: can we come up with meaningful classifications of systems based on their behavior in processing data deletion? We are going to need such classifications and precise definitions when we make decisions about whether a given system is okay in a given situation; to make that decision, you need to know precisely what properties the system satisfies in this respect. What we do is give a framework to model the behavior of systems in this respect, along with the definition of a property that captures certain aspects of the things we have talked about so far. So let me tell you about this now.

Here is our setting. There is an abstract data collector, which in the examples we saw was the research agency. It has its memory where it stores things, and it talks to a bunch of other parties: users, as we will call them, and maybe some service providers and so on. It receives data from the users, and the users may sometimes ask for that data to be deleted. To abstract things out a little, we will assume that all the data the users send is sent in messages of the form (id, content); each message has an ID, and when a user wants a message to be deleted, it says so by referring to that ID: delete the message with this ID. And what we want to do is compare what happens at the end of all this in this world to what happens in a different world where the user never sent this data in the first place.
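To fix ideas before we get to the definition, here is a minimal, hypothetical data collector with exactly this message interface; the class and method names are mine, just for illustration, not from the paper.

```python
# A minimal data collector: messages are (id, content) pairs, and
# deletion requests refer to message IDs. Purely illustrative.
class DataCollector:
    def __init__(self):
        self.memory = {}                  # id -> content

    def receive(self, msg_id, content):  # a user sends (id, content)
        self.memory[msg_id] = content

    def delete(self, msg_id):            # "delete the message with this ID"
        self.memory.pop(msg_id, None)

# Real world: Alice's message is sent and later deleted.
real = DataCollector()
real.receive("alice-1", 65)
real.delete("alice-1")

# Ideal world: the message is never sent in the first place.
ideal = DataCollector()

print(real.memory == ideal.memory)  # True for this particular toy
```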
In some sense, the definition we will come up with will say something like this: the state of everyone else in the world on the left, the knowledge that everyone else has after the data collector has processed the deletion request, looks like the state of the world on the right-hand side. And since on the right-hand side no one else knows anything about the user's data, on the left-hand side this has to be the case as well. So we want to compare these two worlds. We will call them the real world, in which the deletion happens, and the ideal world, in which the data was never even sent.

What we are going to do is model the various parties involved. To start with, we replace all the other users and services in the system with a single adversarial machine that we call the environment: a Turing machine, or an algorithm, that communicates with the data collector. Then we replace the user with a Turing machine, or an algorithm, as well. And to make our definition broader, we allow communication between the environment and the user. We place the following two restrictions on these machines. The first is that the user asks for exactly all of its messages to be deleted: it sends messages with some IDs, maybe more than one, but by the end of the execution, it has asked for exactly the messages it sent to be deleted. The second is that the environment and the user are both polynomial-time algorithms in some security parameter that we will not dwell on too much. And we set up the same thing in the ideal world, with the same environment and the same user.

So here is how our definition will work. Someone comes to us with a data collector and asks whether it satisfies our property. What we do is consider a hypothetical interaction between the data collector, this adversarial environment, and an arbitrary user satisfying the constraints above. We let this execution run: the data collector runs its algorithm, the environment sends whatever it wants to send, and the user communicates whatever it wants to communicate, asking for deletions in the course of the execution. The execution proceeds, and at some point it terminates. At this point, we want to compare the state of things in the real world with the state of things in the ideal world. The ideal world is also a hypothetical execution with the same environment and the same user, the only difference being that the user's messages are never sent to the data collector. So there is an execution in the real world and an execution in the ideal world, and at the end of these executions, we compare things between the two worlds.

What we compare is, first, the memory of the data collector: the state of its storage. And second, the view of the environment: all the communication that has gone on between the data collector and the environment. We ask that the joint distribution of these two things is similar in both the real world and the ideal world.
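As a toy illustration of this comparison, using the toy DataCollector from before: keep in mind that the actual definition quantifies over all efficient environments and users, lets them talk to each other, and compares distributions; everything here is deterministic and stripped down.

```python
# Toy real/ideal experiment (the real definition compares *distributions*
# of the pair (collector memory, environment view); this is only a sketch).

def real_execution(make_collector, env_msgs, user_msgs):
    c = make_collector()
    view = []                                # the environment's view:
    for mid, content in env_msgs:            # its communication with c
        view.append(c.receive(mid, content))
    for mid, content in user_msgs:
        c.receive(mid, content)
    for mid, _ in user_msgs:                 # the user asks for exactly
        c.delete(mid)                        # all of its messages to go
    return c.memory, view

def ideal_execution(make_collector, env_msgs):
    c = make_collector()
    view = [c.receive(mid, content) for mid, content in env_msgs]
    return c.memory, view

real_out = real_execution(DataCollector, [("bob-1", 70)], [("alice-1", 65)])
ideal_out = ideal_execution(DataCollector, [("bob-1", 70)])
print(real_out == ideal_out)  # True here: nothing distinguishes the worlds
```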
So we say that a data collector is epsilon-deletion-compliant if, for all environments and users that satisfy the constraints specified earlier, the memory of the data collector together with the view of the environment at the end of the real execution is distributed the same, up to epsilon, as the corresponding things at the end of the ideal execution. This is our definition. By default, we take epsilon to be negligible in the security parameter, but the definition also makes sense for larger epsilon, and sometimes you need larger epsilon.

Let us see what it says about the examples we saw earlier. The first example is the one where the research agency merely crosses out Alice's data when she asks for deletion. Now, is this deletion compliant? Well, what does the ideal world look like in this case? Here, Alice never communicates anything, and so the memory at the end of the ideal execution has no entry for Alice: there are no X's there. So the memory of the data collector itself differs between the real and the ideal world, and this is clearly not deletion compliant.

So then you ask: is there a way to operate in this manner without breaking deletion compliance? It turns out that there has been a line of work on data structures that have exactly the property that if you insert something and then delete the same thing, it looks as though you never inserted it at all. These are called history-independent data structures, or rather, history-independent implementations of data structures: implementations with the property that the physical contents of your memory depend only on the logical contents of the abstract data structure. So if the agency were to maintain this mapping between names and heights using a history-independent dictionary, then Alice inserting her data and then deleting it would look, at the level of memory, exactly the same as Alice never having inserted anything at all. And this immediately satisfies deletion compliance; in fact, it is perfectly deletion compliant. (I will show a small sketch of this idea in a moment.)

Okay, what about the next example, where the agency had a third-party service and was forwarding things to it? Let us make this somewhat strong and say that the agency maintains a history-independent dictionary on its side: it is not just crossing things out, it is doing things properly on its end. Is this deletion compliant? It is not hard to see that it actually is not. The environment here consists of everyone else: the set of all other users together with the third-party service is a valid environment, and to this environment the research agency is revealing Alice's data. In particular, the corresponding entry in the third-party service's memory would not have been there in the ideal world, so the environment now knows something that it would not have known in the ideal world. In fact, it does not even matter whether the entry remains in the data processor's memory. Suppose you are in an even better situation, where the data-processing service also maintains history independence, and the research agency further passes on the requests to delete to the third-party service. Even with all of this happening, this is not deletion compliant.
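First, the promised sketch of the history-independence idea. This is a toy canonical-representation dictionary I made up for illustration, not a construction from the literature; real history-independent implementations also have to control the actual physical layout, probe order, and so on.

```python
# Toy history-independent dictionary: the representation is a canonical
# function of the logical contents alone (sorted entries, no tombstones).
class HIDict:
    def __init__(self):
        self.entries = []          # always kept sorted by key

    def insert(self, key, value):
        others = [(k, v) for (k, v) in self.entries if k != key]
        self.entries = sorted(others + [(key, value)])

    def delete(self, key):
        self.entries = [(k, v) for (k, v) in self.entries if k != key]

d1 = HIDict()
d1.insert("bob", 70)
d1.insert("alice", 65)   # Alice inserts herself...
d1.delete("alice")       # ...and then deletes herself.

d2 = HIDict()
d2.insert("bob", 70)     # Alice never inserted anything at all.

print(d1.entries == d2.entries)  # True: the two memories are identical
```

Given that, why is the third-party example still not compliant, even with history independence running everywhere?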
Because our definition of deletion compliance, as strong as it is, excludes any data collector that shares information about one party with another. The mere fact that the agency communicated Alice's data to the data-processing service at any point in time immediately makes it non-compliant. Now, this is perhaps what you want in certain circumstances, for certain kinds of data. But there is also the possibility that you are fine with this: as long as the data-processing service handles things properly, and as long as the agency passes on the deletion requests correctly, perhaps you are fine with the research agency temporarily revealing Alice's data to the third-party service. And in fact, we have a second definition in our paper that covers this kind of situation, which we call conditional deletion compliance, under which this behavior is actually okay, as long as the agency maintains its own memory correctly and passes on the requests correctly. I am not going to go into this too much, but it is another classification that could be interesting, depending on the situation.

And finally, let us say that the agency has all the data it needs and wants to publish a paper. It has everyone's data, including Alice's; it computes some statistics and publishes them in some journal. The thing about the journal is that it is public and cannot be modified after publication: it is an actual book that someone has on their desk, and you cannot go and delete things from it. Now let us say Alice wants to delete her height. Is this going to be possible? Is an agency that does something like this going to be deletion compliant? In general, no, because here the environment is this set of parties, the agency revealed the statistics to the environment, and the statistics could, in general, reveal something about Alice. So in general you cannot say that it is deletion compliant. But it turns out that in certain cases, this is deletion compliant, even though the agency cannot do anything at all when it receives a deletion request: everything has already been published. If the statistics are such that they offer a kind of privacy that does not leak Alice's information, then you can actually show that even this kind of data collector is deletion compliant. The specific kind of privacy we are able to show this for is differential privacy. Statistics being differentially private roughly means that if only a few of the inputs change, say a few users change or remove their input, then the distribution of the statistics does not change by much. This essentially says that the statistics do not reveal much about Alice's data in particular, which is why you can show that such a data collector actually does satisfy deletion compliance. (I will give a small sketch of this below.)

And this is what I am going to say about our definition and what it implies. It provides a classification, and we have seen both data collectors that satisfy the definition and ones that do not. I hope this gives you some intuition about what the definition does, the kind of things we are trying to do, and the kind of things that remain to be done.
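Here is the differential-privacy sketch I promised: publishing a noisy average of heights via the Laplace mechanism. The function and parameter names are mine, and the sensitivity accounting is deliberately rough; the point is just that changing or removing one person's input barely changes the output distribution.

```python
import random

def noisy_average(heights, epsilon, lo=50, hi=80):
    """Rough sketch: clamp heights to [lo, hi], average, add Laplace noise
    scaled to the effect any single input can have on the average."""
    n = len(heights)
    clamped = [min(max(h, lo), hi) for h in heights]
    sensitivity = (hi - lo) / n   # max change from altering one input
    scale = sensitivity / epsilon
    # A Laplace(0, scale) sample: difference of two exponentials.
    noise = scale * (random.expovariate(1.0) - random.expovariate(1.0))
    return sum(clamped) / n + noise

print(noisy_average([65, 70, 68], epsilon=1.0))
# The published number is about the same whether or not Alice's 65 is in
# the list, so it reveals very little about her height in particular.
```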
And now I would like to talk about some relationships between our definitions and other work that has been going on for more specific kinds of data collectors. Recently, there has been a lot of work concerned with deleting data from machine learning models: you have some training data set, you train your model on it, and then someone wants to remove themselves from the data set, and you want to remove their influence on the learned model. Roughly, many of these deletion algorithms for learning work with definitions of the following form: learning from a data set that does not contain some element i should look similar to learning from the whole data set and then deleting the i-th element from the learned model. The challenge is in doing this efficiently: you can always relearn from the smaller data set, but you want to do better than that, and a number of papers do. This is rather like history independence for ML models: it does not matter whether you were there and then removed, or never there to begin with. Consequently, you can also use this to get deletion compliance, for reasons similar to how we were able to use history independence and differential privacy. You can have a data collector that maintains data and a machine learning model and, upon request, deletes things from the model. (I will sketch this condition at the end of this part.)

Summing up, we discussed the need to classify deletion behavior in data collectors, so that we can be precise about what we mean when we say different things about deletion, and we talked about our definition, deletion compliance. There are a few things we learned from this, like the necessity of handling memory carefully, so as not to leave traces after you are done. Something I did not talk about is the necessity of a good authentication mechanism: if you look more closely at the low-level details of the model, it emerges that you need good authentication to be able to satisfy our definitions. We also saw that sometimes you can use privacy to achieve deletion without doing anything at all at deletion time. And finally, you can use specific deletion algorithms, whether the deletion algorithms of history-independent data structures or the deletion algorithms for machine learning models, to get deletion compliance.

That is most of what I had to say. Before closing, I would like to mention a few features of real-world systems that are not captured by our modeling. One example is scheduling and memory allocation when you have more than one process running on a system. In all of our modeling, we assume that the data collector is the only process running on its computer, but in reality there will be other things going on. Can you capture these things? Do you even want to capture these things? This is unclear. Another thing is that all the machines in our modeling, if you look at the low-level details, are sequential, and concurrent processes could bring up concerns that you do not see here. Race conditions, for instance: what happens there? I do not know.
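Before going further, here is the small sketch of the unlearning condition I promised. Everything here is schematic: the "model" is just a running mean, the simplest case where exact unlearning is easy, and real definitions compare distributions over models rather than exact outputs.

```python
def train(dataset):
    # Stand-in "model": the mean of the data points.
    return sum(dataset) / len(dataset)

def unlearn(model, point, n):
    # Remove one point's influence from the mean without retraining.
    return (model * n - point) / (n - 1)

data = [65, 70, 68]
retrained = train([70, 68])                    # learn without Alice
deleted = unlearn(train(data), 65, len(data))  # learn, then unlearn Alice
print(abs(retrained - deleted) < 1e-9)         # True: the two coincide
```

Okay, with that aside done, back to what our modeling leaves out.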
There are also certain attacks that our model does not capture, like timing-based attacks, where you look at how much time the data collector takes to do some operation and infer things from that. This is not captured by our modeling at all. And finally, leakage: maybe the data collector is allowed to leak some bits of information about whatever it has deleted, or about whether it has deleted something. This is not really captured by our model as it stands.

And there are a lot of questions left open here, lots of things to explore in this space, because these things are just taking off. These data protection laws are still new, and people are still figuring out how to do things, or what to do, or even when you are required to do things. There are lots of interesting questions here, and I will not go into all of them right now; I will just leave this slide up. I am happy to say more about any of these questions later on if someone would like me to.