Hi everyone. I'm a technology evangelist at Red Hat, and today I'm going to talk about how we can share data while maintaining the privacy of the individuals in those data sets. Now, on the surface this might seem like a pretty simple problem. You just strip off the obvious identifiers: names, addresses, social security numbers, or other public or semi-public identifiers that might be traced to an individual. It seems straightforward, but let me tell you a story from the 1990s.

In the mid-1990s, the Massachusetts Group Insurance Commission decided to release anonymized hospital records of state employees for medical research. This might seem like a pretty good idea. Massachusetts was, and is, one of the centers for both hospitals and medical research of various types. In fact, one of the early COVID-19 vaccines was developed by a company called Moderna, which is in Kendall Square, on the east edge of the MIT campus in Cambridge. So this group released hospital records for state employees. They weren't stupid. They took off the names, the addresses, and the social security numbers, and they figured that should be good enough to preserve anonymity. Well, what did William Weld, the governor of Massachusetts at the time, find out a few days later? An MIT graduate student by the name of Latanya Sweeney had done a linkage attack on this data. What that means is that she took this data set, took voter registration records, which are public information as a matter of law, and linked the two together using the birth date, the zip code, and the sex. Those fields seemed innocent enough to leave in the data set, and might very well be relevant to its intended use, but combined with the voter registration records they were enough to re-identify people. Latanya Sweeney has done a lot of subsequent research in this area, and she's found that those three pieces of information are sufficient to re-identify individuals in medical records a very high proportion of the time; in one of her studies it was about 85 percent.

Another example of a linkage attack comes from Netflix and IMDB data. Some of you may remember the Netflix Prize contest a number of years back, in which Netflix made anonymized data sets available so people could compete to build better recommendation systems. For various reasons the project didn't really work out, but that's not relevant here. So we have this anonymized Netflix data: this person liked The Terminator, they hated the Transformers movies, they had good taste in other words, and a whole bunch of people like that. Researchers then looked at IMDB data, where people can also rate movies. The identities on IMDB are not necessarily public names, but in some cases they may be, or they can be correlated through yet other databases to public names, maybe because someone uses the same handle across sites. Combine those two together and the researchers were able to find patterns and correlate one group of people with the other. Only in a statistical way; it's not perfect. You can't be 100 percent sure that this anonymized database record here is Alice over there, but you can certainly start to draw statistical inferences. So what can you do about all this?
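Before going further, it may help to see what a linkage attack boils down to in code. This is a minimal sketch in Python with pandas; the records, names, and column labels are invented for illustration, not the actual Massachusetts or Netflix data. The point is only the shape of the attack: a join on quasi-identifiers shared between a "de-identified" data set and a public one.

```python
import pandas as pd

# Hypothetical "anonymized" hospital records: direct identifiers removed,
# but quasi-identifiers (birth date, zip code, sex) left in place.
hospital = pd.DataFrame({
    "birth_date": ["1945-07-31", "1962-03-14", "1978-11-02"],
    "zip_code":   ["02138",      "02139",      "01002"],
    "sex":        ["M",          "F",          "F"],
    "diagnosis":  ["hypertension", "asthma",   "diabetes"],
})

# Hypothetical public voter-registration-style records, which do carry names.
voters = pd.DataFrame({
    "name":       ["P. Example", "J. Doe",     "A. Smith"],
    "birth_date": ["1945-07-31", "1962-03-14", "1978-11-02"],
    "zip_code":   ["02138",      "02139",      "01002"],
    "sex":        ["M",          "F",          "F"],
})

# The linkage attack is just a join on the shared quasi-identifiers.
linked = hospital.merge(voters, on=["birth_date", "zip_code", "sex"])
print(linked[["name", "diagnosis"]])
```

The join on birth date, zip code, and sex is the whole trick: once the two tables share those columns, re-identification is essentially a one-liner.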
Let's take a look at an example that's at least similar to the William Weld case. You have a private data set, which is all the information about a hospital visit, and you have a public data set, which is something like voter registration data, and then you have certain attributes that link the two together. So let's think about those attributes. Gender isn't terribly interesting on its own, because it's a very, very large bucket, so it's not going to tell you an awful lot. For zip code, in the U.S., a five-digit zip code would usually be okay; a nine-digit zip code, which is obviously more localized, is a stronger identifier, so you might not want to include that. There are some edge cases in sparsely populated areas, where you might think about using the county instead, which, depending on the state, is a larger administrative district. The one that's probably most problematic here is the date of birth, because that really starts to give you a fairly precise fingerprint. What's very common in this situation is to substitute the birth year for the entire date of birth. Now, making any of these buckets bigger has downsides: in the case of the date of birth, you might lose some seasonality effects; in the case of, say, county rather than zip code, you have a much bigger lump of people, and you may lose some environmental variables that could shed light on what you're studying. But that's the type of thing you tend to do.

Certain types of data are also just hard to anonymize, because there's effectively, at least statistically, identity information embedded in the data itself. Take the example of GPS tracking in a fitness app. It doesn't identify a person exactly, but you look at a hot spot in the middle of the map and you think, that may be the neighborhood where this person lives, and if we also see them regularly going somewhere else that's interesting for some reason, we might start to correlate identities. If we're talking about something like GPS tracks for automobiles, that is almost certainly going to let us identify where the person lives, which is probably a matter of public record, at least if they own a home; it's going to identify where they work; and it may identify some other things the person may not want us to know. So some data is very hard to make public at all without giving the game away, so to speak.

You can also often aggregate data, which is what the US Census does, so that no individual data record exists outside of a trusted curator, the US Census Bureau in this case. But aggregation is not a silver bullet either, as the Census Bureau recognized fairly early on. For example, if your aggregate table covers too small an area, you're not really hiding individual records. An example that anybody at Red Hat listening to this is familiar with is our associate surveys. The aggregated data is made available to the company as a whole, and data for a particular group is also shared with its manager, but only if the manager has a certain number of reports, and it's easy to see why that's the case.
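Here is a rough sketch of those two mitigations together: generalizing quasi-identifiers into bigger buckets, and suppressing aggregates over groups that are too small. The column names, the data, and the minimum group size are all made up for illustration.

```python
import pandas as pd

# Hypothetical survey-style records with quasi-identifiers attached.
records = pd.DataFrame({
    "birth_date": pd.to_datetime(
        ["1945-07-31", "1962-03-14", "1978-11-02", "1979-01-20"]),
    "zip_code":   ["02138-1234", "02139-5678", "01002-9012", "01002-3456"],
    "team":       ["A", "A", "A", "B"],
    "score":      [4, 5, 3, 4],
})

# Generalize: keep only the birth year (losing seasonality) and
# truncate ZIP+4 down to the five-digit zip.
records["birth_year"] = records["birth_date"].dt.year
records["zip5"] = records["zip_code"].str[:5]

# Aggregate, and suppress any group smaller than a minimum size --
# the same idea as only showing survey results to managers with
# enough direct reports.
MIN_GROUP = 2
by_team = records.groupby("team")["score"].agg(["mean", "count"])
released = by_team[by_team["count"] >= MIN_GROUP]
print(released)  # team B, with a single respondent, is withheld
```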
In the extreme case, if a manager has only one direct report, that data isn't going to be anonymous at all. But even if a manager has a few people reporting to them, they may still be able to make inferences and guesses based on the responses and what they know of the individuals in question. The US Census Bureau is in the same boat, and over the years they've done a number of things to protect individual data records. However, it's always been somewhat ad hoc, and in fact, if you read through some of the history, and the Census Bureau actually has a lot of good information on their site if you're interested in this kind of thing, you'll see a pattern of: we tried this in 1970 and it hit this problem, so we tried this in 1990, and so on. The most interesting development is formal privacy, which they used in this last census, and that's what I'm going to spend the rest of this session on.

The requirements for what somebody like the Census Bureau is looking for are basically fourfold. First, you want formal models; you don't want to be ad hoc, squinting at some technique and saying, yeah, that ought to be good enough, which is essentially what the Massachusetts Group Insurance Commission did in the 1990s. Second, you want to be resistant to linkage attacks such as we've been talking about. That's not the only way to de-anonymize data, but it's certainly a very powerful one. Third, you want a technique that isn't just about linkage attacks, but is also resistant to techniques we haven't really thought of yet. And fourth, all of this has to hold in the context of an increasing number of external data sets that can be linked in various ways and therefore make de-anonymization easier.

Here, formal privacy refers to differential privacy, a relatively recent technique usually dated to 2006 with a paper by Cynthia Dwork and others, although it builds on earlier research in the same general vein. Essentially, differential privacy is a mechanism where a certain amount of noise is inserted, calibrated in such a way that you can't tell whether any one individual is in the data set or not; the result looks essentially the same whether a given person's record is there or has been removed. Another way to think about it is that there is a certain privacy budget. What's really different, going back to the requirements I just mentioned, is that this is a formal technique. It's not perfect, and I'll get to some of the limitations, but it is formal. It's not the ad hoc "let's try this, it seems like it ought to be good enough to protect privacy."

So we have three data providers here who have personally identifiable information that they're willing to have aggregated. They send this data to Trustworthy Alice. Alice in this scenario is a trusted curator, trusted not to ship the entire database off to somebody else. There are other technologies, such as multi-party computation, that aim to get around the need for a trusted curator, but I'm not going to go into those today. Then we have Shifty Bob over here, and we don't really trust Shifty Bob. We worry that he may try to get access to personally identifiable information from one of the data providers. Shifty Bob asks Alice a question: tell me about something. Trustworthy Alice needs to respond. Now, she could just send a raw answer to Shifty Bob.
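As a concrete, if simplified, illustration of noise calibrated to hide any one individual, here is the textbook Laplace mechanism applied to a counting query. This is a toy sketch, not the Census Bureau's implementation; the function name, the data, and the choice of epsilon are just for illustration.

```python
import numpy as np

def private_count(values, predicate, epsilon):
    """Differentially private count of records matching a predicate.

    Adding or removing one person changes a count by at most 1
    (sensitivity = 1), so Laplace noise with scale 1/epsilon is enough.
    A smaller epsilon means more noise and a stronger privacy guarantee.
    """
    true_count = sum(1 for v in values if predicate(v))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical ages held by the trusted curator.
ages = [34, 29, 61, 45, 52, 38, 70, 27]
print(private_count(ages, lambda a: a >= 50, epsilon=0.5))
```

Because the noise is on the same order as any single person's contribution, the answer Bob sees is statistically almost identical whether or not a particular individual's record is in the data.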
However, as we've seen, there are various problems with just shipping a raw answer off to somebody who's querying the database. So with differential privacy, noise is inserted, at the level set by the privacy budget I mentioned earlier, mathematically calculated to obscure any individual's records in that database. Rather than getting a raw answer, Bob gets a private answer, an answer that has been privatized in some way, so that he can't ferret out any information from it that he isn't supposed to get. And as I mentioned earlier, it's effectively as if a person had been removed from that database, so Shifty Bob has no way of knowing whether any one individual is in the database or not. That's differential privacy at a very high level.

There are some limitations to differential privacy. First of all, there's the base rate: you know somebody is in a particular database, or you just know certain public characteristics about them, and that may tell you, for example, their actuarial likelihood of getting a particular disease. The fact that differential privacy is hiding their specific medical records doesn't get away from you knowing those things about them. There's also the noise that's being inserted. This was a concern among a number of researchers when the US Census Bureau announced they would be doing this, that you were making the data worse at some level. There has been subsequent research in this area, and I think it's fair to say the general consensus is that this isn't really a problem if it's done right. Probably the most serious limitation is the issue of repeated queries. If you use differential privacy for published tables, there's really only one query, if you will: you put the table together, and then people can run all their statistical tests against it. But if someone can ask questions over and over again of a particular database, they can use up the privacy budget. There are various ways around this; for instance, the queries can be run against a randomized subset of the overall data, and once the privacy budget has been used up, you create a different randomized subset. So there are ways around it, but in a world where you're querying data interactively, there are some limitations to know about (there's a small sketch of that budget accounting at the end of this transcript).

If you're interested in more, I encourage you first of all to subscribe to the Red Hat Research Quarterly; we've had interviews and articles on this topic there. You're watching this at DevConf.CZ virtually, so check out our virtual Red Hat Research booth, and there will be folks there happy to talk to you. Also check out the Boston University Red Hat Collaboratory, where some of this work has been done. If you want to get your hands dirty, one place you can do so is OpenMined, which has a Python library that works with differential privacy as well as some other privacy-preserving techniques. And check out, again covered in the Research Quarterly, the Harvard Dataverse, which is a research-oriented collection of data sets, so you can read about some of the things they've done. So with that, thank you for your time. Good to see, or sadly not see, everyone. I look forward to seeing people in person, hopefully next year.
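To make the repeated-queries limitation concrete, here is the toy budget-accounting sketch referred to above: a curator that answers noisy counting queries and stops answering once the cumulative epsilon is spent (basic sequential composition, where the epsilons of the answered queries simply add up). The class and its interface are invented for illustration and are not taken from any particular library.

```python
import numpy as np

class TrustedCurator:
    """Toy curator: answers counting queries with Laplace noise and
    refuses further queries once the total epsilon budget is used up."""

    def __init__(self, data, total_epsilon):
        self.data = data
        self.remaining = total_epsilon

    def count(self, predicate, epsilon):
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon  # sequential composition: epsilons add
        true_count = sum(1 for v in self.data if predicate(v))
        return true_count + np.random.laplace(scale=1.0 / epsilon)

# Hypothetical data and budget.
alice = TrustedCurator(data=[34, 29, 61, 45, 52, 38, 70, 27],
                       total_epsilon=1.0)
print(alice.count(lambda a: a >= 50, epsilon=0.5))  # first query is fine
print(alice.count(lambda a: a >= 50, epsilon=0.5))  # spends the rest
# A third query with epsilon=0.5 would raise: the budget is gone.
```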