Hi, everyone. This is Akamsha and I'll be moderating this session, "Preserving Privacy While Sharing Data," by Gordon Haff. We'll be starting in another two or three minutes; I think we should wait for more people to join. All right, I think we are good to start now. I'm going to play the recording. Let us know if you are unable to hear or if there's some technical difficulty. So here we go. Sorry for the interruption; I'm going to try to reshare my screen so that you can hear him properly.

Salary data is public for government employees in the US, and in some countries it's the norm that salary data is public. So you first of all have to determine what actually constitutes sensitive data, and this can be complicated for reasons I'll get to in a moment.

I mentioned a trusted aggregator on the last slide. But who can you really trust? A lot of private organizations sell data for profit these days, and there are also data breaches: you trust a company to keep your personal records safe, and then they don't.

There are also some specific technical issues that relate to aggregating data. One of them is a lack of data diversity; "k-anonymity failures" is one term that's used in this area. Basically what this means is that if it's known that you are part of a data set, that by itself can reveal something about you. The fact that you're a US citizen, for example, doesn't reveal an awful lot other than, of course, that you're a US citizen. But if you're in some particular health database, that may reveal that you were tested for some particular disease, which presumably increases the probability that you actually have that disease.

And then, finally, and this is a whole field of study that I won't go into in much more detail here, there's a general susceptibility to certain types of attacks. It turns out that when researchers study this kind of thing, a lot of aggregated and anonymized data that the average individual would look at and say, "well, that seems pretty anonymized and that ought to be okay," turns out not to be. Researchers have come up with ways, some of which are admittedly a bit artificial for real-world data sets, in which it is at least in principle possible to re-identify individuals in those data sets. And even if they can't do it perfectly, they can do it in a statistical way: I don't know that this data point is this person, but there's a 30% chance or a 50% chance. In many cases that's all you need, say if somebody is trying to track down that individual.

One particular attack is illustrated on a US Census Bureau slide; the Census Bureau actually has a lot of good information if you want to dig into this kind of thing in more detail. On the right-hand side, you have certain types of aggregation with statistical measures against those aggregates: the number of females, the number of males, the numbers of white, black, married, and black female residents, and so on. None of those counts is one; they are all aggregates over some number of individuals, and we could increase the numbers here too. But you can really think of that as almost a set of simultaneous equations which can be solved. By solving that set of equations, the aggregates can be transformed back into some very specific data points that don't have a name attached to them, but that are, in effect, unique fingerprints.
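To make that simultaneous-equations idea concrete, here is a small sketch (an illustration added here, not from the slides) in Python with entirely made-up ages: three published statistics over overlapping subgroups of a three-person block are enough to recover every individual age.

```python
# A toy reconstruction attack: published aggregates over overlapping
# subgroups of a tiny census block act like simultaneous equations.
# All demographics and ages here are invented for illustration.
import numpy as np

# Hidden ground truth we pretend not to know: ages of three residents.
# resident 0: male, unmarried; resident 1: female, married; resident 2: female, unmarried
true_ages = np.array([34.0, 29.0, 61.0])

# "Published" statistics, as an agency might release them.
mean_all     = true_ages.mean()          # mean age, all residents
mean_female  = true_ages[[1, 2]].mean()  # mean age, female residents
mean_married = true_ages[[1]].mean()     # mean age, married residents

# Each statistic is a linear equation in the unknown ages.
A = np.array([
    [1/3, 1/3, 1/3],   # mean of all three residents
    [0.0, 1/2, 1/2],   # mean of the two female residents
    [0.0, 1.0, 0.0],   # mean of the one married resident
])
b = np.array([mean_all, mean_female, mean_married])

recovered = np.linalg.solve(A, b)
print(recovered)  # ~[34. 29. 61.]: the individual ages, no names required
```

With more people and more published tables the algebra gets messier, but the underlying idea of the reconstruction attack is the same.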
At least in smaller data sets, those fingerprints may make it possible, again statistically, to figure out the entire set of data that applies to a particular individual.

We can also identify patterns. This is a set of data points from an online fitness tracker for some individual. If you look at these patterns, you can say: hmm, the individual this belongs to, I don't know their name, I don't know their address, I don't know anything else about them, but they seem to concentrate their time in a fairly limited area, which is where the bright white is in that screenshot. We might infer the person lives right around there somewhere. Now, maybe they get in their car and drive to a park and that's where they do their running. Maybe. But if somebody is trying to track somebody down, something like this gives them a pretty good idea. It would be similar if this were done with GPS tracks from automobiles or anything like that: if somebody is always coming and going from this house, that tells us that's probably where they live. If they're also going a lot to another house somewhere, well, that tells us something about their patterns too.

A really interesting area, probably particularly relevant today and very relevant when we talk about differential privacy, is the idea of linkage attacks. What we have here on the left is a record of hospital visits for an individual. This is, in principle, an anonymized record: there's no name here, no Social Security number. But it does include some relevant identifying information: date of birth, gender, and zip code. The date of birth we could fuzz a bit by just changing it to a year, as I mentioned earlier. The zip code is presumably a five-digit zip code, not a nine-digit zip code, although even then zip codes can be pretty small when you get into more rural areas. But let's assume for purposes of argument that we have a record that, taken by itself, is pretty anonymous.

Well, the same individual has other records that are out there in open data sets, or that may even be legally required to be public for various reasons. Voter registration in many places will have information like name, address, and phone number, as well as maybe date of birth and gender, and almost certainly zip code. You can argue about how much of this is public in a given case, but for purposes of our discussion, let's assume that there is some number of fields that overlap with a piece of sensitive data. Then we can start to do correlations, and of course there's not just one public record out there; there are probably many.
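As a minimal illustration of that kind of correlation (an added sketch, not an example from the talk), here is a join in Python over fabricated records: a "de-identified" hospital table and a public voter roll are matched on the quasi-identifiers they share.

```python
# A minimal linkage-attack sketch using pandas, with entirely fabricated rows.
# The hospital table has no names; the voter roll has no diagnoses; joining
# on the shared quasi-identifiers (zip, date of birth, sex) re-identifies the records.
import pandas as pd

hospital = pd.DataFrame([
    {"zip": "02134", "dob": "1981-03-07", "sex": "F", "diagnosis": "diabetes"},
    {"zip": "02134", "dob": "1975-11-30", "sex": "M", "diagnosis": "asthma"},
])

voter_roll = pd.DataFrame([
    {"name": "Alice Example", "zip": "02134", "dob": "1981-03-07", "sex": "F"},
    {"name": "Bob Example",   "zip": "02134", "dob": "1975-11-30", "sex": "M"},
    {"name": "Carol Example", "zip": "02139", "dob": "1990-06-14", "sex": "F"},
])

# Inner join on the overlapping fields attaches a name to each "anonymous" visit.
linked = hospital.merge(voter_roll, on=["zip", "dob", "sex"], how="inner")
print(linked[["name", "diagnosis"]])
```

In this toy case every hospital record matches exactly one voter, which is the point: date of birth, five-digit zip code, and gender together turn out to be surprisingly close to unique.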
To illustrate this, a particularly interesting example comes from a number of years back. Many of you may remember the Netflix Prize. The idea was that Netflix released a whole bunch of records that were anonymized: no names, no usernames, no zip codes, no IP addresses, nothing that might start to give a fingerprint for a user even if their name wasn't attached. The records were basically "Alice really liked these three movies and hated these four movies; Bob loved this one movie and hated everything else," and so forth. It was anonymized, supposedly.

Now, what some researchers did was take that Netflix data, which was intended for researchers to develop machine learning algorithms to improve Netflix's recommendation engine (which didn't really work out all that well, for other reasons that aren't relevant here), and look at another public data set: the rating information in the Internet Movie Database. Again, you have users who like some movies and dislike others. In that case, at least some of those users can probably be identified, because they may have a name that they use across different logins, or even use their actual real-life name. By combining those two data sets, the researchers were able to look at everybody's fingerprint in the Netflix data and everybody's fingerprint in the IMDb data and discover that somebody liked these three movies and hated these four movies, and maybe hardly anybody hates one of those four movies, and yet there's a record in the IMDb database with a very similar fingerprint. You start thinking: that might very well be the same person. This is a statistical inference; we may not be 100% sure, and it's not a smoking gun we could take to court. But it starts to be a pretty good indicator, and maybe you can correlate with yet a third database of some sort that has a similar pattern, and we start to de-anonymize the data.

Now, it's probably not a serious matter if someone uncovers that, aha, this person didn't like the Star Wars prequels. Well, did anyone like the Star Wars prequels? But that's a different question. You can certainly imagine this being more serious if we were talking about healthcare records, for example.

This is not a new problem, and solutions to it are not new. Going back to the 1930s, the US Census, for example, stopped publishing small-area data. This is the sort of problem I was talking about earlier: if you have a census tract that one or two families live in, and you publish the aggregate data, you're not really anonymizing it, because there aren't that many people who could be in that small area. This, by the way, is similar to what, say, Red Hat does with employee surveys, where you ask respondents a whole bunch of questions, including questions about their manager, and then the aggregated data is published within the company. Data is also shared directly with individual managers, but only if they have a certain number of direct reports, because after all, if they have only one direct report, the aggregate of their direct reports' data is the same thing as how that individual answered. And even with two or three direct reports, if the manager gets a bad rating in aggregate, and they have a good working relationship with two of their employees and a bad relationship with the third, they can make pretty good inferences from that.
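That minimum-group-size rule is easy to picture in code. Here is a toy sketch (my own illustration, not Red Hat's actual survey tooling), where a hypothetical MIN_REPORTS threshold gates whether a manager's aggregate is released.

```python
# Small-cell suppression: only release an aggregate when the group is large
# enough that it doesn't effectively expose an individual's answer.
# The scores and the threshold value are invented for illustration.
from statistics import mean

MIN_REPORTS = 5  # hypothetical reporting threshold

responses = {            # manager -> list of anonymous survey scores
    "manager_a": [4, 5, 3, 4, 5, 4],
    "manager_b": [2],    # one direct report: the "aggregate" is that person's answer
}

for manager, scores in responses.items():
    if len(scores) >= MIN_REPORTS:
        print(f"{manager}: average score {mean(scores):.1f} (n={len(scores)})")
    else:
        print(f"{manager}: suppressed (fewer than {MIN_REPORTS} responses)")
```

The single-report case in the sketch is the degenerate one described above: the "aggregate" is literally one person's answer, so it gets suppressed rather than published.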
But let's fast forward to today and talk about formal privacy, which the US Census Bureau adopted in 2020, and specifically the technique of differential privacy. It comes from a 2006 paper, so it's fairly recent, about something called epsilon differential privacy. The impetus for differential privacy is that, as you get all these very large data sets and as machine learning has become extremely powerful, a lot of the pre-existing, in many cases somewhat ad hoc, statistical disclosure limitation techniques are really starting to be ineffective, and intuitions about what constitutes sufficient anonymization were becoming less and less useful.

The basic idea is to widely share statistics over a set of data without revealing anything about the individuals; that's the real objective. Among the requirements: it should be a formal model, so again not ad hoc but with math behind it; it should resist the kinds of linkage attacks I've talked about, and hopefully resist future attacks that we might not know about today. And importantly, and this is the main impetus behind it, it should be effective in situations where a lot of these external data sets are available.

The basic way differential privacy works is by injecting randomness into the data in a mathematically rigorous way to protect individual privacy. The amount of randomness trades off privacy against utility and accuracy. If you look at the right-hand side here, the idea is that data is aggregated by a trusted curator, Alice. You have a querier whom you can't trust not to try to get at sensitive data. The querier puts a query in to the curator, the curator comes up with an answer, and then the curator adds some noise, and that noise is mathematically guaranteed to be sufficient to keep the querier from being able to de-anonymize the subject of the query.

The other way to think about this is that you have the real-world computation: you get a query request, do some computation, and produce an output. Then you have a different input that doesn't include the data of some individual. The difference between the outputs of those two computations is bounded by the value epsilon; more precisely, the probability of seeing any particular output changes by at most a factor of e^epsilon when one individual's data is added or removed. So, in other words, looking at the output, you can't tell mathematically whether an individual person is in that data set or not.

There are some limitations here. One is base rates: if I know certain public characteristics about you, such as sex, age, and so forth, and those allow me to infer certain things about you, such as your likelihood of coming down with some disease, without knowing any private information about you, then anonymizing the data doesn't change that. The noise is also something of a concern, in that you are injecting noise into the data, so at some level the result isn't as good as it could otherwise be. This was a concern raised by a number of researchers about the US Census's use of the technique, for example. Subsequent research suggests that in general you can strike a pretty good balance between protection of sensitive data and the accuracy of the overall aggregated data set.
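To make the noise-injection idea concrete, here is a minimal sketch (my illustration, not the Census Bureau's production system) of the classic Laplace mechanism for a counting query. The salary data and the epsilon values are made up; the underlying fact is just that a counting query with Laplace noise of scale 1/epsilon satisfies epsilon-differential privacy.

```python
# Epsilon-differentially-private count query via the Laplace mechanism.
import numpy as np

rng = np.random.default_rng()

def dp_count(values, predicate, epsilon):
    """Return a noisy count of records matching `predicate`.

    A counting query has sensitivity 1 (adding or removing one person changes
    the true count by at most 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy.
    """
    true_count = sum(1 for v in values if predicate(v))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

salaries = [52_000, 61_000, 75_000, 49_000, 88_000, 120_000]  # made-up data

# Smaller epsilon -> more noise -> more privacy, less accuracy.
for eps in (0.1, 1.0, 10.0):
    print(eps, round(dp_count(salaries, lambda s: s > 60_000, eps), 1))
```

Run it a few times and the tradeoff is visible directly: with epsilon of 0.1 the answers bounce around a lot, while with epsilon of 10 they sit very close to the true count of 4.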
One of the most difficult problems here is the idea of repeated queries. I mentioned the epsilon value earlier; it's probably better to think of it as a privacy budget that you can set. The thing is, you use up that budget every time you do a query against the same data, which is a particular problem in a lot of modern contexts where you're doing interactive digital queries rather than just having some aggregated tables that you have access to. There are ways to deal with this: for example, a randomized subset of the data can be used, and once the privacy budget has been exceeded, a different randomized subset can be used. So there are techniques, but nonetheless it's a limitation.

All of this has assumed that we have a trusted third party. But what if we don't? That's where the multi-party computation I mentioned at the beginning comes in. This is essentially collaborative analysis of siloed data sets without trusting a third party. You have a protocol that is equivalent to an incorruptible trusted party, and conceptually some of this is similar to how enterprise distributed-ledger (blockchain) technologies work. The basic mechanism is that parties can jointly compute a function over their inputs, each holding only shares of those inputs, using a protocol that doesn't reveal information about the inputs themselves. The objectives are to preserve both privacy and correctness, while assuming that some number of participants will try to break the protocol or will collude with each other.

Exactly how you implement multi-party computation does depend somewhat on the threat model you're assuming. Assuming that one party may, by design or by accident, reveal data if it could is a different threat model than assuming that 50% of the parties may collude to pierce the veil, so to speak. And then you also need to think about how much overhead is involved. It works by having a protocol distribute encrypted shares of the data. As I said, the implementation efficiency depends on the threat assumptions, and there are different overheads with different threat models, but in general we can say that, unlike fully homomorphic encryption, the compute overhead is fairly low. There is, however, a lot of communications overhead, because of all the encrypted communication taking place between the parties.

One specific use of this came out of Boston University and the city of Boston, where participating companies shared their individual wage data, but in a way that was protected by cryptography so that no third party ever actually saw the unencrypted data.

If you're interested in this topic, we've actually written about it and done some interviews in Red Hat Research Quarterly; to give you a plug, this is one of the organizations at Red Hat I do a lot of work with, and I suggest you subscribe to the newsletter. There's also a related website with a lot of information about ongoing research happening at Boston University and other universities associated with Red Hat Research, including the Boston University and Red Hat Collaboratory where some of this work is taking place. And I mentioned Andrew Trask and OpenMined at the very beginning: if you're the sort who likes to get your hands dirty with this stuff, there is a set of Python libraries there that let you play around with differential privacy and multi-party computation.
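For a sense of the underlying arithmetic, here is a toy additive secret-sharing sketch in the spirit of the Boston wage study (my illustration; the company names and figures are invented, and real MPC frameworks add authentication, malicious-party protections, and network protocols on top of this).

```python
# Additive secret sharing: each company splits its payroll total into random
# shares mod a large prime and hands one share to each computing party.
# No party ever sees a raw total; only the sum across all companies is reconstructed.
import secrets

PRIME = 2**61 - 1  # field modulus (an arbitrary large prime)

def make_shares(value, n_parties):
    """Split `value` into n additive shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

wages = {"company_a": 1_200_000, "company_b": 950_000, "company_c": 2_300_000}
N_PARTIES = 3

# Each company sends one share to each computing party.
party_inputs = [[] for _ in range(N_PARTIES)]
for total in wages.values():
    for party, share in zip(party_inputs, make_shares(total, N_PARTIES)):
        party.append(share)

# Each party sums the shares it holds; combining the partial sums reveals
# only the aggregate across all companies.
partial_sums = [sum(shares) % PRIME for shares in party_inputs]
print(sum(partial_sums) % PRIME)  # 4_450_000, the combined payroll
```

Each computing party only ever holds numbers that look uniformly random on their own; the aggregate appears only when the partial sums are combined, which is what lets the companies avoid trusting any single third party with their wage data.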
Thank you. We have some time for questions; does anyone out there have any questions? As I dropped in the chat, privacy was actually the topic at this week's Red Hat Research Day. We split Research Day up into some topic-oriented shorter sessions this month; we had a session on privacy earlier this week, and we'll have video up from that hopefully fairly soon. So if you're interested in this topic, that would be a great place to go for more.

Well, if nobody has any questions, I will sign off. You can always reach me at ghaff at redhat.com if you have any questions, and I'm also on Twitter with the same handle, assuming I can get my Twitter working properly again one of these days. So thank you, everyone.