All right, I think we were just about here, and I appreciate all of you coming to the cleanup slot. I'm an emerging technology evangelist at Red Hat. The genesis of this talk, and I have some links at the end, is that Red Hat Research is doing work with a number of universities in this area, most notably Boston University, around some newer techniques that allow the use of data without compromising the privacy of that data. I'm going to get to those in the back half of this presentation. I'm going to lead off by taking things up a level and explaining how we got to where we are.

The reason this is such a hot topic right now is that zombies eat brains, and machine learning eats data. This is not exclusively a machine learning problem, but the fact that we need a lot of data to do machine learning has increased the urgency of these types of problems. Furthermore, machine learning and things like image recognition can do a lot of pretty cool stuff in the medical area, for example, recognizing tumors, recognizing glaucoma, and so forth. There's a lot of hope that we can improve things like medicine by using big data in this manner.

The problem is that data can be rather private. Health data in particular can be very private, and so can financial data. A lot of this is data we might want to use, and that people in general might be fine with us using, so long as they're not personally identifiable.

Now, this is not in general a new problem, and we have the ideas of anonymization and pseudonymization, which are basically two flavors of the same thing. With anonymization, you strip off the identifying information, or maybe you never collect it in the first place, like an anonymous survey. With pseudonymization, you keep individual records, but some trusted organization substitutes a token for, say, a person's name.

How do you do this? Naively, you might think it isn't really that complicated. You remove a personal data field, like the person's name. You encrypt or transform personal data fields: for example, if there's a birth date in there, for a medical study you might care how old the person is, but you probably don't care what exact date they were born on, so instead of the full date of birth you just put 1980. And in the third area, and I'm going to mention the US Census a number of times here, partly because there is a new census happening this year and they're using some new techniques in it, you have the data aggregated by a trusted agency. The US Census has all the personal data, but you trust them not to use it or reveal it in a way that lets people be identified.

Do these traditional techniques actually work? They do, to a degree. We've been doing this kind of thing for a long time, and it mostly works. But first of all, what is personal data? What do you consider private? Salary information, for instance, is something a lot of Americans consider private, but if they work for a government agency it might be public, and in some countries in Europe it is public information. And who can you really trust?
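To make that naive, field-level approach concrete, here is a minimal sketch (the field names and the record are made up for illustration) of dropping direct identifiers and generalizing a date of birth down to a birth year:

```python
from datetime import date

def naively_anonymize(record: dict) -> dict:
    """Drop direct identifiers and coarsen the birth date to a year."""
    # Remove fields that identify a person outright (hypothetical field names).
    anonymized = {k: v for k, v in record.items() if k not in {"name", "ssn"}}
    # Generalize: keep how old the person is, not the exact date they were born.
    dob = anonymized.pop("date_of_birth", None)
    if isinstance(dob, date):
        anonymized["birth_year"] = dob.year
    return anonymized

print(naively_anonymize(
    {"name": "Alice", "date_of_birth": date(1980, 11, 8), "zip": "02134", "diagnosis": "glaucoma"}
))
# {'zip': '02134', 'diagnosis': 'glaucoma', 'birth_year': 1980}
```

As the rest of the talk argues, a record like that can still be re-identified: the quasi-identifiers that remain (ZIP code, birth year, and so on) are exactly what linkage attacks exploit.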
I just said we tend to trust the census with our data, but with all the data breaches and everything else, it can be hard to figure out who you should trust.

I'm not going to get into the real nitty-gritty of anonymization here, but just to give an example of a problem: a lack of data diversity, which is one way k-anonymity can fail. What does this mean? Let's say I know you are in a particular database, but I don't know anything else about you. Oh, but everybody in that database has a particular disease or a particular class of diseases. Just knowing that you were in that data tells me a great deal about you.

There's actually been a lot of recent research in this area, in terms of identifying individuals even in supposedly anonymized data. This slide is from a US Census deck again. One of the things you can do, for example, is what's called reconstruction. The table on the right is essentially a statistical aggregation of data. You can write a bunch of equations that represent that data, and it turns out you can then solve those equations and come out with the table on the left. In this case the reconstructed data does not have individuals' names, but it does, essentially, have a fingerprint for each individual. Similar things can happen with, say, browser signatures: if you have enough of this quasi-identifying data, you can start to come up with a fairly unique representation.

Now you're saying we still don't have their name or their address. True, which brings us to re-identification. This is essentially merging anonymized data with other data sets. And guess what? Those other data sets are increasingly common and increasingly easy to combine. Even setting aside private data sets, and we're all justifiably unhappy these days about the availability of credit databases and that kind of thing for sale, some data is public as a matter of policy. Real estate transactions in the US are public data; yes, people can sometimes shield them behind trusts, but basically that kind of thing is a matter of public record. And a lot of data that is public as a matter of policy, which may have seemed fine when someone had to go to a dusty county clerk's office and pull something out of a file cabinet, is a lot easier to exploit when someone can just go online.

Here's one example, and I couldn't find as good an example of this as I would have liked, and I didn't construct one from my own data. This is essentially someone's Strava trace, a GPS fitness track of them running or cycling or maybe driving. As I say, this isn't as good an example as I would have liked. We don't know who this person is, but we can probably start to conclude that they live, or at least work, somewhere around there. That doesn't reveal an awful lot in this case, but consider my case: I own a house, which is a matter of public record, and I'm in a fairly rural area, and I can guarantee you that if I did this with the GPS in my car, you would know where I lived. It would be pretty obvious. And once you have my name, you can go online and find out a lot more about me.
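To make the reconstruction idea from a couple of paragraphs back concrete, here is a toy sketch with entirely made-up published statistics: each released aggregate becomes a constraint, and a brute-force search finds the individual records consistent with all of them. Real attacks add many more equations (medians, cross-tabulations by age, sex, race, and so on) until only one table remains.

```python
from itertools import product

# Hypothetical published aggregates for a three-person block.
COUNT = 3
MEAN_AGE = 30          # so the three ages sum to 90
SMOKER_COUNT = 2
SMOKER_MEAN_AGE = 25   # so the two smokers' ages sum to 50

consistent = set()
for ages in product(range(18, 80), repeat=COUNT):
    if sum(ages) != MEAN_AGE * COUNT:
        continue
    for smoker_flags in product((False, True), repeat=COUNT):
        if sum(smoker_flags) != SMOKER_COUNT:
            continue
        smoker_ages = [a for a, s in zip(ages, smoker_flags) if s]
        if sum(smoker_ages) != SMOKER_MEAN_AGE * SMOKER_COUNT:
            continue
        # Record the table of (age, smoker) rows, ignoring row order.
        consistent.add(tuple(sorted(zip(ages, smoker_flags))))

print(f"{len(consistent)} record tables are consistent with the published statistics")
```

The more aggregates you publish, the smaller that consistent set gets, which is exactly the fingerprinting problem described above.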
I don't actually go into Red Hat's Westford office very much, but if I did, you would clearly know where I worked as well. And I really don't live a terribly exciting life, but if I did, you could take that record and go, oh, that's interesting, that spot where he goes all the time.

That is an example of the broader thing here, often called linkage attacks. Here again we have an example of public records: voter registration. Depending on where you live, there may be more or less information in the voter registration. I don't think my date of birth is in my town's voting records, and probably not my phone number either, but some of those things are. And once you have that quasi-identifying information, you can basically link the two records. If I have a record over here that doesn't have a name attached to it but does have a fairly unique fingerprint, I can link it to this other information.

This next one is a really fun example of re-identification, fun also because it shows just how powerful these techniques can be in ways you wouldn't expect: how could someone possibly do that? Some of you may remember the Netflix Prize from a number of years ago, which basically gave researchers a bunch of anonymized viewing records from Netflix, with how people had rated movies and that kind of thing, and the goal was to come up with an algorithm that did the best job against a training set at predicting what other movies those folks would like. Well, some other researchers said, oh, that's interesting. They took the anonymized Netflix data and they also looked at the IMDb movie database, and what they found was that, not with anything like 100% accuracy, they could basically cross-match the two. This anonymous Netflix viewer loved movies A, B, and C and hated movies D, E, and F; and oh, Alice on IMDb, hmm, she really liked A, B, and C and hated D, E, and F, and so forth. Given enough data, you can at least make statistical inferences, if not absolutely certain identifications.

That brings us to the researchy part of this. Here are three areas where, as I say, Red Hat Research has some PhD students working. I'm mostly going to talk about the first and the third, but I'll mention the second for completeness. Differential privacy is essentially a response to the fact that all this re-identification has eroded the traditional techniques in this area. I won't say the traditional techniques are completely ad hoc; there is science associated with them and historical research has been done, but a lot of it is heuristics and rules of thumb. What we want to do here is widely share statistics over a data set without revealing anything about the individuals within that data, the US Census being a good example. Some of the ideas here have been around for a while, but this basically comes out of 2006 work by Dr. Cynthia Dwork, who is now at Harvard, on epsilon-differential privacy. What makes differential privacy different is that it's really a formal model. There's math here; it's not just heuristics, as has often been the case in the past.
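Going back to the voter-roll idea for a moment, here is a hedged sketch of a linkage attack using pandas; the rows are fabricated, and the point is simply that joining on shared quasi-identifiers is a one-liner:

```python
import pandas as pd

# "Anonymized" medical records: no names, but quasi-identifiers remain.
medical = pd.DataFrame([
    {"zip": "03063", "birth_year": 1962, "sex": "M", "diagnosis": "hypertension"},
    {"zip": "02139", "birth_year": 1990, "sex": "F", "diagnosis": "asthma"},
])

# Public voter registration data for the same area.
voter_roll = pd.DataFrame([
    {"name": "J. Doe",   "zip": "03063", "birth_year": 1962, "sex": "M"},
    {"name": "A. Smith", "zip": "02139", "birth_year": 1990, "sex": "F"},
])

# If the combination of quasi-identifiers is unique, the join re-attaches a name.
reidentified = medical.merge(voter_roll, on=["zip", "birth_year", "sex"])
print(reidentified[["name", "diagnosis"]])
```

The Netflix and IMDb cross-match works the same way in spirit, except the "quasi-identifier" is a vector of movie ratings and the match is statistical rather than exact.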
And the idea here is to at least resist these kinds of linkage attacks. Anybody involved in this is very deliberate about not using words like "prevent," and I'll also get to a couple of the limitations of these techniques, but it's certainly something we're making advances on. The idea is that you inject random noise into a data set in a mathematically rigorous way. You fuzz up the data, and that epsilon parameter basically trades off privacy against the utility or accuracy of the data. At one extreme, you totally randomize the data; everybody's privacy is safe, but the data is also completely useless. At the other extreme, you don't inject any noise at all, and you're basically not protecting privacy.

The way this is expressed: you have the real-world computation over the actual data set, and then you ask, what if you ran it without the data about one person? You take one person out, you run it again, and the difference in the outputs is at most this epsilon. If I can't tell whether that one individual is in the data set or not, if the noise is such that someone could be in the data set or not and I wouldn't be able to tell the difference from the output, that is essentially what differential privacy is.

As I say, the census is going to be using this for the first time this year; there was an article in the New York Times about it; did I include that headline? Yes, I did. And a lot of the ideas here have been around for a while. Going back to 1930, the census folks said, if we publish statistics about this ten-person town, that's probably not such a good idea, because even though the statistics are grouped over a number of people, if the group is small enough, grouping doesn't do you much good. There is some controversy about this. A number of researchers were complaining that the census is going to hurt them by providing less accurate data. The fact is, there has been research on how much fuzzing of the data really affects things, and the consensus in the research community seems to be: not very much.

The other problem is figuring out how much noise to add, and as I say, it's a formal process, but one issue is that you effectively use up that epsilon each time you make a query. Particularly with interactive queries, where someone could in principle run an unlimited number of queries, you can still get around differential privacy that way. So what you do is only release subsets, and once somebody has made a certain number of queries, they have to switch to a different subset, and things like that.

So that is dealing with aggregated statistical data. For the last part of this presentation I'm going to talk briefly about some of the ways you can work with a data set you don't want anybody else to see, without a trusted third party. One set of techniques is something called homomorphic encryption. The idea is that you can ship a bunch of encrypted data off to a cloud provider, the operations are actually done on that encrypted data, and you get the results back and decrypt them. This uses techniques like lattice-based encryption, dates to about 2009, is very computationally intensive, and basically isn't very practical at this point, so today it's more an interesting idea than something you would deploy.
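Going back to the epsilon mechanics for a moment, here is a minimal sketch using the classic Laplace mechanism (my choice of mechanism for illustration; the talk doesn't specify which mechanism the Census Bureau uses): noise drawn from a Laplace distribution with scale sensitivity/epsilon is added to a count before it's released.

```python
import numpy as np

def private_count(true_count: float, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count with epsilon-differential privacy via the Laplace mechanism.

    A counting query has sensitivity 1: adding or removing one person's
    record changes the true answer by at most 1.
    """
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

true_answer = 1234  # e.g. how many people in this tract have condition X

# Small epsilon: lots of noise, strong privacy, less useful answer.
print(private_count(true_answer, epsilon=0.1))
# Larger epsilon: little noise, weaker privacy, more accurate answer.
print(private_count(true_answer, epsilon=5.0))
```

The budget problem mentioned above shows up here too: epsilons add up across queries, so an analyst allowed unlimited interactive queries could average the noise away unless you cap what they see or rotate subsets.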
A related area is something called multi-party computation, and the idea there is to allow collaborative analysis of siloed data sets without trusting a specific third party. You don't need a US Census that you trust in order to do this. Conceptually it's a little like some of the private distributed ledger, or blockchain, mechanisms, in that there are secret shares distributed among the participants doing the computation.

Some of the considerations in developing these protocols are preserving privacy and correctness. What are the threat models? Maybe one participant, or a third of the participants, or colluding participants are trying to cheat and break the scheme. How adversarial an environment are you worried about, or are you just looking for some fairly simple guarantees? That's the sort of thing there's a lot of research going on in right now.

I will mention that this has actually been used. This is an example from Boston University: a number of companies and the city of Boston wanted to look at gender differentials in salary. A lot of companies said, in principle we'd be willing to share our data; in fact, our lawyers do not want us to share that data. We're happy to participate, but you can't see our data. That's kind of an impasse, and BU used multi-party computation to get past it. They used a somewhat atypical setup in which BU still did the actual computation, but things were arranged with distributed shares so that the data analysts couldn't actually see the data. So that's a bit of a blended mode. In general there isn't a lot of compute overhead here, but there is a lot of communications overhead, as you can imagine, with these shares being moved around and parts of computations being combined.

This is an overview of the space, the way I think of it. Things like homomorphic encryption, which I just talked about, are about input privacy. ZK proofs are zero-knowledge proofs. Multi-party computation also does some policy enforcement. Arguably, trusted execution environments like SGX are a different thing, but they're solving some of the same problems of not trusting the underlying environment. And then differential privacy is really about the output of the data: you trust the data coming in, you trust that the data has been secured through a third party, but then you have to aggregate it.

As for ongoing research, some of you may have picked up the Red Hat Research Quarterly out there earlier in the weekend; if not, you can subscribe. There's an article, I think in issue two, that goes into some of this. One other thing I'll mention is that there is a Python library that works in conjunction with PyTorch, from an organization called OpenMined, that implements multi-party computation and differential privacy, so you can actually play with some of this stuff, and I think they have some sample data sets and so forth.

So with that, thank you all. I have a minute or two for questions, yes? Yeah. Oh, okay. Oh, cool. Yeah, I listened to a talk by Andrew Trask just a couple of weeks ago, and he specifically mentioned that Python library. Mm-hmm.
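Going back to the secret sharing that underpins multi-party computation, here is a toy sketch of the additive-sharing idea (my own illustration with made-up salary totals, not the protocol BU actually used): each contributor splits its number into random shares, any single share is meaningless on its own, yet the shares can be combined to reveal only the aggregate.

```python
import random

MODULUS = 2**61 - 1  # do the arithmetic modulo a large prime

def share(secret: int, n_parties: int) -> list[int]:
    """Split a secret into n additive shares that sum to it mod MODULUS."""
    shares = [random.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares

# Three hypothetical companies, each with a salary total they won't disclose.
salaries = [1_200_000, 950_000, 2_300_000]
shared = [share(s, n_parties=3) for s in salaries]

# Party i only ever sees the i-th share from each company and sums them...
per_party_sums = [sum(column) % MODULUS for column in zip(*shared)]

# ...and combining those partial sums reveals the aggregate, nothing more.
aggregate = sum(per_party_sums) % MODULUS
print(aggregate == sum(salaries))  # True
```

The communications overhead mentioned above comes from exactly this kind of share distribution and recombination happening for every intermediate value in a real computation.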
Well, yeah, there is a whole set of concerns here that I didn't touch on at all. To give an example of the kind of thing you're talking about: if you train on biased data, you get biased models. Amazon, for example, trained a hiring algorithm on, here's what our successful employees look like, we should hire more people like them. Oh, the gender ratio is really off? Well, we're just doing what the model said. So that is definitely a big problem: is the input data you have, in addition to potentially holding private data, even appropriate? One more? Anyone? Okay, thank you all.