And this talk has a lot to do with balance. How do we balance privacy on one side with open data on the other? Open data has a lot of value: how do we use it to innovate, to improve things? And at the same time, how do we protect the people behind these data sets?

One way to protect everyone is to lock everything down. Let's stop sharing data with researchers, with the public, with partners. Let's close down this conference, because we are not going to share data anymore. But there is a continuum. At one extreme we can share all the data; at the other extreme we can delete all the data. In between, we can use techniques like de-identification, aggregation, and creating synthetic data that looks like real data but isn't real anymore. And the question is, where is the right place on that continuum? What should we do? I don't have that answer. There are a lot of policies and laws to follow, and the answer is different for each data set. But what I do have is a lot of ideas and tools that you can use to work through this process.

So yes, I work at Google. Our mission is to organize the world's information, and while we do that, we have to develop a lot of these solutions. We are sharing what we build internally as tools that you can use too. One of these tools is the Google Cloud Data Loss Prevention (DLP) API. It's an open API, ready for your use. I've been experimenting with it, you can experiment with it too, and it has a nice free usage tier.

So let's talk a little bit about sensitive data and security. What's wrong with this picture? Is there something that shouldn't be there? A credit card. A credit card number. That's a valid credit card number; it's not a real card, but the number is valid. We have data like this in many places that we may want to obscure. In this case, I can feed pictures to the DLP API too, and the API can find where the sensitive data is and block it out. This is configurable, and it's automatic.

And it's not only credit cards; there's a lot of sensitive data that we may want to hide or transform. PII that helps identify people, financial data, health data. As we just saw, this data lives not only in documents but also in pictures and in databases. It comes from data that we collect, data that comes from our employees, data that comes from partners. Here we have an example of the API at work with different kinds of data: how do we transform data, how do we hide part of a field, how do we look into free text where people might write names or telephone numbers? And we have the ability to decide how much of that we share, not only as open data, but also with people in our organization and with partners. Even if we are never going to open up a data set like this, we still have to protect data within our own walls.

So with this API, we can classify what kind of data we have, we can transform data, and we can also measure re-identification risk. I'm going to focus on that in a couple more slides. So, when we want to de-identify data, for example, we have a message like this that we recorded somewhere in our databases, and now we have to share it with other employees for analysis, or feed it to a machine learning algorithm, or open it up. On one hand, we can identify what's sensitive: we have names, telephone numbers, a social security number, and so on. Then we can decide how to transform it before sharing it.
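As a concrete illustration of that classification step, here is a minimal sketch of inspecting a piece of text with the DLP API. It assumes the google-cloud-dlp Python client and a placeholder project ID; the infoTypes to scan for and the sample text are illustrative, not the exact configuration from the talk.

```python
# Minimal sketch: classify sensitive data in free text with the Cloud DLP API.
# Assumes the google-cloud-dlp client library and a placeholder project ID.
import google.cloud.dlp_v2

dlp = google.cloud.dlp_v2.DlpServiceClient()
parent = "projects/your-project-id"  # hypothetical project

inspect_config = {
    # A few of the built-in infoType detectors mentioned in the talk.
    "info_types": [
        {"name": "PERSON_NAME"},
        {"name": "PHONE_NUMBER"},
        {"name": "US_SOCIAL_SECURITY_NUMBER"},
        {"name": "CREDIT_CARD_NUMBER"},
    ],
    "include_quote": True,  # return the matched text, not just its location
}

# A valid-but-not-real card number, like the one on the slide.
item = {"value": "Call Maria at (425) 555-0199, card 4111 1111 1111 1111."}

response = dlp.inspect_content(
    request={"parent": parent, "inspect_config": inspect_config, "item": item}
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.quote, finding.likelihood)
```

The same kind of inspect configuration can also be pointed at images or at stored data such as BigQuery tables rather than an inline string, which is how the credit-card-in-a-picture example works.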
We can replace the names, just tagging that there was a first name there. We can replace the social security number with a hash, or with an encrypted version, so we can get back to it if we have the right key. Or we can, for example, replace half of the phone number and only leave some digits if we want to. Same thing when we are de-identifying structured data: if we have a field with each person's job title, we can transform it from "senior engineer" to just "engineer", from the different kinds of managers to just "manager" or "ops", redact it, et cetera.

This is where we start measuring how good a job we did while de-identifying. There are measures like k-anonymity that show, if we group the rows of the data we're releasing, that some groups have a lot of samples and other groups have only one sample. For example, if we just publish the job title of each person, there will be only one CEO, and we want to measure that, because maybe that data point is too identifiable.

So who are our outliers? How do we measure that? Let's get into some details. We have identifiers, things like your full name or your passport number. There are also quasi-identifiers: we think those are anonymous, but sometimes they're not. Say we have a developer advocate from Santiago, Chile speaking here. Maybe Google only has one of those. Maybe it's only me, so just by erasing my name, you're still pointing at me. And there's also sensitive data. We don't use it to identify people, but once they are identified, it reveals their health conditions, their salary, things they might want to keep private.

There are three measures that I can start computing right now on any data set, and they represent how anonymous that data set is. K-anonymity, one of the most well-known; a lot of people use it. L-diversity, which tries to protect people further. And k-map, one that we are proposing, led by some Google researchers.

So let's look at a real data set. Mexico publishes every birth that has happened, every day, over the last few years. I have all of that data: more than 12 million babies born in this time period. We have data for the babies and data for the mothers. From all the variables published here, let's take four. Say I met someone who was born in 2008. I know the gender of this person, something that might be easy to know. And we can ask them: hey, what did your mother do? What's her education? Was she married? No, it turns out she was not. Just with those four variables, can I identify people in this data set, knowing only where they were born and a couple of facts about their mother? Well, it all depends on the combination of those values.

So, this is how I would call the API if I wanted to do this with code: I tell the API, look at my huge data set, look at these four variables, and this is my table. And I get results like this. How do I read them? For example, take someone born in Nayarit, in Mexico, whose mother is a widow with no education, and who is a woman: it turns out there are only five women born during these five years that have those properties. And there are 189 groups with that same number of people. So with these four variables, those small groups add up to almost 1,000 people in this data set who are that exposed.
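The call described above could look roughly like the sketch below, assuming the google-cloud-dlp Python client, a hypothetical BigQuery table holding the births data, and made-up column names for the four quasi-identifiers; the real data set's column names will differ.

```python
# Sketch: ask the DLP API to compute k-anonymity over four quasi-identifiers
# in a BigQuery table. Project, table, and column names are hypothetical.
import time
import google.cloud.dlp_v2

dlp = google.cloud.dlp_v2.DlpServiceClient()
parent = "projects/your-project-id"

risk_job = {
    "privacy_metric": {
        "k_anonymity_config": {
            "quasi_ids": [
                {"name": "birth_state"},
                {"name": "baby_gender"},
                {"name": "mother_marital_status"},
                {"name": "mother_education"},
            ]
        }
    },
    "source_table": {
        "project_id": "your-project-id",
        "dataset_id": "mx_births",
        "table_id": "births_2008_2012",
    },
}

job = dlp.create_dlp_job(request={"parent": parent, "risk_job": risk_job})

# Poll until the job finishes (production code would normally attach a
# Pub/Sub action and wait for the notification instead).
while True:
    job = dlp.get_dlp_job(request={"name": job.name})
    if job.state == google.cloud.dlp_v2.DlpJob.JobState.DONE:
        break
    time.sleep(30)

# Each bucket says how many groups (equivalence classes) have between
# lower_bound and upper_bound people sharing the same four values.
result = job.risk_details.k_anonymity_result
for bucket in result.equivalence_class_histogram_buckets:
    print(bucket.equivalence_class_size_lower_bound,
          bucket.equivalence_class_size_upper_bound,
          bucket.bucket_size)
```

A result like the one on the slide, with 189 groups of five people, would appear here as a histogram bucket whose bounds are five and whose bucket_size is 189.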
Even more, with a combination like this, we have unique people that we can identify. Only one person has a mother born in Quintana Roo with no specified profession. And the question is, what do we do now? Do we erase that row? Do we try to anonymize it, aggregate it with other rows? Or do we just publish this data as it is? It's not clear what we should do, but at least we should know that we are releasing this kind of identifiable data. That's k-anonymity. In this case, the data set that has been released has a k-anonymity of one, because we can identify some people just with these four variables.

Now, l-diversity is another measure. Usually, for example in medicine, a lot of people use a k-anonymity of five: let's have at least five people in each group. But that's not always enough. For example, say we look at divorced mothers who had babies born in Zaragoza and have post-graduate education. I look at the full data set and it turns out there are only five women that match that pattern, and that's okay, k-anonymity is five. But it turns out these five women have something in common: these five women have hepatitis B. So even if we have five people, or a huge group, if all of them share the same sensitive value, we are still revealing their data, and l-diversity here is one. So we might also want to look for ways to make l-diversity bigger.

And then there's k-map, another measure that we are proposing. Let's say we have two unique rows in our data set: one person's ZIP code starts with 85 and their age is 79, the other one's ZIP code starts with 60 and their age is 42. These two people are unique in our data set, so k-anonymity is one. Whether it's okay to publish or not depends. It turns out the first ZIP code has a population of 20 people, so there's probably only one person there who is 79 years old, and the anonymity of that person is not protected. The other ZIP code has 100,000 people living in it, so even with one 42-year-old in our data, it will be very hard to find out who that person is, and they are protected. So just measuring k-anonymity is not enough. We might want to hide the data in some other way; for example, let's remove the age. If we remove the age, both rows will at least have better protection. And to do this kind of attack analysis, we need to bring in other data sets, for example the census of the United States, because the population of each ZIP code is not part of our data set. We need to join it with richer data, and the API I'm showing you does that job.

Another real data set: the New York City taxis that we talk a lot about; more than a billion taxi trips have been published. The first time they published the data, they included an identifier for each taxi cab driver. People eventually cracked that and figured out which driver was which, and that was not good, so in the second iteration they removed that data. That protects anonymity, but on the other hand, some of the problems we want to solve with this data become harder to solve. So should they have removed it or not? That's an open question, but at least we can measure it and decide whether we want to remove that column or do a different kind of aggregation. And now, in the latest release, they have removed the specific latitude where each trip happened. So we keep removing data, and the data set becomes less valuable but more protected. Where do we draw the line? There's a lot more research here. I'm going to keep my talk to 20 minutes because that's what I should do.
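For completeness, here is how the other two metrics described above could be requested. These are privacy_metric blocks that would be dropped into the same risk_job sketch shown earlier; the column names and the "diagnosis" field are hypothetical, used only to mirror the examples in the talk.

```python
# L-diversity: same idea as k-anonymity, plus the sensitive column we worry
# about (a hypothetical diagnosis field standing in for the hepatitis example).
l_diversity_metric = {
    "l_diversity_config": {
        "quasi_ids": [
            {"name": "birth_municipality"},
            {"name": "mother_marital_status"},
            {"name": "mother_education"},
        ],
        "sensitive_attribute": {"name": "diagnosis"},
    }
}

# K-map: estimate re-identifiability against an outside population (such as
# public statistics for the given region) instead of only against our own table.
k_map_metric = {
    "k_map_estimation_config": {
        "quasi_ids": [
            # Tag age with its infoType; let the API infer the ZIP distribution.
            {"field": {"name": "age"}, "info_type": {"name": "AGE"}},
            {"field": {"name": "zip_code"}, "inferred": {}},
        ],
        "region_code": "US",
    }
}
```

Auxiliary tables (for example, a census extract) can also be attached to the k-map configuration when you have your own reference data to model the attacker's knowledge.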
But, for example, this is what happened when we decided to release one of our own logs. A lot of people have been asking for Google Analytics sample data. We had not provided it, because publishing web logs is hard, but recently, last month, we started providing this data set. There we had to make more of these choices, removing some of the location data or changing, for example, the exact times. And to do this, we are using our own APIs too.

So how do we de-identify these things? Say we have this phone number; how do we change it? We can do partial masking, for example, and the API supports that. We can hash or tokenize it, so only some people will be able to get back the original number. Or, even more interesting, we can change the real phone number into something that looks similar, as if it were a real phone number, but it has actually been transformed; and it's not completely random, because we can go back to the original value if we want to, with the right private key.

Other things we can do: how do we bucket data? As I was mentioning earlier, we might have many different types of engineers, and that level of detail is not private enough. This API can bucket them: transform all kinds of engineers into "engineer", all kinds of operations people into "operations". And that anonymizes further, protects privacy further. We can also do some date shifting: instead of including the real date, we move things a little bit forward or backwards. The problem with doing that is that we may lose the relationships between dates. But this tool lets me keep the sequence preserved: I can do some date shifting, some time shifting, and things will still happen in the order they originally happened.

We have a long list of sensitive-data detectors that the API can identify, and we keep adding more for several countries, because we work in many countries and we want to keep growing this list. I copied some here.

My analogy for privacy versus utility: what do we do with matches? A lot of people get burned every year; kids play with matches, and that's super dangerous; fire is not good. But fire is also good. If we were to eliminate matches, we wouldn't be in a better place anyway. So there is a choice we have to make.

Some more URLs. This is the Cloud Data Loss Prevention API; you can go there, and there's a free tier to play with. Some considerations: we could keep talking about how de-identification and privacy impact machine learning and the data sets we use there. This is Ted Desfontaines, one of the researchers working on this API. He has published more about measuring re-identification; he's the one who came up with k-map, and he has written about it. And if you want to find me, you can find me on Reddit, Twitter, Stack Overflow; give me feedback if you want to. So thank you very much. Thank you.
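For readers who want to see how the masking and tokenization options described in this last part map onto the API, here is one more hedged sketch with the google-cloud-dlp Python client. The key, project ID, and sample text are illustrative, not the configuration used in the talk.

```python
# Sketch: de-identify free text by partially masking phone numbers and
# replacing SSNs with reversible, format-preserving tokens.
import google.cloud.dlp_v2

dlp = google.cloud.dlp_v2.DlpServiceClient()
parent = "projects/your-project-id"  # hypothetical project

deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {   # Partial masking: hide most of the phone number, keep the tail.
                "info_types": [{"name": "PHONE_NUMBER"}],
                "primitive_transformation": {
                    "character_mask_config": {
                        "masking_character": "#",
                        "number_to_mask": 9,
                    }
                },
            },
            {   # Format-preserving encryption: the output is still a number,
                # and the original is recoverable with the same key.
                "info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}],
                "primitive_transformation": {
                    "crypto_replace_ffx_fpe_config": {
                        "crypto_key": {"transient": {"name": "demo-key"}},
                        "common_alphabet": "NUMERIC",
                        "surrogate_info_type": {"name": "SSN_TOKEN"},
                    }
                },
            },
        ]
    }
}

inspect_config = {
    "info_types": [{"name": "PHONE_NUMBER"}, {"name": "US_SOCIAL_SECURITY_NUMBER"}]
}
item = {"value": "Reach me at (425) 555-0199. My SSN is 372819127."}

response = dlp.deidentify_content(
    request={
        "parent": parent,
        "deidentify_config": deidentify_config,
        "inspect_config": inspect_config,
        "item": item,
    }
)
print(response.item.value)

# For tables, a record transformation with date_shift_config can shift all
# dates for the same person by a consistent offset, preserving their order.
```

Reversing the tokenization (getting the original value back from the surrogate) goes through the matching reidentify_content call with the same key, which is what makes this different from plain masking or hashing.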