First, thank you all for being here. I flew all the way from San Francisco, so I don't know where you flew from, but I see some familiar faces, and I see some Google Cloud t-shirts. My name is Felipe Hoffa. I've been working at Google for the last eight years. I started as a software engineer, and for the last six I've been a developer advocate. Do you know what a developer advocate is? Yes. Or, as I call it, a software engineer with a license to speak. So I love going to conferences. I love being on social media, on Stack Overflow; you can find me in many places. And I love analyzing data. I love public data sets, and I've been working with them for a long time. Every time I give a public data talk, people start asking me: what about privacy? What are the risks of sharing so much data? That's how I got interested in this topic: how we can protect sensitive data in public data sets, or at least how we can measure the risk. I'm not a lawyer, so I cannot give you the exact answers, but I can show you the right tools that we as engineers can use, and not only the tools, but the ideas behind them. Just to give you an example: I was going to give a talk, and I checked the night before how long it would take me to get to the venue. It was 20 minutes. But it turned out that the next day, when I had to give the talk, it was more like 30 or 40 minutes. Fortunately, Google Maps can predict how long it will take me to get there. But the question is: would you be able to predict how long it takes to get anywhere at a certain time? One possible answer is public data sets. We don't have public data sets for every city in the world, but do you recognize this place? New York? Yes. Where did I get this map from? Actually, this is not a map at all. These are just points that I plotted on the XY plane, and each point is one taxi trip that someone took around New York. So I have 140 million taxi trips per year.
New York, the city, publishes this. I can take all of these points, put them in a chart, and whoa, suddenly New York shows up. The bright areas are the places where people are taking taxi cabs. In fact, you can see that the north of Manhattan disappears here. This is not because the place doesn't exist; it's just that taxi cabs are not so interested in going there. So the city had to change the laws to create two types of cabs, where some cabs are not allowed to pick up passengers here and can only serve the other areas, and that brings fairness. Why is there light inside Central Park? Yes, because there are streets inside Central Park that taxis are allowed to drive on, so people take cabs there. You also have these super bright spots at the airports. So there is a lot of interesting data here. I just ran a query, got the data, and was able to scan it, and it's super cool. Other things you can do with this public data: this chart also looks at the time of day, so you can see how the city breathes and gets more rides at certain hours. You can see the length of the trips, and you can see how people arrive at Grand Central Station in the morning and start going everywhere around the city. This chart is really interesting. The blue line is how many taxi rides people are taking per hour. Basically, everyone starts waking up at 5 AM, the peak is around 7 PM, but there's a huge drop at 5 PM, when suddenly there are almost no rides. Does anyone know why? Shift change, yes. The problem at 5 PM is that the law says no taxi driver can drive a taxi for more than 12 hours. This is to protect passengers, and to protect taxi drivers. The law doesn't tell them when they have to stop working, but they all decided that the perfect time to do this would be 5 PM. So they all go home at 5 PM, they switch the cabs with another driver, and who loses?
Everyone in the city who wants to take a cab at that time. And what we can see in the red line is for-hire vehicles, like Uber, where there is no drop at 5 PM. So it's really interesting to take public data and make really evident the problems your city might be having, like this one. This is a real problem. In fact, this is from 2015. If you go to 2016, you can see that Uber and other for-hire vehicles start matching the demand of regular taxis. In fact, at 8 AM, when people really need a cab at their home, Uber is winning in 2016. And then in 2017, people are just taking more Ubers than taxis. Still, in 2017, the city has a taxi problem at 5 PM, which Uber doesn't have. I'm just trying to show you the value of having public data and how useful it can be for anyone who wants to understand how their city, or anything else, works. On the other hand, there are risks. When New York published this data set, some people found out that they could match individual rides to famous people just by looking at the license plate and finding those rides in the data set. Everyone was super scared, and there was a little controversy. The question here is: how do we find the balance between open data innovation, the value that this data has, and the privacy and protection of people? I don't know the right answer. Personally, I love open data, and I would love to see regulation of how people use it more than of what is allowed to be published. But at least we should be able to measure and find out what kind of private data we want to share or not. On one extreme, you could just lock everything down: no more access to data, no more public data. But I think it's way more important to find the balance, and there is a curve. You can get the full value of public data by just sharing everything, or you can protect privacy by not sharing anything.
Or you can find a meaningful balance through de-identification, aggregation, or creating synthetic data. We should work along this curve, and we should find the sweet spot for all of us. So I don't have the answers, but I have some ideas and some tools that I can share with you. The projector will stop projecting. But yeah, I happen to work at Google. You might know this company; if you don't, you can Google for it. We have to deal with a lot of these problems. We have products with more than a billion users, and we have to comply with regulations all over the world, so we need to deal with these kinds of questions. And we do our best to externalize our best practices and tools. So, is there sensitive data in your data set? Let's say, for example, you are building an AI customer-service bot that has to help people solve their problems. Someone has a conversation like this with your bot: "Hi, this is Samantha Robertson. A few days ago my credit card got blocked. My credit card is this number. Please call me back at this phone number." This is data we need in order to solve her problem. But maybe the rest of our organization, the data analysts on our team, don't need to see all of it. So the first step with a message like this is to identify what is PII: the name, a date, a credit card number, a phone number. Then we can decide what to do with it. How do we store it? How do we share it within our company? How do we share it with regulators? Once we identify the names, we can replace them with tokens. We can make dates less specific. We can say this was a credit card number, but we don't want to remember it anymore. Or, for a phone number, we might want to remember the first three digits to get some coarse location but forget about the rest; that's partial masking of the phone number. Or instead we could encrypt it, and if we encrypt it, people with the right key will be able to get the original back later.
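To make this concrete, here is a minimal plain-Python sketch of partial masking and tokenization. This is not the DLP API; `mask_phone` and `tokenize_name` are illustrative helpers I made up, and real reversible tokenization would use a keyed cipher rather than a one-way hash:

```python
import hashlib

def mask_phone(phone: str, keep: int = 3) -> str:
    """Keep the first `keep` digits (coarse location info), mask the rest."""
    digits = [c for c in phone if c.isdigit()]
    return "".join(digits[:keep] + ["*"] * (len(digits) - keep))

def tokenize_name(name: str, secret: str) -> str:
    """Replace a name with a stable token, so analysts can still join records
    belonging to the same person without ever seeing the name itself."""
    return "TOKEN_" + hashlib.sha256((secret + name).encode()).hexdigest()[:10]

print(mask_phone("415-555-0132"))  # -> 415*******
```

The token is deterministic for the same secret, which preserves joins, but it cannot be reversed; if you need to recover the original value, you would encrypt instead, as described above.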
Or another choice is format-preserving encryption. In this case the number is encrypted, and we can get back to the original number if we have the right key, but it also still looks like a phone number for any purpose, which is pretty cool. And it would be great to be able to do these operations on the fly. Same thing if you have a huge table with data: it would be great to be able to scan it and transform it as needed. Other techniques include bucketing your data. If you have a survey with the job title of each person, instead of keeping all the granular distinctions between a junior engineer and a senior engineer, you could replace all of these with "engineer", do the same with an operations manager, and make your buckets bigger. Or you could, for example, do some date mangling. Let's say you run a hospital and you have a log of what people have done. You might want to randomize the dates. But when you randomize the dates, it's super useful to also preserve the order: you want to keep the order of when a patient arrived and when they left, not the other way around. To do all of these operations, at Google Cloud we made an API public, which you can use to first scan your data, then transform it, and then measure how much re-identification risk remains in your table. So let me show you a little bit about it. First, as I was telling you, it's really important to do a classification of sensitive data. Look into your tables, look into your files, sometimes look into your pictures. If there is any PII there, you can decide to transform the data in many different ways. And you can also go and analyze and measure the re-identification risk. So first we have data with a certain risk, then we can apply certain transformations, and then we can measure again to see how well our transformation worked.
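The bucketing and order-preserving date-shifting just described can be sketched in plain Python. This is not the DLP API; `TITLE_BUCKETS` and the patient events are made-up examples. The key trick is that every date belonging to one patient gets the same random offset:

```python
import random
from datetime import date, timedelta

# Hypothetical bucket map: collapse granular job titles into coarse ones.
TITLE_BUCKETS = {
    "junior engineer": "engineer",
    "senior engineer": "engineer",
    "operations manager": "manager",
}

def bucket_title(title: str) -> str:
    """Map a granular title to its bucket; unknown titles become 'other'."""
    return TITLE_BUCKETS.get(title.lower(), "other")

def shift_dates(events, max_days=100, seed=None):
    """Shift all of one patient's dates by the SAME random offset,
    so the order (and spacing) of their events is preserved."""
    rng = random.Random(seed)
    offset = timedelta(days=rng.randint(-max_days, max_days))
    return [(day + offset, what) for day, what in events]

stay = [(date(2019, 3, 1), "admitted"), (date(2019, 3, 5), "discharged")]
shifted = shift_dates(stay, seed=42)
# The discharge still happens 4 days after the admission, but neither
# shifted date reveals the real dates of the hospital stay.
```

If instead you drew an independent offset per event, the arrival/departure order could flip, which is exactly the failure mode the talk warns about.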
And then we can decide in an analytical way when to stop de-identifying. There are many kinds of sensitive data. It's easy to say that PII is sensitive, of course, but so are my financial data and my health data. And this data lives not only in tables: it can be in files, in documents, in images. The information can come from many places: from your users, from your employees if you are in HR, from data that you are sharing with a partner. All of it needs to be accounted for. Do you see any problem here? Yeah, this is an example of sensitive data, a credit card number, that someone can share with you in a picture. I'm going to skip the video now, but this is a whole demo of how the API is also able to identify PII in pictures and redact it, or just mask it, if that's what you want to do. Yes, that's why we use a fake number here, but one that passes credit card validation. If I run this demo with a number that passes the validation, the API will redact it when I ask it to; if I run it with a totally fake credit card number, it will not. So, as I was telling you, the API is able to mask, hash, tokenize, and do format-preserving encryption. And you can use it with calls like this. If you want to make it part of your process, this would be a call where we ask for rows to be de-identified, in this case with a crypto key that we choose, and where we ask for the result of the encryption to use the numeric alphabet for the field employee ID. So if I create reversible tokens for the employee ID, it takes my real numbers and transforms them into numbers that still look like the originals, but now my data is encrypted. And as I was showing you before, we can do the same with tables, both on structured data and on unstructured data like the comments we have stored here.
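The real call is a JSON request to the Cloud DLP API, but the underlying idea of a per-field de-identification configuration can be sketched in plain Python. Everything here is hypothetical: `CONFIG`, the column names, and `hash_token` (a stand-in for reversible tokenization, which in reality would use a keyed, format-preserving cipher):

```python
import hashlib

def mask_all(value: str) -> str:
    """Replace every character with '*'."""
    return "*" * len(value)

def hash_token(value: str) -> str:
    """Stand-in token; the real API can do reversible, format-preserving encryption."""
    return hashlib.sha256(value.encode()).hexdigest()[:10]

# Hypothetical config: which transform applies to which column.
CONFIG = {
    "name": mask_all,
    "employee_id": hash_token,
}

def deidentify_rows(rows, config):
    """Apply each configured transform to its column; leave other columns as-is."""
    return [{col: config.get(col, lambda v: v)(val) for col, val in row.items()}
            for row in rows]

rows = [{"name": "Samantha", "employee_id": "12345", "rating": "8"}]
out = deidentify_rows(rows, CONFIG)
# name is masked, employee_id is tokenized, rating passes through untouched
```

The point of the design is that the de-identification policy lives in one declarative config, separate from the data pipeline that applies it.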
The API is able to identify and transform many, many types of PII from many countries. But you can also define your own, just by defining, for example, a regular expression: anything that looks like this is PII. And not only that, there's also a field where you can say whether something matching the pattern is possible PII or very possible PII. For that, it has the ability to look at the context: for example, if you have this pattern and it's close to something that looks like "medical", then you increase the probability of it being PII and report it as such. You also have the ability to create dictionaries with millions of entries. This is really important, for example, in chemical industries, where they have their own vocabulary with millions of entries, and the API is capable of handling that too. Just to give you a short live example, let me show you the demo, where I can say things like: hello, I'm Felipe Hoffa, and I live in San Francisco. My office is at 345 Spear Street. What can we get in real time? Oh, did I kill it with this? Yeah, that's weird. I'll have to check the demo. So yes, you can see in real time how it transformed the data: it removes my name, it removes my location, et cetera. It also works in other languages: "Hola, soy Felipe Hoffa y vivo en San Francisco" ("Hi, I'm Felipe Hoffa and I live in San Francisco"). It somehow thinks that "hola" is a personal name, but it is still able to detect the other entities. That's why, when it reports PII, it also tells us whether it's possible or very possible. Nothing is 100% certain, but it tries its best. So this was the first step: we identified PII and we redacted it, but now we need to decide whether we can publish the data set or not. There's still risk here, so we need to understand three different concepts. There are my identifiers: my ID, my name, et cetera. But there are also quasi-identifiers: for example, my age, the city I live in, or my zip code.
And there's also sensitive data: data that I might not want you to know about me. Once I'm in a public data set, maybe you can identify me just by knowing my age and my zip code. It all depends on how much you know about me and about where I live. There are some measures that people use to determine whether something is publishable or not. The four measures we're going to touch on today are k-anonymity, l-diversity, k-map, and delta-presence. Is anyone familiar with them? With the first one, yes. k-anonymity is the most popular one. So for example, let's say my company is running a survey, and we're going to report to my manager how much each of his employees likes him, without telling him any names. We're just going to give him the age and the zip code of each employee. Here you can see that there are three people living in this zip code who are 42 years old, and they gave my manager different ratings. We cannot tell who gave him a five and who were the two that gave him an eight; their privacy is protected. But in this table we also have, for example, two people who are 27 years old but live in different places. So my manager would be able to say: ah, this is the 27-year-old who likes me more than the other one. If we want to share this kind of data, we need to decide what to do here. This is what we call k-anonymity: the size of the smallest group of rows that share the same quasi-identifiers. The people aged 42 have a k-anonymity of three, but the others bring this table down to a k-anonymity of one; there are three people that I can fully identify from this data. One way to deal with that would be to delete the three people who are fully identifiable, but then we would be removing data. Or we could bucketize.
Instead of reporting the exact age, we could give a range: people between 25 and 29, people between 40 and 44. And instead of reporting the full zip code, we could report only the first two digits. Then people are no longer identifiable, and you can draw a curve like this that reflects the decision of how much we bucketize: how big we make our buckets while still retaining some information. We could have a huge bucket that encompasses all of the zip codes and all of the ages, and then everyone is super protected, but the information is much less useful. Yeah, that's a very good comment: instead of the age, we could use the birth year. I think I know what you're saying, but let me continue, because we are going to see some other metrics. Now, k-anonymity is one of the most used metrics for anonymity. A lot of the health and science industry asks for at least a k-anonymity of five, stops there, and everyone is happy. But there are more risks than what we can see here. Are the zip code and age enough to re-identify someone? It depends a lot on what table and what data we are talking about. For example, let's say this is a different survey, and here we have people by age, their zip code, and their product satisfaction; basically we know the town they live in and their age. Can we identify which person in that town replied to the survey? Normally, the answer is no. If I tell you that someone in Bantam who is 20 years old gave me a score of five: we are not talking about employees anymore, this is a survey where we asked people on the street. So the answer is, it depends. It depends a lot on the zip code: what do we know about the zip code? For k-map, the population of each zip code matters.
So in Manhattan, in a zip code with 50,000 people, we cannot identify whether someone replied to this survey or not. But there's this other zip code, 8535, that has only 20 people, so the sensitivity depends on the context where people are living. In a zip code with 20 people there's probably only one 42-year-old, so you can go back and identify him. And that information is not in the table anymore: you need to bring in external data sets to be able to measure privacy here. And again, you can measure it. Let me give you an example with Mexican births. Mexico publishes data on all the babies born every year. I took all the tables between 2008 and 2013, that's 12 million babies, and they publish a lot of data for each one: where they were born, the gender of the baby, the weight, the height, some health measures like the Apgar score. They also publish some statistics about the mother: her age, her education, her civil status, where she was born, her job, her health. So we have a lot of data here, and we can do some measurements. Does anyone here know BigQuery? Do you want to see BigQuery in action? Yes, I'm glad you know BigQuery. BigQuery is a cloud data warehouse that is fast and simple, works with SQL, and can analyze as much data as you have. It's always on, it works with your favorite tools, and you can share data with it. So I took all of these Mexican tables, published them in BigQuery, and made them public so you can look at them. If you want to run queries over them, you can do so right now; everyone has a free terabyte every month to run queries. Let me run some queries live. I have my table here, Mexican births from 2008 to 2013. You can see this table has seven gigabytes of data and 12 million rows, 12 million babies. And we can run some queries over it. It has a lot of columns, a lot of data.
For example, I have a query ready here that gets, for each state where the mother was born, how many babies we have. In Mexico City, we have 1.2 million babies; for the smallest state, we have only 3,000. So our statistics, and the privacy here, will depend a lot on whether we are in a big state or a small state. And we can compute interesting stats. For example, let's get the average weight and the average height of each baby, to see where we get the fattest and the tallest babies. And you will be able to look at this. Also, something very important when you're working with public data sets, or any data set: check whether your data is clean. In this case, for example, it's really useful to remove all of the babies that are shorter than 15 centimeters or longer than 60, and babies that weigh less than a kilo or more than 6 kilos. You can find a lot of records with probably wrong data; it's just there, so you need to deal with it. Just to show you a chart of this, this is the distribution I found for babies. You can see that some places have the tallest and also heaviest babies, and these are pretty small babies, et cetera. This is interesting. Again, when I show you the data, I want to prove that it's valuable to publish it: we can go and find interesting things. We can see that for the same height, these babies are super thin and these babies are super fat. Or, well, I should not fat-shame the babies; they are heavier, maybe healthier, than the others. And we have 15 minutes left. Some data I have here, for example, for each baby, is where they were born. I could be talking to someone, and they could tell me: hey, it turns out I was born in this region. And you can probably tell their gender. And maybe you can ask them a few questions: what did your mom do? What's your mom's occupation? Was your mom married?
What can we do with that data? Let's take a look at the table again. Let me run this query. With this query, I'm taking where the mother was born, where the baby was born, the marital status of the mother, the education of the mother, and the gender of the baby (the gender of the mother is female, probably). And you can see that, yeah, in the big state we have 150,000 babies born to mothers who were not married, in a domestic partnership, with secondary education complete, and these babies were male. But at the other end of this query we have a lot of unique people. For example, in Yucatan there was only one baby born to a separated mother with no education, and that baby was female. So here you can identify the baby: the k-anonymity of this row is one. The question is, if you are the state, if you are publishing this data, what do you do about it? Do we remove this data? Do we bucketize it? I don't know the right answer. I love having the data public, but we can tell that there is a re-identification risk: I could find the exact time you were born, and learn a lot more about you, just by knowing these four things. So these are my results. You could run queries like this to measure k-anonymity and other privacy risks. But with the API, I can also just ask it to go check my table: this is the name of my table, this is the configuration for measuring k-anonymity, and it goes, finds, and reports back the results. Like: I found 189 groups that have only five people, and 598 groups with only one unique person. And then I need to choose: do I report all of them, or should I not report them? But if I remove them, if I delete them, I'm removing pretty important data about people who live in less populated places.
And just removing these places from the map also affects public policy and how much care we take of them. 10 minutes. So let's say we delete everyone with a k-anonymity of less than five. For example, we have five births from post-graduate, divorced mothers in Coahuila de Zaragoza between these years. Is it OK to publish this data set if we are sure that the k-anonymity is five? Some would say yes; that's an industry standard. But is it OK? I don't know how I would tell whether two rows are the same mother, but you're right, I had not thought about that problem; that's an interesting one. So the problem I have here is this: if I go to my table and find these five women with these characteristics who had a baby between these years, it turns out all five of them have hepatitis B. So if you are one of these five women, I know something sensitive about you, and k-anonymity did not protect you. How do I measure this? The title says it: the measure for this is l-diversity. I need to look at my sensitive data, at my sensitive columns, and require some diversity in those columns too. That means I might need to make my groups larger, independently of k-anonymity. We already saw k-map earlier, so I'll skip it here. The last thing I want to talk about, just very briefly, is delta-presence. This is a different measure that looks at the problem of simply being present in the table at all. Let's say we are reporting how many people in a certain place have a certain characteristic, and we say seven people in this state have hepatitis B. If it then happens that only seven people between these ages live in that place, you're revealing that all of them have this characteristic. So that's something else you can measure, and for it you also need an external data set with that information.
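To make these measures concrete, here is a minimal plain-Python sketch of k-anonymity, l-diversity, and k-map over a small table. This is not the DLP API; the rows, column names, and population counts are made up to mirror the examples in the talk:

```python
from collections import Counter, defaultdict

def k_anonymity(rows, quasi_ids):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return min(groups.values())

def l_diversity(rows, quasi_ids, sensitive):
    """Smallest number of DISTINCT sensitive values within any quasi-id group."""
    groups = defaultdict(set)
    for r in rows:
        groups[tuple(r[q] for q in quasi_ids)].add(r[sensitive])
    return min(len(vals) for vals in groups.values())

def k_map(rows, quasi_ids, population):
    """Smallest count, in an EXTERNAL population table, of people sharing
    any row's quasi-identifier values."""
    return min(population[tuple(r[q] for q in quasi_ids)] for r in rows)

# Five mothers form a k=5 group, which passes the industry's k >= 5 rule,
# yet they all share the same diagnosis, so l-diversity is only 1.
rows = [
    {"education": "postgrad", "status": "divorced", "diagnosis": "hepatitis B"}
    for _ in range(5)
]
print(k_anonymity(rows, ["education", "status"]))               # -> 5
print(l_diversity(rows, ["education", "status"], "diagnosis"))  # -> 1

# k-map needs external data, e.g. census counts per (zip, age).
population = {("10001", 42): 1200, ("08535", 42): 1}
survey = [{"zip": "10001", "age": 42}, {"zip": "08535", "age": 42}]
print(k_map(survey, ["zip", "age"], population))                # -> 1
```

Note how a table can look fine under one measure and fail another: the hepatitis example passes k-anonymity but fails l-diversity, and the small-zip-code respondent passes within the table but is unique in the wider population, which is exactly what k-map catches.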
We have very few minutes left, so let me go fast. With DLP, again, you can measure all of these. I just want to show you, from an API point of view, how we measure these four things. For k-anonymity, I need to tell the API what my quasi-identifiers are: location, age, et cetera, or birth year instead of age. I also have the ability to group things in case people are duplicated. But basically, to measure k-anonymity, I just tell the API my quasi-identifiers. To measure l-diversity, I tell the API my quasi-identifiers and which columns are my sensitive attributes: where do I want diversity? With k-map, again I have my quasi-identifiers, but to measure it I need external databases of how people are distributed. For regions, we have all of this data preloaded: you just say which column is the region, and DLP will bring in the demographic data. Or I can bring my own auxiliary tables if the information is not there. And the same with delta-presence; basically, the same variables. Five minutes. There's a lot that we have published about this. Something interesting about the taxi data set is that as people started complaining, the city started publishing less data. At first we had the exact location and a hash of the license plate of each car. In the second iteration, they deleted the hash of the driver, so now we cannot follow individual taxis through the day. And that really hurts our ability to figure out what a good policy for the shift change would be, because we cannot look at individual behavior anymore. And now they don't publish latitude and longitude anymore; they just publish the zone where you took a cab. So I cannot draw maps like this anymore; we just know the general area where people are taking cabs. So yes, there is pressure to anonymize this data set.
But the data starts losing its value. As anonymization solutions, you can provide a coarser location, you can remove the driver ID, you can remove infrequent areas. For example, what if we remove all of the people taking cabs in very unpopulated places? That sounds like a good idea, because their privacy is at risk. But then we would be getting less information about them, and we wouldn't be considering, in public policy, the people who really need transportation because they live in remote areas. Something else the city can do is share the detailed data only with certain partners, and not with the general population. We do things like this ourselves: for example, the Google Analytics sample data set that we publish had to be de-identified. My teammate works on privacy; he publishes a lot about privacy and these ideas, and he's one of the engineers who worked on this API. I recommend you follow him: he has a Twitter account and a blog where he goes into real depth on a lot of these things, and I've learned a lot from his posts. What I'm trying to say with this picture is that, in the name of privacy, we could ban public data, just as, in the name of security, we could ban matches. But there's a lot of value in having them, even though they are risky and many kids get burned every year. It's all about finding the balance. And even if you are not dealing with public data, you still share data. You share data with your employees: not everyone in your company should see the full messages. You share data with third parties and partners, and you have all of these tools. If you're doing machine learning, that's a huge problem too: what do you feed into your model? We recently released a tool to get differential privacy when you are training TensorFlow models. It's really cool. I didn't talk about differential privacy at all today.
But if you look at the full pipeline of how you would treat a public or internal data set before running machine learning: first, of course, you want to understand your data, so you need to scan for PII. Then redact or remove the PII that you don't need, or tokenize it if it's important but shouldn't stay at that granular level. And yes, then you can go and use tools like TensorFlow Privacy to remove risk. But if you only use TensorFlow Privacy, your models could still be leaking PII. And if your models are touching PII, even with differential privacy you may have to deal with regulatory requirements, so you still want to go through the first three steps. I talked about some tools today: BigQuery; Dataprep, which I skipped; DLP, which is our API; and Stackdriver, which I also skipped, and which would be our tool to log everything and be able to audit what happened. How are we doing on time? We have about one minute left. You can find me on Reddit, you can find me on Stack Overflow, you can find me on Twitter. And I love, love feedback. So if you liked this talk, please leave me some feedback, and if you didn't like it, please tell me why; you can go to the URL at the top. Thank you very much. Thank you very much.