My name is Christina Lu, and I'm here to talk about engineering privacy and user identity protection from the get-go. Who am I? I am a senior security engineer at Cisco Meraki, and I am also a Certified Information Privacy Technologist. I'm at Clipvilo on Twitter, and I also have a website.

So what will this talk cover? First, we're going to talk about why it's important that we protect user identity and personally identifiable information, and what personally identifiable information and data actually are. Then we're going to get into the dangers of re-identification. And finally, we're going to end with some practical takeaways that people can go and implement.

Now, to demonstrate the power of personally identifiable information, we're going to play an imagination game, and this one is for all the burrito lovers out there. I'm from San Francisco, so it's burritos there. I'm here to talk about Burrito Match. This app is the hottest thing in burrito recommendation engines: with a few things about yourself, it will recommend you your perfect burrito. You have to enter in your dietary restrictions. Are you a pescatarian? Are you vegan, vegetarian, an omnivore? Do you have any other dietary restrictions? Are you gluten-free? Do you have allergies to things like corn or legumes? Do you need your food to be halal or kosher? Do you, for some reason, hate avocados? Burrito Match will take all of that information, run it through its algorithm, and find you your perfect burrito. And not only is it your perfect burrito, it is the perfect burrito that's closest to you right now, because time is of the essence when one is hungry and angry. And this app is so freaking good that you use it every day for six months, because ain't nobody got time to cook.

But what if this app was not forthcoming about its data sharing policies? What if the information that you like every burrito with extra cheese, extra sour cream, and a Modelo, which exceeds the doctor-recommended weekly servings, gets sold to health insurance providers, and your health insurance premiums go up? Or even worse, what if that information gets sold to organizations that do surveillance tracking, and now someone can do religious surveillance because of the location data and the halal or kosher filters? Suddenly this app goes from whimsical and fun to dangerous and disturbing. Thank goodness this app is completely imaginary and only exists for our game. But there were, and are, apps that are personal data nightmares.

Does anybody remember the iPhone 4? Remember how the light would only turn on when you took flash photography, which was something that nobody ever wanted? What we actually wanted was for the light to stay on as a steady beam and be used as a flashlight. And because of this user demand, there was suddenly a proliferation of third-party flashlight apps on the App Store. I'm going to talk about one in particular, and that is the flashlight app built by iHandy. An analysis was conducted by App4D, a mobile security software company, and they determined that the iHandy flashlight app had access to users' location data, could read their calendars, could use their cameras, and had access to the unique identifier of the device itself. With this, it also had the potential ability to pass that information on to advertising networks, all without user consent. And users care about what happens to their data.
In a 2022 Consumer Privacy Survey done by Cisco, 2,600 adults across 12 countries were surveyed, and 76% of those people said that they would not buy from a company they do not trust with their data. And it's not only a trust issue; they view it as a user respect issue, because 81% of those people also said that how a company handles their data is indicative of how a company views and respects its customers. So whatever code you write, it impacts people. Whether that's a burrito app, a chat app, or deployment coordination software, you want your impact to be positive, and you want to be building better products. You don't want unintended consequences hiding in your code or in your architecture, because when privacy and security are mishandled, the consequences can affect people in very real ways. Here is a chart from 2017 from Experian that shows the value of people's data on the dark web: Social Security numbers are worth about a buck, surprisingly, but passport information and passports go for about $1,000 to $2,000.

With this, it should be kind of obvious why privacy is important, but what is it? We usually hear about privacy in terms of buzzwords, and usually in terms of the millions of dollars lost to data breaches. But at its core, privacy is an individual's right to maintain control over their personal information. And this control can be achieved through policies, such as legal policies and corporate policies, but also through technical engineering controls.

Hand in hand with privacy comes security. And with security, we get even more buzzwords and rants; we usually hear about it in terms of the OWASP Top 10, phishing, and threat actors, the most popular being hackers, hackers, and more hackers. These hackers here ended up being the heroes of our story. For those that haven't seen this movie, it's called Hackers, and it's part of your homework assignment to watch it, because every pop culture reference in security is from this movie. Angelina Jolie aside, security at its core is the systems and controls built to protect information. And that information is things like proprietary code, credit card information, and, yes, personally identifiable information, also known as PII. Now, security can help achieve privacy, but it alone is not enough to protect privacy and PII.

When we talk about PII, it's lumped into two buckets: sensitive and nonsensitive. A special note here: what types of information get counted as sensitive versus nonsensitive can vary between industries and laws, so be very careful when you're doing your classifications. Sensitive PII, as defined by the Department of Homeland Security, is data which, if lost, compromised, or disclosed without authorization, could result in substantial harm, embarrassment, inconvenience, or unfairness to an individual. The TL;DR is: if you have information that can quickly and accurately identify an individual, that is sensitive PII. Some examples: Social Security numbers, because that's an important number we need for housing, loans, and employment; driver's license information, because generally we don't change driver's license numbers unless we're moving; and biometrics, because you can't change those at all. The other bucket, nonsensitive PII, is information that by itself cannot be used to quickly and accurately identify an individual.
However, you have to be careful here, because if you have different bits and different types of nonsensitive PII, and you can combine them to quickly and accurately identify an individual, then it becomes sensitive PII. Information like this is very commonly collected for things like marketing, customer service, and other research. So even though by itself it's not considered sensitive, care is still needed to ensure that this data is protected from unauthorized access, use, destruction, all that good stuff.

To be able to protect and still use the data we collect, we can use a concept called de-identification. De-identification is the set of tools and techniques that organizations use to minimize the risk of using, storing, and publishing data containing PII. Here are some common de-identification methods. Special note: the names of these might change depending on what industry you're in or what laws you have to deal with, but the ideas are very similar.

We're going to start with redaction: the removal of data. I like to think of this in terms of military documentation, where you have a letter that says "top secret" and half the letter is cut out so you can't read it anymore. Same idea, the removal of data. Types of data you may want to remove from your data set are things like name and Social Security number.

Another one is masking, also known as pseudonymization, which is the idea of obscuring your PII. If you have a data set with Social Security numbers, instead of keeping them in plain text, can you still get the job done by replacing them with all stars? Or can you run fields through functions that turn them into strings of random numbers and characters?

Another one is generalization: the idea of grouping your PII together. For example, if you have a data set with people's ages, do you still need the actual age of the person to find your answer, or can you just say that this person is over or under 18? Or over or under 65?

And another one is obfuscation: adding noise to your data. Going back to the example of the data set with ages, instead of having the exact age, can we just round it up or down to the nearest decade and still get the job done? Obfuscation can be an aggressive form of de-identification and can potentially make your data harder to use, but there are good use cases for it. If you have incredibly sensitive data sets, like healthcare data, this is probably a good way to go.
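To make those four methods concrete, here's a minimal Python sketch of each one. The record fields, the hashing choice, and the noise range are just assumptions for illustration, not a prescription for how your data should look.

```python
import hashlib
import random

record = {"name": "Dog Friend", "ssn": "123-45-6789", "age": 47, "zip": "94103"}

# Redaction: remove identifying fields outright.
redacted = {k: v for k, v in record.items() if k not in ("name", "ssn")}

# Masking / pseudonymization: obscure the value, or swap it for a stable token.
masked_ssn = "***-**-" + record["ssn"][-4:]
pseudonym = hashlib.sha256(record["ssn"].encode()).hexdigest()[:12]

# Generalization: bucket the value instead of keeping the exact one.
age_bucket = "65+" if record["age"] >= 65 else "18-64" if record["age"] >= 18 else "under 18"

# Obfuscation: add noise, e.g. jitter the age and round to the nearest decade.
noisy_age = int(round(record["age"] + random.uniform(-5, 5), -1))

print(redacted, masked_ssn, pseudonym, age_bucket, noisy_age)
```

One design note: a plain hash of something guessable like a Social Security number can be brute-forced, so real pseudonymization usually means a keyed hash or a lookup table stored separately from the data set.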
Data handling and disclosure is even more important now than before, because as of March 2nd of this year, the company BetterHelp was fined by the FTC. The FTC fined them $7.8 million and charged them with sharing consumers' health data, including sensitive information about mental health challenges, for advertising on platforms like Facebook, without user consent. And this oopsie is real bad, because consumers have the expectation that their health data won't be shared or sold without their consent.

Protecting PII is important because, well, we're not anonymous anymore on the internet, unlike what this New Yorker cartoon states: "On the internet, nobody knows you're a dog." We can potentially be re-identified by combining multiple de-identified data sets. So let's take a look at how we can re-identify our dog friend here by using something like movie data.

Let's say we have a data set from a movie streaming service where you can also rate the movies; I'm going to call this PubFlix. We also have access to another data set with movie ratings and rankings; I'm going to call this data Squishy Tomatoes. In both data sets, you can see that our dog friend liked movies like Lassie, Dog, and the heartwarming Wes Anderson animated movie Isle of Dogs. But they didn't like movies with cats: Cats the musical, Garfield, and Tiger King. Dog friend was also not a fan of Carole Baskin. So we're able to match the PubFlix and Squishy Tomatoes data sets and re-identify our user, because the Squishy Tomatoes data set has their name and profile picture, even though the PubFlix data set was de-identified.

This example, while it sounds kind of nuts, is actually not hypothetical. Two researchers at the University of Texas, Arvind Narayanan and Vitaly Shmatikov, did exactly this. In 2006, Netflix had a contest where you could win a million dollars if you helped them write a better movie recommendation engine. They released a de-identified data set that had over 100 million movie ratings from almost half a million of their subscribers, covering six years' worth of data. What our researchers did was take this data and match it to public records from IMDb, and from that, they could re-identify users in the Netflix data set by matching the movie ratings and posting dates from IMDb. They only needed eight movies, two of them could be wrong, and the posting dates could differ by 14 days. With just that tiny bit of information, the researchers were 99% confident that a user could be re-identified. The researchers also published that other traits, like sexual preference and political party, could be inferred based on how people ranked these movies, because what movies we like are really based on our own personal interests.
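Under the hood, this kind of linkage attack is little more than a join on quasi-identifiers. Here's a toy pandas sketch in the spirit of our PubFlix and Squishy Tomatoes example; all the data, the column names, and the 14-day window are made up to mirror the studies, not taken from them.

```python
import pandas as pd

# "De-identified" ratings: no names, just an opaque user ID.
pubflix = pd.DataFrame({
    "user_id":  ["u1", "u1", "u2"],
    "movie":    ["Isle of Dogs", "Cats", "Cats"],
    "rating":   [5, 1, 5],
    "rated_on": ["2023-01-03", "2023-01-06", "2023-01-04"],
})

# Public reviews: the same movies and ratings, but with display names.
squishy = pd.DataFrame({
    "display_name": ["dog_friend", "dog_friend", "cat_person"],
    "movie":        ["Isle of Dogs", "Cats", "Cats"],
    "rating":       [5, 1, 5],
    "posted_on":    ["2023-01-05", "2023-01-07", "2023-01-04"],
})

# Join on the quasi-identifiers (movie + rating)...
matches = pubflix.merge(squishy, on=["movie", "rating"])

# ...allowing the dates to differ, like the Netflix study's 14-day window.
delta = (pd.to_datetime(matches["posted_on"]) - pd.to_datetime(matches["rated_on"])).abs()
matches = matches[delta.dt.days <= 14]

# Each user_id that maps to a single display_name is re-identified.
print(matches.groupby("user_id")["display_name"].agg(set))
```

The unsettling part is how little code this takes: one merge and one date filter, and the opaque user ID maps to a public display name.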
Another example of re-identification from unlikely data sets comes from an experiment done by Dr. Latanya Sweeney, the founder and director of the Data Privacy Lab at Harvard. Her experiment showed that you can match hospital records to newspaper articles. She paid the state of Washington a whole $50 and got a de-identified data set of patient records. This information contained things like patient demographics, clinical diagnoses, procedures, all that stuff. Again, it was de-identified, so the names and addresses were removed, but some of the records still had the zip code. She then went to a newspaper database called LexisNexis and searched for the term "hospitalization" in stories printed in 2011, in the state of Washington, because that's where this was done. She got 66 articles that matched. Newspapers are in the business of informing the public of current events, so they do publish specifics like name, age, treatment, and other information. She was able to take these two sources, match them together, and see if she could re-identify somebody. And she did.

The box on the left side is the article, and the box on the right side is the patient record. In yellow, "60-year-old" gives us an age, which matches back to the patient data set. In teal, "Soap Lake man" gives us a location, which matched the zip code in the patient data set. In blue, the time, Saturday afternoon. Then the reason he's in the hospital, a motorcycle accident, which matched back to the data set, and in orange, the treatment hospital, Sacred Heart Hospital. By matching all of that, we can see that in the newspaper article, the person's name is Ronald Jameson. (For the purposes of this talk, the name of the poor person has been changed.) And you can now find other information about Ronald Jameson because of the data set: in the patient data, you can see that this person also has Medicare, that they have other heart problems they're dealing with, and that they're white, non-Hispanic. Due to the work of this study, the state of Washington did make changes to increase the anonymization protocols of their public health records. So, a feel-good story here.

In addition to the human consequences of mishandling PII, there are legal challenges and consequences. GDPR — we've heard this acronym a whole bunch. GDPR is really expensive; they have big fines. A less severe fine is 10 million euro or 2% of last year's revenue, whichever is higher. And if you screw up real bad, it is 20 million euro or 4% of last year's revenue. Here's a wall of text: basically a bunch of the other privacy laws we have. I'm not a lawyer, and I'm not going to go through all of these. It's just important to know that there are different laws under which companies can be sued or fined for data breaches and for mishandling data. There are at least 40 other sector- and industry-specific privacy laws, so be careful. And in addition to having a bajillion different laws, there are different thresholds for what you need to do if you have a data breach. For some laws, if you have a data breach affecting, say, 100 million or 100,000 customers, then you have to do a public notification. But with HIPAA, and in some states specifically, the data breach can be as small as 250 people affected, and you would have to do a public disclosure that their data was exposed. It's a lot to remember, and if you thought all the laws were confusing and there are a bajillion acronyms, you're not wrong. Privacy law is a fast-moving and quickly changing area of tech; there were changes made as recently as this year. And at the time of this talk, in the United States, there is no comprehensive federal law that standardizes how PII should be handled; all the existing laws are patchwork and completely reliant on the individual states.

But what can we do? Well, here are nine things. The first rule of PII club is don't collect or store unnecessary data. And the second rule of PII club is don't collect or store unnecessary data. If you remember nothing else from this talk, just remember: don't collect or store unnecessary data. And I'm done. Just kidding.

The second thing you can do is automatically delete old data: create a schedule for when that data is going away. It's called a data retention policy, and modern cloud storage systems like AWS have configurations to make this scheduled, so set it and forget it.

The third thing you can do is use only the data needed to get the job done. Be incredibly selective about the types of data that you'll be processing and storing, and don't be afraid to ask if you can get the job done with a smaller, more limited data set. We want to make it harder for re-identification attacks to succeed.
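As a sketch of that minimization idea, here's one way to do it in code: keep an explicit allowlist of fields so unnecessary PII never enters your datastore in the first place. The field names here are hypothetical, borrowed from our imaginary burrito app.

```python
# Only the fields the recommendation job actually needs (hypothetical names).
ALLOWED_FIELDS = {"user_id", "dietary_restrictions", "zip_prefix"}

def minimize(record: dict) -> dict:
    """Drop everything that is not explicitly on the allowlist."""
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}

signup_payload = {
    "user_id": "u123",
    "dietary_restrictions": ["halal", "no avocado"],
    "zip_prefix": "941",                    # generalized, not a full address
    "precise_location": (37.77, -122.42),   # never stored
    "birthdate": "1990-04-01",              # never stored
}

print(minimize(signup_payload))
# {'user_id': 'u123', 'dietary_restrictions': ['halal', 'no avocado'], 'zip_prefix': '941'}
```

The design choice that matters is the allowlist: new fields are dropped by default until someone consciously decides they're needed, which is the opposite of collecting everything "just in case."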
Number four: store your data in a non-identifiable way if you must store it. Can you break up this data set and store it in multiple systems? Can you restrict access to this data set so that only the people who need it to get their job done have access? That's called the principle of least privilege, for those that don't know. Also, encrypt all of that data at rest and in transit.

Number five: you want to build for privacy and security in the beginning, because, well, it's never cheaper or faster or less effort to bolt it on later. If you bolt it on later, you may have to materially change what you built, or retire it due to privacy law violations. You want to build to the strictest standard; for most of us, that's going to be GDPR, so use that as your guide.

Six: do not test with production data. This is a violation of GDPR and other laws. Also, hackers will very commonly target your development environments, because, let's be real here, our dev environments are never as hardened as our prod environments. Dev is also inherently kind of an unstable environment, where you go to make changes to features, and potentially to security configurations if someone makes a mistake. If you do need data sets for testing, you can check out Kaggle, K-A-G-G-L-E; my favorite data set at the moment is the Thailand tourism data. And also Mockaroo, if you just need to generate test data for things like CSVs, JSONs, things like that.

Seven: implement good RBAC. For those that don't know what RBAC is, it's role-based access control. There should be clear permissions between what your admin roles can do and what your user roles can do. Do this internally in your company, but also build these features into your product, and then you can charge more money. This can help prevent people from stumbling across PII, either through accidental exposure or something more nefarious, like insider threat. Also, if you don't have good RBAC, what's going to happen is that your summer intern is going to get access to admin credentials, make a bunch of configuration changes, leave, and then you have to fix it. So have good RBAC; I'll show a tiny sketch of what this can look like right after these takeaways.

Eight: let your users opt out of third-party sharing. And when they do, don't penalize them; their service should not be affected if they choose to do so. Also, coming, I believe, in June or July in Colorado, you actually have to default users to being opted out of data sharing, so be careful. And in California, conversely, the law is that you just have to let people know that you're going to be doing data sharing. Even from this example alone, you can see the privacy laws are changing fast.

So, nine: work with a privacy lawyer. Also hire or work with privacy engineers, because those are the folks who can help you build the technical implementations to automate some of these privacy controls.
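As promised, here's a tiny sketch of what good RBAC can look like: an explicit role-to-permission table, with a check on every sensitive action. The roles and permission names are made up for illustration; a real system would pull roles from your identity provider and keep the policy table auditable.

```python
from enum import Enum, auto

class Role(Enum):
    ADMIN = auto()
    SUPPORT = auto()
    INTERN = auto()

# An explicit, auditable policy table: least privilege by default.
PERMISSIONS = {
    Role.ADMIN:   {"read_pii", "change_config"},
    Role.SUPPORT: {"read_pii"},
    Role.INTERN:  set(),  # the summer intern never touches PII or prod config
}

def require(role: Role, permission: str) -> None:
    """Gate every sensitive action behind an explicit permission check."""
    if permission not in PERMISSIONS.get(role, set()):
        raise PermissionError(f"{role.name} may not {permission}")

require(Role.SUPPORT, "read_pii")          # fine
try:
    require(Role.INTERN, "change_config")  # blocked
except PermissionError as e:
    print(e)
```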
So, yeah. The code that you write has a human impact, even if at the surface level it doesn't seem that way. We as software engineers are the stewards of our users' data, so it's important to know how our users expect us to protect their identity, and to do it well, because it's the right thing to do, even if it takes a little more time or effort to build. Because at the end of the day, I know that you would want the company responsible for your PII to be taking the utmost care and consideration, and to do the right thing, too. So, again, I am Christina Lu. Thank you so much to Okta and to the folks watching on camera. And here are some sources.