Hey all, my name is Pushpak Pujari and I'm here to give a talk about how to use machine learning to protect customer privacy as a product manager. Before we get started, I wanted to give a quick shout out to the awesome team at Product School. Thank you for inviting me to give this talk and for working with me patiently throughout this journey. You guys are awesome. A bit about me: I currently lead the product team at Verkada on the security cameras product line. Before that I spent a few years at Amazon, where I spent one and a half years in Amazon's Alexa AI group working on privacy-related products and services, and before that, two and a half years in the Amazon Web Services IoT group. I hold an MBA from Wharton and an undergraduate degree in electrical engineering from IIT Delhi. In my spare time, I love playing tennis, hiking, brewing beer, and talking about all things technology. At the end of this webinar, I want you to go away with four key takeaways. I want to start by explaining why privacy is important and why it has become so important in today's world. In part two of the presentation, I will talk about some of the privacy preservation techniques that are used today in industry and academia. In part three, I will do an end-to-end walkthrough of an example of how you can use machine learning to protect privacy. And I will leave you with a few tips and strategies on how to be an effective product manager in privacy. So let's get started. First and foremost, why does privacy matter? Well, it matters because we live in a data-first economy. Data is the new oil, which means that companies collect and retain tons of customer data. Anytime you're ordering something online, or ordering groceries, or requesting a loan provider to approve the loan for the new car you want to buy, your data is being collected and processed by those companies.
Sometimes they also need to retain the data for legal or regulatory requirements. And many times they want to use that data to ultimately give you, the end customer, a better experience through personalized recommendations or marketing. In some cases, the company might just be repackaging your data and selling it off to third-party data aggregators. You don't know what's going on until you read the fine print. The data that these companies are collecting today can contain sensitive information, and if this data ends up in the hands of adversaries or attackers, it can have devastating consequences, not only for the individual, who could have their identity stolen and their financial records erased or misused, but also for the company where the data breach happened. We don't talk about it a lot, or maybe we don't hear about it, but data breaches happen way more frequently than you think. On the right side is a graph from Statista that tracks the number of data breaches reported over the years. If you look at the data here, in 2017 there were 1,632 data breaches reported, and in 2018, 1,244. That's almost four breaches per day, but we don't hear about these breaches because they almost never make it to the mainstream press. The problem is also aggravated by the fact that the data companies are collecting is spread out across organizations and sub-organizations, across different storage media. And once the data leaves the walled garden, it's almost impossible to reclaim control over it, which is why privacy and legal authorities across the world have started thinking about end customers' privacy and what it takes to protect it, and have instituted a bunch of different laws, from GDPR by the European Union to HIPAA, CCPA, and COPPA in the USA, and many, many more that are in the works and coming into effect over the next couple of months or years.
Additionally, there's a growing distrust of social media providers, especially after the Cambridge Analytica scandal, around how different companies are using or misusing customer data. Ultimately, at the end of the day, a customer wants transparency on how their data is being collected and used. And that's not an unreasonable ask. So we have been talking about private data. What exactly does private or personal data mean? At a high level, it could mean two things. First, anything that can be used to directly identify a unique individual. Typical examples include full names, addresses, social security numbers, phone numbers, credit card numbers, et cetera. Then there's another class of indirect identifiers, or quasi-identifiers, that by themselves can't be used to identify a unique individual, but when combined with other data out there, publicly or otherwise, could be used to identify an individual uniquely. Examples include gender, demographic information, salary, and location history. Within an organization, data can be classified into multiple buckets. The most typical classification used for security is shown below. Identified data can contain direct or indirect identifiers and has had no processing done on it. Pseudonymous data has known direct identifiers transformed or eliminated, so it gives a certain amount of privacy beyond identified data. De-identified data takes it up a notch, where direct and known indirect identifiers have been removed. And anonymous data is the North Star, where there are mathematical guarantees that the data cannot be used to identify an end customer. This is represented in the graph on the right, where personal data, say John Doe, can be transformed into gibberish pseudonymized text using some kind of tokenization or keyed function. This can be one-way or two-way. We'll talk about pseudonymization later on in the presentation.
And at the bottom is anonymization, where you add a random amount of noise and create a string that has no way of being linked back to the original text. If there's one thing that I want you to take away from this presentation, it is this: as product managers, you will be grappling with the privacy versus utility tradeoff, which means that you can't have your cake and eat it too. If you want more privacy, the data has to give up some of its utility. As we can see from this picture in front of us, at the extreme left there is 100% utility but no privacy; we can all tell that this is President Barack Obama's photo. On the extreme right, you can see that we have added a lot of noise, so there is 100% privacy; it's anonymous, and it could be anybody in the picture. But this leaves no utility in the data, and the data becomes pretty much useless for any operation, machine learning or not. So throughout your career and throughout your products, you will always be grappling with finding the optimal point that gives you the most privacy while still retaining enough utility in the data. Now, implementing privacy is a truly cross-disciplinary effort. There are multiple stakeholders that need to work together to ensure that privacy implementation happens consistently across an organization. There's a compliance team that ensures the organization is compliant with the applicable compliance regimes. There's an information security group, or InfoSec, that ensures the data has been secured with the utmost integrity and the tightest controls. There's a legal team that gives you guidance on how to deal with the privacy laws.
And then of course there's a privacy engineering and product team that is responsible for building out the tools and experiences that allow privacy-preserving computation to happen while delivering an amazing customer experience. Why do we need to think about privacy? Well, there are obvious downsides of not thinking about it. First, if the company has data breaches, it could be subject to huge fines by the Federal Trade Commission, or FTC. GDPR stipulates that a big breach can lead to a fine of up to 4% of annual revenue, so the penalty is huge. At the same time, if this happens repeatedly, the company might actually lose its business license. And not to mention the fact that a data breach definitely erodes customer trust and loyalty. So there can be far-reaching consequences of not taking privacy seriously. On the other side, taking a privacy-first approach gives companies increased loyalty and retention, can help with customer lifetime value, and drives higher conversion rates, because your customers will trust the platform more. For some companies, like Apple, it has also become a very strong competitive moat, with the iPhone being touted as the privacy-first phone in ad campaigns across the world. So from a business standpoint, privacy-first positioning is table stakes. There's no way around it, and you need to start thinking about this if you haven't already. So what are the different sources of privacy risk within an organization? There are three big ones. One is, of course, the raw customer data and any of its derivatives that are being stored by the company. Second are the metadata and logs that could be joined with other data sets to reveal or identify the end user. And third are machine learning models, which, though it may be unintuitive, are a derivative of the raw customer data.
As such, they might retain some characteristics that can be used by clever attackers to reconstruct the original data set. Of these three, the raw customer data is the most important thing to protect, but machine learning models should not be ignored either. Now, protecting machine learning models is a full seminar, or series of seminars, in itself, but here is a 30,000-foot view of why even machine learning models present privacy risks. At a high level, you can think of a model as a derivative of the data: it takes in the training data and summarizes it, and is then used to do inference and make predictions. Depending on the architecture, a machine learning model that falls into the wrong hands can be inspected and its contents examined, and an attacker can run a series of attacks on it. One is the model inversion attack, where the attacker tries to get back the feature vectors, which in turn can be used to run a series of reconstruction attacks and recover the actual raw data. At the bottom is the membership inference attack, where an attacker has a test data set and tries to guess whether any of the data in it was used to train the machine learning model. If the attacker finds out that it was, that breaches the privacy of that individual. The logic behind this attack is that a machine learning model behaves slightly differently when it sees a data point it has seen before, for example as part of its training data set, versus a data point it has not seen before. Here's a graphical representation of that: for non-members, data points it has not seen, it might have a distribution like this.
But if it had already seen a data point in the training data set, the distribution is slightly different, and that delta is enough to tell smart attackers that a particular data point was present in the training data set, thereby breaching the customer's privacy. So far I've been talking a lot about why privacy is important and why it's critical for us to protect customer data. My objective here is not to alarm you; locking away the customer data in a secure vault and throwing away the keys is not really the answer. What we as product teams want is to be able to use the customer data to deliver an amazing customer experience, but without sacrificing privacy. For the rest of the presentation, I'll focus on how you can do that using machine learning. In part two, I will cover some of the privacy preservation techniques that exist in academia and industry today. Broadly, you can break these techniques into two buckets: data sanitization techniques and privacy-preserving computation techniques. The former relies on perturbing the data, changing some of its contents, so that it cannot be used to uniquely identify an individual. Privacy-preserving computation, on the other hand, relies on not changing the data per se, but changing the way the data is computed on, so that you can still get insights from the data without sacrificing privacy. Privacy-preserving computation would require multiple webinars of its own, so in this webinar I'll focus on data sanitization techniques and perhaps come back to privacy-preserving computation at a later point in a different webinar. We touched on this briefly earlier, but here are some of the direct identifiers that HIPAA requires you to address.
As you can see, any of these by itself can be used to uniquely tie data back to an individual. So our first technique for protecting privacy is simple: why don't we just detect the different kinds of direct identifiers that exist in the data and remove records when we find any? This is the easiest technique to implement, but unfortunately it gives you no measurable privacy guarantees. You are only addressing identifiers that you have access to or that you know exist; you're addressing the known knowns, and you don't even know what the unknowns are at that point. This method requires humans in the loop to detect and correct, or actually to generate, the high-quality training dataset that is used for machine learning training. And once you have defined these identifiers and built a solution, extending it as the business grows to different locales and countries is pretty hard and takes an enormous amount of time. Another technique that gets us closer to anonymization is transforming the data, using some kind of mapping function, into unique tokens that by themselves cannot be used to re-identify or recover the original data. This mapping can be one-way or two-way. Here we have taken a credit card dataset and transformed the middle digits into gibberish that by itself cannot be used to recreate the original data. This definitely gives us stronger anonymization, is not that hard to implement, and keeps some of the utility of the data because it still allows you to join datasets. But it is not foolproof, because an attacker who knows the logic of the anonymization, along with the token, can potentially extract the original data. And it needs to be consistently enforced across the different endpoints within an organization, which can be challenging.
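To make the tokenization idea concrete, here is a minimal Python sketch of a one-way keyed mapping over the credit card example. The function name, key handling, and format are illustrative assumptions, not any specific product's implementation; a keyed HMAC is used so that an attacker who knows the algorithm but not the key cannot reverse the tokens.

```python
import hmac
import hashlib

def tokenize_middle_digits(card_number: str, secret_key: bytes) -> str:
    """Replace the middle digits of a card number with an opaque token,
    keeping the first and last four digits for readability and joins."""
    head, middle, tail = card_number[:4], card_number[4:-4], card_number[-4:]
    # Keyed HMAC: deterministic, but irreversible without the key.
    token = hmac.new(secret_key, middle.encode(), hashlib.sha256).hexdigest()[:len(middle)]
    return f"{head}-{token}-{tail}"

key = b"demo-secret"  # in practice, keep this in a secrets manager
masked = tokenize_middle_digits("4111222233334444", key)

# Deterministic: the same card always maps to the same token,
# so two datasets can still be joined on the tokenized value.
assert masked == tokenize_middle_digits("4111222233334444", key)
assert masked.startswith("4111") and masked.endswith("4444")
```

Because the mapping is deterministic, analytics joins still work, which is exactly the utility-preserving property mentioned above; the flip side is that determinism is also what a clever attacker can exploit if the tokenization logic and key leak.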
Another way of getting stronger anonymization guarantees is k-anonymization, where the objective is to make each record indistinguishable from at least k minus one other records. This is done by smoothing or generalizing the dataset. For example, for zip codes that start with 944, we could just truncate to the first three digits and use that as the training data. Same with age: we could round it, or converge it to a central data point for a range of ages. As you can imagine, k-anonymization gives stronger anonymization guarantees than PII detection and tokenization. However, because you are adding noise to the data and perturbing it, it does impact the utility of the underlying data. More importantly, choosing the right k value and logic is hard, and the generalization logic itself, what algorithm you use to smooth out the dataset or take an average or median, is very custom and requires custom implementation, which can be expensive. And lastly, we'll talk about differential privacy, which is the state of the art when it comes to providing anonymization guarantees. This is a technology that gives you measurable, mathematically proven privacy guarantees. The fundamental idea behind differential privacy is simple: a single record should not impact the outcome of a query. Basically, differential privacy ensures that records are hiding within the entire dataset and no one record stands out. There are a couple of downsides of working with differential privacy. It is pretty hard to choose the right parameters, and it requires a lot of careful experimentation. It's also not yet practical for a lot of use cases, because it takes a lot of time to get right and has to be continuously evolved as the dataset itself evolves. And thirdly, maintaining differentially private datasets at scale can be expensive and difficult.
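Here's a small Python sketch of both ideas on toy data: k-anonymity style generalization of zip codes and ages, and a Laplace mechanism for a differentially private count. The bucket sizes and epsilon are illustrative assumptions; choosing them well is exactly the hard part called out above.

```python
import random

# k-anonymity style generalization: coarsen quasi-identifiers so each
# record blends in with many others.
def generalize_zip(zip_code: str) -> str:
    # Keep only the 3-digit prefix, as in the 944xx example.
    return zip_code[:3] + "**"

def generalize_age(age: int, bucket: int = 10) -> str:
    low = (age // bucket) * bucket
    return f"{low}-{low + bucket - 1}"

# Differential privacy: answer aggregate queries with calibrated noise.
# Laplace noise with scale sensitivity/epsilon means no single record
# can shift the answer by much.
def dp_count(true_count: int, epsilon: float = 1.0, sensitivity: float = 1.0) -> float:
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials is a Laplace(0, scale) sample.
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

assert generalize_zip("94402") == "944**"
assert generalize_age(37) == "30-39"
print(dp_count(1000))  # close to, but not exactly, the true count
```

Note how both techniques trade utility for privacy in different ways: generalization destroys detail in individual records, while the Laplace mechanism keeps records intact but only ever releases noisy aggregates.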
For part three, let's go through an end-to-end walkthrough of how we can use machine learning to detect direct identifiers. From a typical product management standpoint, let's start with the use cases first. I've made up an example here. Let's assume that your website has a search bar, and as V1 of this project, you want to scan those search phrases for direct identifiers. If you find any, you want to delete the record immediately. As an extension of this use case in the future, we want to be able to do this en masse, so that an employee who's trying to access customer data, say for analytics, can get back a dataset that has no direct identifiers in it. We'll define a few functional requirements, starting with five different identifiers that we want to detect in V1: full names, addresses, telephone numbers, email IDs, and social security numbers. We will focus our attention on the en-US locale only. And we've set our success criteria as a precision of 70% and a recall of 95%; we'll talk about this in a later slide. For non-functional requirements, we want this to be done in near real time, so we have taken a generous 250 milliseconds as the budget to scan one query. In the future, we also want an API for doing this in batch or in parallel. So what are the requirements to create a PII-detection machine learning model? Well, there are five things that you need: training data, well-defined success metrics, a model architecture, infrastructure that allows you to train, test, iterate on, and host the model, and a plan for a workflow for continuous improvement. Let's talk about each one, one slide at a time. Training data is the most important part of the machine learning pipeline. Basically, we follow the principle that if you use garbage data to train your model, your model will make garbage predictions.
So the trick here is to use training data that is as close to your runtime data as possible in syntax and semantics. Of course, you'll need to work through a bunch of human-labeling challenges, because identifying direct identifiers is a very cognitively challenging task. It's because of context. Let's take an example. Say you hear the word Fargo. You don't know what to do with it: is it an address, the city of Fargo, which would make it a direct identifier for an address? Or is it a reference to the movie, or somebody's name? Even if you hear a name, for example Britney Spears, is that an ordinary person's name, or is it the celebrity they're talking about? If it's the celebrity, it's not really personal data, because the name is in the public domain. But if it's somebody who was named after the famous pop star, then using the name could be a breach of their privacy. So it's genuinely ambiguous what a given identifier really is, and there's a cognitive load and a requirement to know the context behind it. Then, the data you collect and send out for labeling might itself contain examples of such privacy-breaching data points, which, when exposed to a human labeler, could in itself create privacy risk. So there's an inherent risk in getting data labeled as well, which is something you need to think about as you go through the machine learning pipeline. Because the labeling itself is not intuitive here, we want to start tracking labeling metrics, that is, the quality of the labeling that is going on, which in itself can be challenging. And as a product manager, we need to ensure that we are using the right size and diversity of data so that we minimize overfitting and underfitting effects. As a product manager, this is the most important part of the workflow for you to worry about.
What are the metrics of success, and how do I evaluate success? Typically, the most commonly used metrics in machine learning are precision and recall, but which of the two is more important for this use case? Any guesses? Well, for us, it's going to be recall, because the cost of false negatives is higher than the cost of false positives. In other words, the cost of misclassifying private data as not private is higher than the cost of classifying something that was not private as private. In the latter case, you lose a few data points, which is fine; but in the former case, misclassifying private data as not private, you will be using that data across your ML workflow and sharing it with other people, which is going to be a massive breach of your customers' privacy by your company. So as a product manager, for privacy-related use cases, you should emphasize improving recall. Then, as a product manager, you also need to think through sampling challenges, because if you think about how a customer interacts with your search bar, nobody is willingly giving out their private information. Nobody is typing their address or SSN, for example, into the search bar. So in your entire dataset, perhaps only 2 to 5% of the data might contain something privacy-breaching, with the other 95% or more being clean. Your data will have an inherently and massively skewed distribution, where very few data points contain privacy-related artifacts. You should still scan them all, but it makes your job harder, because you need to think through that distribution, and through the sample size you will use to establish ground truth for measuring the success of your models; doing that at scale can be challenging as well. The other thing you need to optimize is how frequently you want to run your measurement workflow.
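To make the precision versus recall tradeoff concrete, here's a quick Python sketch using made-up confusion-matrix counts that happen to line up with the 70% precision / 95% recall goals from the requirements slide:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision: of everything flagged private, how much really was?
    Recall: of everything truly private, how much did we catch?"""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical run: of 100 truly private queries the model flags 95 (tp)
# and misses 5 (fn); it also wrongly flags 40 clean queries (fp).
p, r = precision_recall(tp=95, fp=40, fn=5)
assert round(p, 3) == 0.704  # ~70% precision: some clean data is lost
assert r == 0.95             # 95% recall: few private records slip through
```

The numbers illustrate the point made above: the 40 false positives only cost us some clean data, but pushing false negatives from 5 toward 0 is what actually protects customers, which is why recall is the metric to optimize here.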
The more frequently you run it, the more accurate data points you get, but of course it's going to be more expensive. On the other hand, the skewed distribution might mean that sometimes your sample has no private data in it at all, so you think your model has done its job and you have 100% recall, while at other times you might not. So there will be inconsistencies between different workflow measurement runs, which you need to think about as a product manager. Now, we've talked about training data, sampling, and success criteria. Let's talk about what kind of model architecture you should use for this task. At the highest level, perhaps the easiest thing to do is to put together a binary classifier, feed it a bunch of training data, and ask it whether the data contains private data or not. It is pretty easy to implement, but the challenge is that it's very hard to attribute what is driving the success, or lack thereof, of your model. On the other hand, you could use a pattern-matching mechanism such as regular expressions to detect the direct identifiers. Regexes are really good for anything that has a consistent schema. In our use case, telephone numbers have a very strong schema: they need to be 10 digits, and if it's not a 10-digit number, it's not really a phone number. Social security numbers are typically nine digits, and email addresses look like something@something.com or .edu, so you can easily set patterns or rules for detection. But regexes suffer from several downsides: they're extremely dumb and really hard to generalize. For example, if you have a nine-digit number, is it really a social security number, or a telephone number with a digit missing? Regexes can't tell you that.
Regexes are also hard to expand and scale, especially as your business expands across different locales, regions, and languages; you need to create a custom rule every time you expand. Then there are named entity recognition models, or NER models as they're called, which are very good with contextual data. There are a few examples, from Stanford NER to recent state-of-the-art transformer-based architectures like BERT and others. These are ideal for our use case, especially for names and addresses, which rely a lot on the underlying context. But implementing and hosting them is computationally expensive and can require enormous amounts of data to reach good accuracy. The overarching point of this slide is that there is no one-size-fits-all solution here. It all depends on what your use cases are and what you're trying to accomplish, and trial-and-error experimentation is the key to choosing the right architecture. For our use case, we will pick NER-based models for names and addresses, and for the others we will go with regexes because they're easier to build. And again, we're starting off with a V1, so this is what we will choose to implement. I won't spend a lot of time talking about infrastructure. This depends a lot on what your company is already using and what expertise your team has. But all the public cloud providers, AWS, GCP, Azure, you name it, have infrastructure for model training, testing, hyperparameter optimization, hosting, and ensuring that models keep working over time, that is, MLOps. And the framework to choose, whether it's PyTorch or Keras, really depends on what your scientists are comfortable with, so this is something you should decide jointly with the team implementing the solution.
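A sketch of the V1 dispatch just described: regexes for rigidly structured identifiers, and an NER hook for context-dependent ones. The NER part is stubbed with a tiny lookup here, since a real tagger (Stanford NER, a BERT-based model, etc.) needs training data and hosting; the patterns and names are illustrative, not production-grade.

```python
import re

# Structured identifiers: regexes work well because the schema is rigid.
REGEX_DETECTORS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def ner_detect(text: str) -> list:
    """Placeholder for a trained NER model that handles context-dependent
    entities like names and addresses. Stubbed with a tiny lookup so the
    routing logic is runnable."""
    known_names = {"john doe", "jane doe"}
    return ["name"] if any(n in text.lower() for n in known_names) else []

def scan_query(text: str) -> list:
    """Return the identifier types detected in one search phrase."""
    hits = [label for label, rx in REGEX_DETECTORS.items() if rx.search(text)]
    hits.extend(ner_detect(text))
    return hits

assert scan_query("email john@example.com") == ["email"]
assert scan_query("looking for john doe") == ["name"]
assert scan_query("best hiking trails") == []
```

The design choice mirrors the slide: keep the cheap, predictable detectors as regexes, and reserve the expensive contextual model for the identifiers where pattern matching genuinely cannot work.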
Yet another part that is very important to think through, especially as a product manager, is how you plan to improve the model continuously in the field. You need to retrain your model periodically so that you're picking up changes in search queries and search patterns over time. For example, with COVID, search patterns might have changed; if there's high inflation, maybe they change again. So it's important to keep your model up to date. You have to ensure that you're tracking the model performance metrics regularly, so that when things are not looking good, you can quickly attribute the problem and fix it. At the same time, the other side of the puzzle is that you need to optimize the training frequency, because every time you retrain your model you incur a significant cost. So finding the sweet spot that keeps the model up to date without costing too much is a decision you need to make as a product manager. Models left unattended can drift in their performance over time, so you should watch out for that as well. And, something we've talked about before, you should also track labeling quality, that is, how much success you're having at getting good ground truth, and keep optimizing the labeling workflow as you expand into new languages or locales. Phew, we are through 75% of the webinar. Are you still as excited about it as I am? For the final part, I will leave you with a few tips and strategies on how to become an effective product manager for privacy. From the outside, privacy can seem very technically challenging and very ambiguous. And it is.
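One lightweight way to operationalize the drift watch described above is a rule that flags the model for retraining when measured recall stays below target for several consecutive runs. The threshold and window here are illustrative assumptions, not a recommended standard:

```python
RECALL_TARGET = 0.95  # the recall goal from our success criteria

def needs_retraining(recent_recalls: list, window: int = 3) -> bool:
    """Flag the model for retraining if recall has stayed below target
    for `window` consecutive measurement runs. Requiring several runs
    guards against one noisy sample (e.g. a sample with almost no
    private data in it) triggering an expensive retrain."""
    if len(recent_recalls) < window:
        return False
    return all(r < RECALL_TARGET for r in recent_recalls[-window:])

assert not needs_retraining([0.97, 0.96, 0.95])         # still on target
assert not needs_retraining([0.97, 0.93, 0.96])         # one noisy dip
assert needs_retraining([0.96, 0.94, 0.93, 0.92])       # sustained drift
```

The window parameter is exactly the cost/freshness knob discussed above: a shorter window retrains sooner but reacts to measurement noise, while a longer one saves training cost at the risk of running a degraded model longer.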
But in my opinion, don't let that deter you, because becoming a product manager for privacy is perhaps the most rewarding opportunity you can get. It is a truly differentiated and distinguished opportunity to stand out and lead a team through ambiguity. There is not a lot of industry knowledge about privacy, so it is a good time to bootstrap yourself, your company, and your organization around what the best practices are and how your company can do a good job of becoming privacy-first. Being a privacy PM requires pure product management: dealing with ambiguity, coming up with use cases, building prioritization matrices. It has it all. It is the dream job, if I may, for a product manager. There is tremendous learning opportunity in working with a truly cross-disciplinary team. There are aspects of machine learning, ML infrastructure, security, legal, and compliance that all come together to create this branch of privacy, and being part of that, you can learn some really specific skill sets that will empower you for your career ahead. Especially if you think about the massive adoption of data and AI, privacy will go hand in hand with it: as AI becomes more ubiquitous, so will privacy concerns. This is a great time to jump in and get involved in this field. And lastly, what I personally find most important is that becoming a PM for privacy gives you a chance to create positive impact, ensure that your customers' data is safe and secure, and really make the world a better place. I mean, what else could you want from an opportunity? Here are some of the strategies that I wanted to share that will give you more leverage within the organization you're operating in. The first and foremost is that you need to set up a very exciting and appealing north-star vision that gets people excited, a goal that they look forward to.
A great way to do that is to start with the goals. What are you working backwards from? A good strategy is working backwards from customer promises: what kind of customer promises or guarantees would you want to make as an organization three years down the line, five years down the line? Working backwards from there and building out an incremental roadmap that gets you there is going to be very important. The next thing you need to do is quantify the impact on the brand from privacy-first initiatives, and ultimately tie it to the organization's business metrics so that your top leaders are able to connect the dots. Then you should find partners within the organization who care about privacy as much as you do. It's typically the CISO's office, the chief information security officer, and the more senior leadership, who really care about the long-term vision and perception of the brand, who will care about privacy. So reach out to them. It's a great way for you to get credibility and visibility at the highest level, and getting them on board as partners will help drive your programs and make the adoption plan much easier. Once you're starting out, you should ensure that you have put together a dynamic team of cross-functional and, more importantly, curious people along on this journey, because there is going to be tremendous ambiguity, and curiosity is the best asset a team can have. We've talked about this earlier, but building out an incremental roadmap that doesn't take huge leaps but instead delivers small, quick wins to your leadership is going to be key here. Keep your senior management updated on the progress you're making and provide continuous visibility, so that they know what the impact has been so far.
And lastly, think about incentivization mechanisms, because adding noise to data or creating sanitized data sets might not be exciting for the machine learning engineers or scientists on your team, even though it's good for the company. So how do you balance those two things? Figuring out the right incentivization mechanisms is something you need to think through deeply as a product manager. Where do you begin in the organization? Well, I would recommend that you start by following the data. Start by charting out the customer data lifecycle: where are all the endpoints where data is being ingested? Where is the data being stored? Who has access to that data? How long do they use the data for? Where do they copy the data, if at all? Where is the data deleted, and how long does it take to be deleted? Charting all of this out will quickly give you a sense of who the actors involved are, where the humans are, and what the use cases are, and will start surfacing the areas that have high privacy threats. So you should start thinking about creating a threat map next. Once you have a threat map, you can identify the top use cases that you want to cover in V1, V2, and V3 of your solution. That's when you should start thinking about the privacy versus utility tradeoff that is inherent in the dataset, and start by identifying the drivers and defining success metrics that you can measure and control. Here are some of the best practices that I have found very useful as a product manager for staying up to date in this field. First, you need to be plugged into what's going on in academia and industry, so that you're up to date with the latest technology you could implement on your own. Second, build out a community within your own organization of like-minded and curious people, so you can work together on ensuring privacy-first practices.
Do attend conferences, meet like-minded people, and share and exchange things that work for you. And lastly, there is no one-size-fits-all solution; experimentation is the key here. Here are a few resources that I think will be extremely helpful for a product manager to get your hands dirty and really understand what's going on. They'll be part of the presentation, so feel free to take them offline and look at them at a later point. Thank you so much. It's been a pleasure sharing my knowledge with you. I have shared my LinkedIn contact; you can find me by my name, Pushpak Pujari, and you can follow me on Twitter at Pushpak Pujari. I look forward to hearing your feedback and hope to stay in touch. Thanks!