Alright, so I've been working on identifying personally identifiable information (PII) in Pixie, and I'm going to start off with some motivation. As you know, Pixie is a system that lets you monitor your Kubernetes cluster, and as it does so it gathers a lot of data, potentially sensitive data. And increasingly we've seen a rise in privacy legislation: in the EU there's the General Data Protection Regulation, the GDPR, and we've also seen similar legislation in California. So users of Pixie are increasingly subject to these regulations and are having to change the way they process and store sensitive data, and of course there's a risk involved: there are greater government fines now for data leaks and exfiltration, so there's more pressure to be careful about the way you process sensitive information. Given these constraints, ideally we want Pixie to be able to detect the flow of PII in our users' Kubernetes clusters and monitor whether potential leaks are occurring, thereby helping prevent fines and legal battles. But to do that, you need to be able to find PII.

So what is Pixie currently able to do? Well, Pixie can drop all columns that may potentially contain sensitive information. This is the brute-force approach: you just drop it all, and you may lose some potentially important information, but you also drop the sensitive PII. Ideally we'd want something more fine-grained than that. We want row-based redaction: for each data sample we have, we want to be able to redact specific PII types. Pixie currently has a UDF to do that, but it's limited to around eight PII types, and they're mostly rule-based, so things like credit cards, IMEIs, that kind of thing. So ideally we want to expand the coverage.

And how do we do that? Well, why don't we just keep writing manual scripts? This is how it's been done so far: keep writing specific PII identification scripts. I'd like to illustrate an issue with this using an example. Here we have a sample JSON payload. It has some different fields, including names, identification documents, passwords, addresses. So clearly here we don't just have rule-based PII; we actually have names, things that can't easily be identified using something like a regex. It's not pattern-based. So this is where regex reaches the end of the line and we have to look at other approaches.

One thing we can do is try to use ML. By definition, ML is a way to avoid having to manually write scripts and manually implement rules. But what does ML need? Well, ML needs a lot of data, a lot of labeled data specifically. And what's our problem here? We've got some input data, usually in text form, and we want to figure out whether it contains PII or not. This is a binary classification problem. So ideally we'd want something that looks like this: we have a text input with some fields, some data; we have a binary label of whether it's sensitive or not; and we have the PII types.

Unfortunately, what I found while seeking out datasets of PII is that it's very difficult to find open source datasets for broad PII categories, because by definition it's sensitive. Unless it's a data leak, you won't find these things on the web very easily. So the solution I came up with for this issue is to generate data synthetically. I built a tool that generates synthetic data in a format similar to the data that Pixie collects from a cluster. So what does this look like?
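For concreteness, a labeled record of the kind just described might look roughly like this. This is only a sketch: the field names are illustrative, not Privy's or Pixie's actual schema.

```python
# A sketch of one labeled training record for the binary PII
# classification task. Field names here are illustrative only.
sample = {
    # raw text input, e.g. a JSON request body captured from a cluster
    "text": '{"first_name": "Ada", "card": "4242424242424242", "theme": "dark"}',
    # binary label: does this payload contain any PII at all?
    "has_pii": 1,
    # the specific PII types present, useful for finer-grained work later
    "pii_types": ["person_name", "credit_card_number"],
}
```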
Well, Privy, as the tool is called, generates data from API specifications. There's an open directory of about 4,000 APIs, everything from Amazon to Stripe. I look through those schemas and essentially match each of the parameters within them to specific PII and non-PII data providers. This gives us a labeled dataset. Then these API payloads are converted to the specific protocol trace formats that Pixie collects, so SQL, JSON, XML, HTML, all these sorts of protocol traces.

All right, so we have the synthetic dataset. A little bit more about how configurable this is. Of course, you want the PII distribution to be representative of the PII that might occur in a realistic dataset. Now, for ML classification tasks, it's very convenient to have 50% of your records carry one of the two labels you're training on, and the other half carry the other label. To do that, I've essentially configured Privy so that it inserts additional sensitive payloads into the already sensitive payloads that the schema analyzer identified, and that equalizes the distribution.

Okay, so what's the use of this dataset? Well, we use it to train ML models that identify PII and to test existing PII identification systems. And it's fairly large: it draws from a lot of APIs, so it can generate over 100 million samples.

Okay, a quick comparison to some existing datasets. As I mentioned earlier, it's very difficult to find PII datasets on the web that aren't data leaks, especially labeled datasets. The closest thing I could find are named entity recognition datasets, which aren't necessarily PII, but they cover a similar category: locations, geopolitical entities, names, that kind of thing. And there are some hand-labeled datasets online. Here we can see one very common dataset called CoNLL, which consists of a lot of sentences from Reuters news articles that were hand-labeled with certain named entities. There's another one that draws from Wikipedia and also has named entities. Now, this is free text, so it's usually not the same kind of format that Pixie collects, because Pixie has these protocol traces that aren't necessarily sentences in English that make sense. This is just to illustrate that Privy generates data with more PII types than these existing datasets, is a lot larger (although it's machine-labeled rather than human-labeled), and has some protocol traces that the other existing datasets don't.

All right. So now that we have a dataset, what do we do with it? Well, we train a model. What I've done is look at two approaches, two types of models to train that can identify PII and solve our binary classification problem. One is a recurrent neural network, an LSTM. The other is BERT, which is the more state-of-the-art approach currently in natural language processing. The reason I went for an LSTM to begin with is that we have some existing tokenization code that makes integration with Pixie easier. It's also a much smaller model, meaning inference time is lower. I'll demo this model a little later. Now, for the actual state of the art, BERT: ideally, you'd take the underlying text representation that BERT has, train on top of it using this new Privy dataset, and then you'd have better performance. I have some benchmarks at the end that I'll get back to.

Okay. So now let's get to the fun part, the demo. I'm going to open up the Pixie UI.
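As a quick aside before the demo, here's a minimal sketch of that conversion step: rendering one generated payload into several protocol trace formats. This is illustrative only; Privy's real converters and APIs differ.

```python
import json
from xml.sax.saxutils import escape

# Sketch: render one generated, labeled payload into multiple protocol
# trace formats, similar in spirit to the conversion step described
# above. Illustrative only, not Privy's actual implementation.
payload = {"first_name": "Ada", "email": "ada@example.com"}

def to_json(p):
    return json.dumps(p)

def to_xml(p):
    return "".join(f"<{k}>{escape(str(v))}</{k}>" for k, v in p.items())

def to_sql(p):
    cols = ", ".join(p)
    vals = ", ".join(f"'{v}'" for v in p.values())
    return f"INSERT INTO users ({cols}) VALUES ({vals});"

for render in (to_json, to_xml, to_sql):
    print(render(payload))  # same labeled content, three trace formats
```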
And right now I have Sock Shop running. Sock Shop is a common demo that we use at Pixie; surprise, surprise, it's a sock shop. I'm just going to demonstrate adding a record into Sock Shop, so registering a user, and then having the ML system classify this newly added record as PII or not PII. So I'm going to click on register up here. And if somebody would like to suggest a name... All right, Ryan. Yes. Your first name. Chang. And your email. All right. And your password. Just say it all out. 123123. Okay, so I'm going to register Ryan. Nice.

Okay, I do have a video, so let me just run the video as a little backup; it could be that there's a duplicate key error here. I'll do another live demo in a second, but this is just to show you what it looks like to register live and then run it in the UI. In this case, I'm putting in some different names than what Ryan just said. I'm going to skip ahead a little bit. So I've registered this user, and then I go back to the Pixie UI, where I have the script that filters. In this case, Ryan is there, but now I've added a user called Smith. I filter in the UI, and it gives me this response body back, and at the end of it there's a number: the closer the prediction is to one, the more likely it is to be PII.

Okay, so that's all good. That was the recording. Now I'm going to run the real-time thing back here against the actual Sock Shop that's already running with data. As you can see, there are some records here, including some HTML, and you can see the number classification down here. In this case, the model was not super confident that it's PII, but if you choose a cutoff of 0.5, then it is. Well, let's see: this is some HTML, but it also has an author field here with a name. So this is sort of an edge case, just to illustrate that it's not always clear whether something is PII or not. In this case, the system thought it's probably PII, but wasn't exactly sure. Let's look at another example here. Okay, we've got some cat socks here, some image URLs; doesn't seem like it's sensitive. And indeed, the model said it's definitely not sensitive. And just to keep scrolling down, you can see some different examples here.

I also have another demo running that essentially samples from the test set that Privy generated. The model was not trained on these samples, but they were generated by Privy. This is just to show a greater variety of JSON payloads that could come in and that the system could classify. So if we look at some examples... okay, let's click on a random one here. We've got a response body with some text. It looks like random things; doesn't look like PII. Let's find a PII example. This one seems to be one. I think it thought that "Whitney and Ferguson" is likely a name of sorts, or an organization name. DBA, I guess that's an acronym for "doing business as," so this is probably an organization. Okay, so that's the demo. We can also test it out a little bit more later.

Let's get to some quick benchmarks here. As I mentioned earlier, there's the state-of-the-art approach, which is BERT. The demo does not currently use it; the state-of-the-art approach usually is better, but it's a much larger model. So what did I do here? Well, I've got a test set that each of these models was not trained on, but that was generated by Privy. And I needed something to compare this to, and the closest thing I found was a BERT model trained for named entity recognition.
Named entity recognition has these sort of four main types, like tagging locations, people's names, geopolitical entities. So I filtered the dataset to only those PII categories, so that I could have a fairer comparison. Then I have this one model that was trained on top of the Privy dataset, and this other model that was trained on a different named entity recognition set with the same PII types. And we can see there's an accuracy improvement on this test set. Happy to talk more about this and how the benchmarks were set up if there are any questions.

What about future work? Well, of course, you can keep extending the dataset. You can use some unsupervised learning techniques to label the dataset. You could even use the model to add labels to new data samples, add those to the dataset, and retrain the model, making it a little bit better each time. Another thing you can do is keep extending the data providers so that Privy generates more PII types; currently around sixty are supported. You could also add more language support: currently English and German are supported. I did German mostly as a proof of concept, just because I know German. And you could also try out some different pre-trained models, which essentially means taking an existing model from online, using it as a baseline, and training on top of it, so using its underlying text representation. All right, that's all from me. If there are any questions, happy to answer those.

Are there any existing open source projects?

Yes, there's a project called Presidio by Microsoft, which does a similar sort of thing, but not for protocol traces. They benchmark some PII identification systems on free text, and you have to label your own dataset. They probably have an internal labeled dataset, but that's not accessible open source. So I looked at that, and that's where I got some of the other models that I tested and played around with, including the state-of-the-art BERT.

So you get the prediction for the entire payload; have you given any thought to how you might identify the actual piece of PII?

Yeah, definitely. That's multi-label classification, and usually multi-label classification is more difficult, right? There are more categories, more chances for the model to get confused. In this case, it was mostly a time constraint that I didn't try it out. Privy actually generates labeled data that tells you which specific PII types are in the requests and which categories they're associated with. So multi-label classification is an extension as well.

You could probably even predict the position of the PII information, right? Like where it is?

Definitely. I think that would require some changes to how the labels are generated, to give the specific token offsets into the string, but I think that would be a fairly minor change to the way the data is labeled within Privy.

For the data that you used to train the model, do you use the entire payload, or specific fields like name and so on?

Currently I use the entire payload, but I add transformations on top of it. To make sure the model doesn't home in on a bracket or some specific tokens that aren't relevant to PII, Privy generates variations of the same payload with certain tokens removed.
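A sketch of that augmentation idea, under my own assumptions about how one might implement it (this is not Privy's actual code):

```python
import random

# Sketch of the augmentation just described: emit variations of a
# payload with some tokens dropped, so the model can't latch onto
# incidental tokens (braces, separators) that carry no PII signal.
def variations(tokens, n=3, drop_prob=0.2, seed=7):
    rng = random.Random(seed)
    results = []
    for _ in range(n):
        kept = [t for t in tokens if rng.random() > drop_prob]
        results.append(" ".join(kept))
    return results

tokens = ['{', '"first_name"', ':', '"Ada"', ',', '"plan"', ':', '"free"', '}']
for v in variations(tokens):
    print(v)
```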
So you're right in saying that currently it's trained on the entire payload, including the name of the parameter and the value. What's often relevant in these sorts of payloads is that you might have a sensitive parameter name and a sensitive value, or you might have a non-sensitive parameter name but a sensitive value. So both of those things are relevant.

Yeah, because I saw in your training data that a generated name, for example, can be just a bunch of symbols, right? Is that, I guess, sensitive in a way?

I'm sorry, could you repeat that?

In the generated data that you showed, I think one of the fields, like a name, was just a bunch of random characters. That seems strange for training.

Yeah. So sometimes, for certain parameters that have non-PII data, I still need a way to generate that data. The way I do it is I have custom string generation methods that generate some random combination of words and, not hexadecimal, but letters and numbers, and then combine them. So if your question is about where this sort of garbage comes from, it's generated by data providers for non-PII payloads.

I guess my question is more: is it the value of that specific parameter that's important, or is it important that the parameter is called "name"?

I see. So you're asking about how it's labeled in the first place when it's generated. The labeling happens like this, if we go back to the Privy diagram. Over here, we have the schema parser. The API specs are given in a format called OpenAPI 3, and that tells you a lot about each of the parameters: what types they might have, and sometimes there's even an enum for the specific values they take on. What Privy does is look through the description, the enums, and all that, and find sensitive PII keywords; it has a word bank of PII keywords. If it finds one, it labels that parameter as sensitive and matches it to the appropriate sensitive data provider. So if it's, you know, a first name, then it matches to a first-name data provider. And that's how the value is generated and how it's labeled.
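To make that labeling step concrete, here's a minimal sketch using the Faker library as the data provider, assuming a toy word bank. The keyword list and provider mapping are illustrative; Privy's real word bank and provider registry are much larger.

```python
from faker import Faker

# Minimal sketch of the labeling step just described: scan an OpenAPI 3
# parameter's name/description for PII keywords from a word bank and,
# on a hit, match it to the appropriate data provider.
fake = Faker()

PII_PROVIDERS = {
    "first_name": fake.first_name,  # keyword -> data provider
    "email": fake.email,
    "phone": fake.phone_number,
}

def label_and_generate(param_name, description=""):
    text = f"{param_name} {description}".lower()
    for keyword, provider in PII_PROVIDERS.items():
        if keyword in text:
            return {"value": provider(), "pii": True, "pii_type": keyword}
    # no PII keyword found: fall back to a non-PII filler string
    return {"value": fake.pystr(), "pii": False, "pii_type": None}

print(label_and_generate("first_name", "The customer's given name"))
print(label_and_generate("theme", "UI color theme"))
```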