Hey, I can see that Rohan Prabhu is joining us on the Zoom link, so we should be able to start the first talk of the day pretty soon. Hi Rohan, hope you're doing great.

Hello everyone. I'm Rohan Prabhu, a director of engineering at Jupiter Money, and today I would like to talk to you about how we handle PII and SPDI redaction. Before we begin, a bit about myself. I was one of the first employees at jupiter.money, so I've been there from the beginning, and I've been in tech for about eight years, give or take. I was a co-founder and CTO of MITRE.IO prior to this, and before that I have some experience working at Google and Amazon.

So, coming down to it, what are PII and SPDI? PII is what we call personally identifiable information: information that can personally identify you, like your phone number or your PAN, for example; anything that you could use as a lookup key to narrow down to a specific person. SPDI, or sensitive personal data or information, is information that cannot pinpoint you on its own, but if it were released as part of your information, it could be something sensitive that most people might not be comfortable with the world knowing. Your income group, marital status, community, and so on are examples.

Another question is: what is redaction? One might think it's about the couple of X's you see in the number when you look at your PAN or your Aadhaar. That masking is not what we call redaction. Redaction is when we store data in a way that is highly governed, so that most of our microservices, and most of the data stores that use, process, or collect these PII or SPDI data points, do not store the data in plain text.

So why do we need something like this? The first reason is that we're in the fintech space, and because we also onboard customers to open actual bank accounts, we are required to handle sensitive information as part of customer KYC, for example. The other problem is that with a microservice architecture, we've got multiple services, each with its own data store, each simply storing all of the data it receives, so it's very easy to lose control of where your customers' information is being stored. Governance is the real motivation behind PII redaction. The basic idea is that anything that might be PII or SPDI relating to a customer, we bring together and store in a place that we can tightly govern, probably a single store with a very high, elevated level of access. At the same time, when this data is in the application space, it does not reside in plain text. So even if you were to accidentally log it or accidentally transmit it, you're still sending out something that is not decipherable and cannot be traced back to the actual PII or SPDI element.

The basic schematic is this, and it's quite common across microservice architectures. You have a microservice, and there's a standard gateway through which different applications or customer endpoints interact. When the data comes in, before it actually hits the application, within the application realm itself, we have a redaction layer.
In this example, you can see there's a PAN number coming in, and the redaction layer does its magic: it turns it into something that's essentially garbage. It's not legible, it's not something you can make sense of, and within the application realm, that's what the application is dealing with. In our schematic, we use DynamoDB to store this, and this unintelligible form of the data is what we call the tokenized form. I'll walk you through more details in the coming slides.

The other thing we also need is some form of tenantization. We've partnered with a couple of other financial services providers, and each of them wants their customers' data stored separately. And when I say separately, it doesn't necessarily mean just two DynamoDB tables; it could literally mean two different AWS accounts within which DynamoDB is running. So this is an example where, because of the redaction layer, we are in a position to put in a persistence policy that decides where the tokenized form of the data is going to be stored. When it comes down to the application realm, as you can see in the lower right corner, the application has the tokenized data. It has the data of all the users together, across different partners as well, but in an adapted form. The actual data is only there in our folios, in those tenantized options across two different AWS accounts or two different DynamoDB tables, depending on what kind of tenantization we want to achieve.

The implementation is quite simple. There were multiple ways to do this; the route we chose was to put the redaction layer in the serialization and deserialization layer, because that's a common interface point whether you're talking ingress or egress. When something calls a controller, you have to deserialize all of your incoming request objects; when you're making an API call, you have to serialize those API objects. So that seemed like the logically right place, and also the simplest place, to implement something that performs our data-level manipulation across the whole application. What we do in the serialization and deserialization layer (we use Jackson) is pick out the important data elements while this process is happening and tokenize them. By tokenization we mean using a secure hash algorithm to generate a hash, which we then store in our folio. A folio is essentially the place where you store the token alongside the data it actually tokenized. All of this tokenized data goes into DynamoDB, and on DynamoDB we have strict SIEM monitoring. So if anybody tried to get access to it, tried to change their elevated access to get to the AWS account in which it was stored, or even just tried to access it using the AWS console, they wouldn't be able to, and even the attempt gets logged. That is the whole point of governance, as I mentioned: we have strict governance on just this one entity, so this is the only entity we have to worry about.

Coming to tokenization: why a secure hash algorithm? Well, for one thing, it's irreversible, so you really can't brute-force your way back.
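As a rough illustration of the tokenization step described above, here is a minimal sketch in Java. The Tokenizer class name and the Base64 encoding choice are assumptions for the sake of the example, not Jupiter's actual library code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Minimal sketch: deterministic SHA-256 over the sensitive value,
// encoded into a printable token that the application realm deals with.
public final class Tokenizer {

    // The same input always yields the same token, which is what keeps
    // database lookups and integrity constraints working. Note: for
    // low-entropy values (phone numbers, PANs), a real deployment would
    // typically use a keyed or salted hash to resist offline brute force.
    public static String tokenize(String sensitiveValue) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(sensitiveValue.getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(hash);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is always available", e);
        }
    }
}
```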
We use a secure hash because it makes methods like rainbow tables or other brute-force mechanisms extremely difficult, almost impossible, to use. But the great thing is that a secure hash always works one way; it's not randomized. So your lookups and integrity constraints work really well. If I store a phone number and then say, "hey, give me the user who has this phone number," in both cases the phone number is going to be tokenized to the exact same token. So your database lookups, like a SELECT * WHERE phone_number equals some string, will still work. Your integrity constraints will still work within the tables: if you say this field and this field need to be the same, they will be the same, as long as the original data being tokenized was the same.

Then the developer experience, because of course that's a very critical part for us: we have to roll this out across so many microservices and so many service owners. We have built it as a drop-in library. You just put that library into your Gradle dependencies and add in a bean modifier which configures Spring for you, and then there's a Feign interceptor: if you are building Feign clients, you just add in this interceptor and it essentially does all the magic for you. So to reiterate, the principle is simple: on ingress we tokenize sensitive fields, and on egress we de-tokenize sensitive fields.

I'll run you through an example. To set the stage, let's say I have two services: an external data service and a user registration service. What you can see here are the request and response objects for these services. On the registration request there is a field called pan, and what I've done is declare it as a PII-controlled data element: I've assigned it a data element identifier, which in this case is DataElement.PAN. Now there is a request in the other service, the external data service, which also has a field for the PAN. The field name could be different, but what's interesting is that both of them have the same data element identifier. Using that, the library, the entire PII redaction piece, figures out that these are two equal elements. So even if they were lying on two different sides of ingress versus egress, it knows what to de-tokenize and where to de-tokenize from. That's the basic idea (a rough sketch of what such a declaration could look like follows below).

Using this example, I'll run you through a case. Let's say this is our setup: you've got a user registration service and an external data service. The user registration service has a database where it stores the user data: PAN, name, et cetera. The external data service, let's say, is our gateway to all of our external data providers, of which the NSDL data provider is one. In the top right corner, we have the folio, which stores tokens against whatever data they actually tokenized. A request comes in to the user registration service saying, "hey, register this user"; it's got a PAN, it's got a phone number. Just before it moves into the application logic, the redaction layer does its magic: it adds those two tokenized entries into the folio and it morphs the registration request.
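Before continuing the walkthrough, here is roughly what the PII-controlled declaration from the previous slide could look like. The @PIIControlled annotation and the identifier strings are hypothetical stand-ins for the library's actual names.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotation carrying the global data element identifier;
// the real library's annotation and identifier scheme may differ.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface PIIControlled {
    String value();
}

// Request object on the user registration service.
class RegistrationRequest {
    @PIIControlled("data-element.pan")
    String pan;

    @PIIControlled("data-element.phone-number")
    String phoneNumber;
}

// Request object on the external data service. The field name differs,
// but the shared identifier tells the redaction layer that these are the
// same data element, so it knows what to de-tokenize on egress.
class PanVerificationRequest {
    @PIIControlled("data-element.pan")
    String panNumber;
}
```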
So as long as it's in the application logic, this is what it's seeing, which is why, when it generates the next request to the next service, the PAN values are still the tokenized forms. Then when egress happens, the Feign interceptor, which also has the redaction piece in it, does a lookup from the folio and puts in the actual PAN. When the request comes down to the external data service, after the controller layer and before it hits the application realm, the data is redacted again, and that is what you see. It goes down to the Feign client to make that external call, and while it's making that external call, again on egress, the data gets de-tokenized to what the external data provider is actually expecting.

Coming back, you get a response with the name. It's also redacted, because I would have annotated it with a PII-control annotation. And let's say at this point in time the developer has some kind of debug log that's been left behind which says, "hey, we got the user's name." Because within the application realm the actual name was never present, what you see getting printed is again this garbage tokenized value, which is unintelligible. You can't make sense of it, so it doesn't matter that it's there. This response goes back all the way; on any form of egress we always de-tokenize, and within the application realm it's always tokenized again. At this point, let's say the application logic is that if the registration piece works well, we store these details in the database. So I've stored the phone number and I've stored the name, but because this is happening within the application realm itself, as you can see, the phone number and name are in tokenized formats. There's nothing sensitive, nothing PII, about this particular piece of information. Then we get the response back, and the name in it is still redacted, because we're still in the application realm. Just before the controller is about to send this data out, the final de-tokenization happens, and then this is what the user would see.

So coming down to it, at the end of all of this, as we saw, there were two microservices and an external data provider. The external data provider doesn't understand our tokens at all, but we were still able to give it exactly the data it needed. And within the organizational realm, whatever happened, whatever state changes were made, whether it was an entry written to your logs or any database entries that were made, all of them still hold only the tokenized information: again, nothing that you could trace back to the actual user's data. Now, if we had another API which said "get user by phone number," the integrations into Spring would again do their thing and tokenize the phone number, and because it's a secure hash algorithm, we would get the exact same token that you're seeing, the one starting with "JEE," and your database lookups would still work. So our primary key lookups and our basic integrity constraints are maintained this way. It doesn't matter which part of the application the data is in: as long as the input data was the same, the token will be the same.
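Here is a small sketch of why "get user by phone number" keeps working. In the real setup the redaction layer tokenizes the incoming parameter implicitly during deserialization; the call is inlined here only to make the determinism visible. UserRepository, User, and the Tokenizer from the earlier sketch are all hypothetical.

```java
// Hypothetical repository keyed by the tokenized phone number.
interface UserRepository {
    User findByPhoneToken(String phoneToken);
}

class User {
    String id;
    String phoneToken; // stored tokenized, never in plain text
    String nameToken;  // stored tokenized, never in plain text
}

class UserLookup {
    private final UserRepository repository;

    UserLookup(UserRepository repository) {
        this.repository = repository;
    }

    User getUserByPhoneNumber(String plainPhoneNumber) {
        // Same input, same token: behaves like
        // SELECT * FROM users WHERE phone_number = <token>
        String token = Tokenizer.tokenize(plainPhoneNumber);
        return repository.findByPhoneToken(token);
    }
}
```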
And of course this solution does not come without its caveats. One thing you might have noticed is that data on the wire is unredacted, because on egress we always de-tokenize. So when these two services are talking to each other, the data does get de-tokenized, and on the wire it remains unredacted. How we currently alleviate this is that all our internal networking is provisioned via strictly controlled and standardized Envoy configurations, and the logs again have very strict SIEM monitoring, so nobody has access to our Envoy and proxy logs within the network. That is one of the ways we still safeguard our customer data, but it would be a great step ahead if we removed that gap as well.

One other major problem we face from time to time is that library integration is required, so it imposes limitations on tech choices. You've got to stick with Spring unless there's a library integration for something else. The serialization and deserialization requirement is also strict: it's got to be Jackson, because the library is deeply integrated with it. A related issue is what happens if you use things that do not rely on serialization and deserialization at all. For example, if you were just writing to a file (we would highly discourage writing to a file, but you get the idea), then unless serialization and deserialization is part of the data transfer process, and you're using only the libraries that we support, we cannot implement PII redaction over it. So that brings in a hard dependency on using these things.

Validations and transformations are also tricky. Take transformations, sometimes very simple ones, like phone numbers: sometimes you want the request to contain a phone number without the international code, and then before calling an external partner you want to prepend 91, or whatever the country code is, to it. Those kinds of transformations are really tricky. Validations are tricky because you can't do validations of the form "length should be between 10 and 50 characters," for example, since all you have within the application realm is a token. The way we have gotten around it is with a kind of validations-and-transformations VM. On a given PII-controlled field, you declare a set of instructions that have to be performed: validation instructions are check-type instructions, and transformation instructions are string-to-string transformations it can do for you. All of this processing happens in the redaction layer itself. So it's not a lambda that you're passing in, through which the developer could get access to the data; it happens within the redaction layer, and the developer is just saying "do A, B and C," not supplying actual functions (a rough sketch of this instruction idea follows below). The only problem is that as we come up with more validations and trickier transformations, there's a slight maintenance overhead.
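A minimal sketch of that instruction idea, assuming hypothetical instruction names: the developer only references instructions by name, and the implementations live inside the redaction layer, so no developer-supplied lambda ever touches the plain value.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Sketch of a validations-and-transformations instruction set. All
// instruction names here are made up for illustration.
final class RedactionInstructions {

    // Transform implementations owned by the redaction layer, keyed by name.
    private static final Map<String, UnaryOperator<String>> TRANSFORMS = Map.of(
            "prepend-country-code-91", v -> v.startsWith("91") ? v : "91" + v,
            "strip-whitespace", v -> v.replaceAll("\\s+", ""));

    // Check-type instructions, also owned by the redaction layer.
    private static final Map<String, Predicate<String>> VALIDATIONS = Map.of(
            "length-10-50", v -> v.length() >= 10 && v.length() <= 50);

    // Runs inside the redaction layer on the plain value, just before it
    // is tokenized; application code only ever sees the resulting token.
    static String process(String plainValue, List<String> instructionNames) {
        String value = plainValue;
        for (String name : instructionNames) {
            if (VALIDATIONS.containsKey(name)) {
                if (!VALIDATIONS.get(name).test(value)) {
                    throw new IllegalArgumentException("validation failed: " + name);
                }
            } else if (TRANSFORMS.containsKey(name)) {
                value = TRANSFORMS.get(name).apply(value);
            } else {
                throw new IllegalArgumentException("unknown instruction: " + name);
            }
        }
        return value;
    }
}
```

A field declaration would then just carry something like {"strip-whitespace", "length-10-50"}, and the layer would run those in order before tokenizing.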
One of the most interesting things that we have experimented with, and want to do next, is to implement this at the service mesh layer itself. We use Envoy and Istio, and Envoy has this cool feature called filters, which are written in Wasm; Rust is a language that compiles to Wasm as a target, so we explored that. Currently, Envoy's Wasm filters are in beta, and the crypto support for Rust on Wasm is still maturing, so that really limits how much we can do. But if we were able to do that, it could mean zero dependency on the service itself. The reason that is great is that even today, the redaction layer still resides within the application realm: it's the same memory space. This would completely remove any dependency on the service owner or maintainer to keep the PII redaction piece afloat. So this is one of the most exciting things that we've been experimenting with, and we definitely think it's the way ahead.

The second thing we want to do is have our filters understand an organizational realm rather than an application realm. That would simplify things a lot, because when two services are talking to each other, we wouldn't have to de-tokenize at egress and tokenize again at ingress, since the traffic stays within the same organizational realm. So when these two services are talking and the organizational realm is a live concept, the data could flow between microservices completely tokenized, and that would contribute a lot more to correctness. And with that, thank you so much for your time. That was pretty much what I had. I would love to hear your questions.

Hey, thanks Rohan for such a great talk, with a lot of technical detail on how to actually start redacting data. And thank you for joining us right now. First I will go to the audience and check if there are any questions they want to ask. Let me check the Q&A tab. Okay, I don't think we have any questions from the audience on Zoom as of now, and no questions on the YouTube live stream either. Okay, I think we have one question coming up. Lava Kumar Kuppan asks: do you have any variable naming convention so that by reading the code you know which variables have redacted PII and which ones have unredacted PII? Or do you use some other means to achieve this? Rohan?

Yeah, great question. We don't have conventions on the variable names, because that would again involve a deeply ingrained process, an alternate process where some convention has to be followed, and with these processes the biggest question is who is going to enforce it, who is going to be the maker and checker of it. As we showed in one of the slides, the idea is that we use annotations, and the data element identifiers are global. We maintain a list of common data element identifiers; these are essentially strings or URNs, and you could use any form of naming convention to say, "hey, this is redacted information for me." Transformations then happen within that (a sketch of such a registry follows below).
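A small sketch of what that global list of data element identifiers might look like as a central registry. The constant names and URN strings are illustrative assumptions, not Jupiter's actual registry.

```java
// Hypothetical central registry of data element identifiers, expressed
// as URN strings on an enum so every service shares the same names.
public enum DataElement {
    PAN("urn:pii:in:pan"),
    PHONE_NUMBER("urn:pii:in:phone-number"),
    NAME("urn:pii:in:name"),
    INCOME_GROUP("urn:spdi:in:income-group"),
    MARITAL_STATUS("urn:spdi:in:marital-status");

    private final String urn;

    DataElement(String urn) {
        this.urn = urn;
    }

    public String urn() {
        return urn;
    }
}
```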
But an interesting corollary of this question that I would like to answer is: how do you figure out where you might have PII data in the first place? Macie is an amazing tool for this, and there are other amazing tools as well, which monitor what your logging frameworks are doing once you send your logs to a centralized place, via CloudWatch to S3, for example. What's fantastic is that the more data sources you put in, the more Macie will alert you if you're accidentally putting in data that could be PII. And the greatest thing is that you can be confident in logging anything and everything, because if you do have PII coverage, you should have absolutely no issues with logging everything. Macie then does a good job of telling you whether there are still elements you have missed. So that is something we monitor to continually improve our process.

Makes sense. I think Lavakumar has a follow-up question: can you talk about how you handle PII data sent to third parties from the client side, like third-party data forwarding from web app JavaScript or mobile applications?

So we absolutely don't do that. Almost everything is proxied by us; as a fintech partner that comes in as a limitation, but it's also a great thing, because then we can enforce our practices and governance in protecting our consumers' data. The analytics that we do send are completely anonymized, and what they eventually tag to, which user they tag to, happens as a backend reconciliation process. We do not send any PII data; we actually send no user data directly from the client apps to third-party services.

Which I think is a good practice that a lot of organizations need to follow: third-party data forwarding should always be curated, an explicit choice in a structured format, rather than just deciding to send something. Another question, asked anonymously: is data redaction a CPU-intensive process? If yes, how does it affect latency and infra cost, and how do you optimize it?

PII redaction by itself is not a CPU-intensive process. Generating the hash of course takes up a couple of CPU cycles, but if you actually sequence the things that have to happen to get to a tokenized form, the bigger penalty is what you pay in network IO for writing to and reading from the DynamoDB tables, and that is significantly more than the CPU cycles. SHA-256, the secure hash, has been optimized for decades; it's a standard algorithm, so it's super fast, and we don't need to worry about the performance penalty there. Absolutely, so much research has gone into making it fast. But the penalty we do pay for the DynamoDB network IO calls is absolutely there. Honestly, one of the ways we optimized it is that initially we used to make folio calls at a field-by-field level, and now we batch the entire object together, which required us to go deeper into our serialization framework so that we could batch our calls (sketched below). Beyond that, our avenues are limited, and this is just a penalty you have to accept, because caching, for example, is not an option: if you start caching, the whole purpose is lost. And then the PII is maintained in the... Exactly, exactly; somebody could run an LRU attack or something on the cache. Yeah, a lot of things could be done on top of it. Absolutely. I think this is a necessary cost to pay for privacy, especially in regulated spaces, and most organizations are okay paying that slight latency cost. Absolutely.
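A rough sketch of the field-by-field versus batched folio access just described. FolioStore and its methods are hypothetical stand-ins, not an actual AWS SDK API.

```java
import java.util.List;
import java.util.Map;

// Hypothetical folio access interface illustrating the batching change.
interface FolioStore {
    // One network round trip per token: the original field-by-field path.
    String lookup(String token);

    // One round trip for the whole object, e.g. backed by a DynamoDB
    // batch read; returns token -> plain value.
    Map<String, String> lookupAll(List<String> tokens);
}

final class EgressDetokenizer {
    private final FolioStore folio;

    EgressDetokenizer(FolioStore folio) {
        this.folio = folio;
    }

    // Called once per outgoing object with all the tokens found during
    // serialization, rather than once per sensitive field.
    Map<String, String> resolve(List<String> tokensInObject) {
        return folio.lookupAll(tokensInObject);
    }
}
```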
Sudipta Chakravarthy asks: how do you take care of reducing the attack surface for attacks such as SQL injection? Sure. So I'm not sure that specifically applies to PII redaction, because I don't think the PII redaction layer adds to that surface. I would say this question is not directly relevant to the talk we just had; SQL injection prevention is much more of a framework-driven thing these days. Absolutely: if you are working with a good ORM framework and a good MVC framework, and if you're taking care of application security at the top layer, in the JavaScript layer itself, a lot of those things get taken care of, as long as you're following coding practices that do not allow for SQL injection. Since it's not pertaining to the talk at hand, Sudipta, you can of course ask this question separately in our comment section, and I think we can pick it up at a later point in time. Yeah.

Actually, just an interesting thought I had: if you think about it, having the tokenization actually reduces the chances of a SQL injection attack, because if you injected something like an escaping quote with a "WHERE 1=1" or something like that, it would finally just come out as a tokenized format. Not that I would make that an argument for why you should have this, but just a thought I had. Yeah, absolutely: you don't have any directly obtainable identifiers anymore that you could use to do an injection and iterate over, right? Absolutely.

Okay, I think folks are happy with the answers. Thank you so much, Rohan, for the talk. I do not see any more new questions coming up. Let me just quickly check if anyone has asked anything on... yes, there is one question on YouTube. Aravind Padmanavan asks: could you talk a little about the performance side of this extra layer? Oh, I think we have already covered that. So, Aravind, I hope what Rohan has already answered about the acceptable cost of tokenization and encryption answers your query. Thank you, Rohan, for this great, amazing talk. I really loved how you delved deep into the actual implementation details. Personally, I really loved it. Awesome, thank you so much for having me. Yeah, we're all happy to have you here. Great.