A quick word about me. I have over a decade of experience in cyber security, and I'm passionate about both offensive and defensive practices, including security by design, privacy by design and threat modeling. I'm an active member of a community called Null — you may have heard of it — and our most recent project is jobs.null.community, which powers free recruitment for both employers and job seekers. You can always find me on LinkedIn and Twitter; my DMs are open, so if you have any questions, feel free to drop me a line.

Before I get into the presentation, a few ground rules. If you have questions, feel free to raise them in the middle of the conversation. If I don't see a raised hand, leave a message in the chat and I will have a look at it. Everything I talk about here comes purely out of experience — things we have learned while practising streams like privacy by design and security by design.

So let's jump in. The key takeaways I expect the audience to benefit from are these: we will talk about unstructured data and the privacy implications around it; we will talk about privacy-by-design blind spots, and I will give a number of examples of how these areas are often missed when you come up with a data governance framework; and we will cover agile practices for implementing privacy controls. A lot of what I say comes from experience and may not be on the slides, but wherever it matters I have made my best effort to put it on a slide — and if not, we can always have a chat offline later.

Before we get into the unstructured data world, here is a very high-level definition of data governance, for an audience that is largely into governance, privacy and compliance. Data governance is essentially a set of principles and practices that ensure high data quality through the complete lifecycle of your data. If your data sits in one particular system, database or location, it is fairly easy to come up with a data governance framework and then implement rules that achieve the purpose for which the framework was set up. The real problem starts when you don't know where your data sits or what its structure is; as a result, you cannot implement a uniform set of rules to ensure privacy-by-design implementations across the systems in your organization. A typical data governance framework, where an organization has a governance function, answers questions like why, who, what and when: why you need a piece of data, who is going to have access to it, when they will access it, and by what means.
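To make those four questions a little more tangible, here is a minimal sketch of what a single catalog record could look like if you captured the answers as data. The field names are hypothetical and not taken from any particular governance tool.

```python
# A minimal sketch of a data-catalog record answering the "why, who,
# what, when" questions above. Field names are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    attribute: str            # what: the data element, e.g. "customer_email"
    purpose: str              # why: the documented reason for holding it
    owners: List[str]         # who: teams or roles allowed to access it
    retention_days: int       # when: how long it may be kept
    classification: str       # e.g. "public", "internal", "sensitive"
    systems: List[str] = field(default_factory=list)  # where it is stored

entry = CatalogEntry(
    attribute="customer_email",
    purpose="transactional notifications",
    owners=["support", "billing"],
    retention_days=730,
    classification="sensitive",
    systems=["crm", "email-gateway-logs"],
)
print(entry)
```

Even a lightweight record like this makes later activities such as classification, retention and access reviews easier to automate.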
Let's break this complex diagram into some very simple examples, which takes me to the blind spots. A lot of what I'm going to talk about is something we developed over years of experience. When I started the journey of helping different teams in my organization with privacy by design — especially when GDPR enforcement was ramping up, and we operate in European geographies where GDPR is mandatory — it was a fun exercise to start by looking at every single system that stores customer data or any information deemed private. In doing so, like I said, it's easy to identify data that sits in a structured manner, because then you can implement uniform controls. The real challenge was unstructured data, and I'm going to give you some examples in a bit. The other blind spot I discovered during this process was the inventory of systems that actually store, process and transmit data — and once you look at the examples, you will realize these are systems we don't even consider when we talk about data governance, data privacy or privacy by design. I will also touch upon some examples of masking, tokenizing, anonymizing and encrypting data. If I were to summarize all the blind spots we have come across through our practice of privacy by design: you cannot protect something that you don't know exists. It's common sense — if you are responsible for protecting an organization from external threats, you cannot protect it without knowing its exposure on the internet. That's just my way of summarizing the different blind spots.

So let's move on and talk about what unstructured data is. A very simple definition: if it does not have a predefined data model or schema, to me it's unstructured data — and I'm sure that holds as a general definition for any kind of unstructured data. Some examples: sensitive data exchanged over emails, documents and text files — it could be anything, depending on your function in the organization. If you support a product and deal with a lot of customer questions, you exchange a lot of data over emails and documents, and it's very difficult to put a structure around it. Sensitive data stored in public S3 buckets: you can find many examples on the internet of people finding sensitive data about an organization in a public S3 bucket that does not have the correct ACLs configured and is therefore accessible to anyone. Data shared on social media — this one is very interesting. A lot of my examples in this presentation are relevant to my own organization; for example, these days we often exchange information with our suppliers in public. If you want to complain to your electricity provider, you go ahead and tweet at them, and in doing so you might end up giving them a lot of information that, in principle, you should avoid sharing.
If you're tagging your friends, or tagging images that may have location information embedded in them — all of this is data shared on social media. Then there is data shared indirectly with suppliers or partners. You may know what your organization shares directly, because there is some form of legal binding between you and your supplier, but what about the data you share indirectly? You start an initiative saying, "this is the data that will be consumed as part of my product lifecycle," but along the way other information is collected too, which makes your data richer and richer from a privacy point of view. Facebook is a classic example: it's not just you giving information to Facebook; everybody who advertises their products on Facebook is part of the exchange, and Facebook gives them your information. As a result you see a very tailored experience, with pointed ads coming up on your Facebook pages.

So let's break this down. If you think your data is only stored within an RDBMS, think twice. Here I'm going to give you some real examples, along with samples of how these things can have privacy implications. In my experience, developers are often never exposed to the entire lifecycle of the product. As a developer writing code for a product, I may have no visibility into how these systems are exposed to the internet, what kind of logging is configured for the applications, what analytics trackers are implemented, or what other activities my business carries out around the product. Because developer visibility is minimized, it leads to blind spots in a lot of areas, and we are going to talk about some of them. Poorly designed APIs are a very common way of leaking sensitive information about your customers, or anything deemed potentially sensitive by your organization. Query strings — there are some very interesting examples that I will give you in a bit. Data analytics systems — I'm not sure how many people in the audience are familiar with full-blown analytics systems that can record an entire user session in the browser and give your business the ability to replay that session and come back with a more tailored experience; in my experience this area has been a big privacy nightmare when not implemented correctly. And finally, something developers are often blindsided by: the logs that get stored in web servers, proxies, load balancers, error logs, web application firewalls, et cetera.

Everything I talk about from now on relates to this slide, and I've structured it as a case study that we will keep referring back to throughout the presentation. Please feel free to raise questions if you have doubts or want to add something. So this is what my case study looks like. The business requirement is very simple:
as a developer, I need to design an API that returns all the attributes I hold about a given user after successful authentication. Consider a system like Facebook: plenty of websites offer "log in with Facebook" or "log in with Google". When you click "log in with Facebook", it returns a very specific data set and asks for your approval before that data is shared with the actual vendor or website using the feature. The other requirement is that a number of partners may consume this API, and each partner needs access to different attributes within that data set. For example, if I store 100 attributes about a user when they register on my website, partner one needs access to only five, partner two needs access to six, and so on. That is the business requirement — so let's look at possible ways to design an API like that.

The first option: I create one API that sends all the user details upon successful authentication, and every partner picks whatever fields they want. In other words, if I have 100 attributes about a user, then — because there is an NDA, a non-disclosure agreement, signed with the vendor — I have no problem giving them all 100, and the vendor in turn is expected to use only the ones they need. The big advantage is time to market: I can tell my business there is one API they can give to every partner, supplier or vendor they work with, so my time to market is short, my overhead is low, and the turnaround is quick. Even the slightest modification is made in one place and applies to everybody. The biggest problem is that it is not privacy friendly. Say one vendor only needs five fields from the API response; because you give them everything, nothing stops that vendor from iterating through your user base and storing all 100 attributes about those users. What they do with that data is a separate discussion, but it is a real area of concern if you are a privacy-minded organization.

The second option is to create multiple versions of the same API, one per partner, and send each partner only what they have signed up for. This is a good approach from a privacy standpoint: even if I have 100 attributes, if one partner needs five I send them five, if another needs six I send them six, so I have full control over what I share with each vendor. The biggest problem with this approach is time to market: if your business follows agile practices and onboards new partners frequently, this becomes time consuming, because for every new partner you have to create a new version of the API and map the fields they require.
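Whichever way you slice it, the core mechanism is a per-partner mapping of allowed fields. Here is a minimal sketch of that idea — a single endpoint filtering its response through a per-partner allow-list. The partner IDs and attribute names are made up for illustration; it is also, in effect, a preview of the "views" approach described next.

```python
# Hypothetical per-partner allow-lists: each partner only ever receives
# the attributes their contract covers, even though the user record is wider.
PARTNER_FIELDS = {
    "partner-1": {"first_name", "last_name", "email", "city", "language"},
    "partner-2": {"first_name", "email", "loyalty_tier", "opt_in_marketing",
                  "city", "country"},
}

def response_for_partner(partner_id: str, user_record: dict) -> dict:
    """Return only the fields this partner is allowed to see."""
    allowed = PARTNER_FIELDS.get(partner_id)
    if allowed is None:
        raise PermissionError(f"unknown partner: {partner_id}")
    return {k: v for k, v in user_record.items() if k in allowed}

user = {
    "first_name": "Asha", "last_name": "Rao", "email": "asha@example.com",
    "city": "Pune", "country": "IN", "language": "en",
    "loyalty_tier": "gold", "opt_in_marketing": False,
    "passport_number": "Z1234567",   # never leaves the system
}
print(response_for_partner("partner-1", user))
```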
Of course, a separate version per partner also means more overhead and longer turnaround times — something your business may not appreciate. Considering the technologies available in the market, the third approach is to create one API but build multiple views on top of it, one per partner. I'm not sure how many of you are developers, but if you know technologies like GraphQL, or Facebook's Graph API, this is exactly what they do: one API, multiple views on top, and depending on the partner ID you only send the data that is required. I'm also not sure how many of you are familiar with the whole Cambridge Analytica and Facebook scandal — Facebook takes this approach with its APIs, one version with multiple views built on top, and that is how Cambridge Analytica was allowed to pull a lot of information about Facebook users that would probably not have been given to others. The pros of this approach: it's efficient, it's scalable, and it's privacy friendly, because you only give access to the data that is required. The only challenge — and this is the lesson from the Facebook and Cambridge Analytica scandal — is governance and oversight. There is no way for you to know what data is actually accessed by a given vendor unless you have the right checks and balances in place, reviewing the responses sent to individual partners on a periodic basis. So the cons are governance and oversight. But governance and oversight are comparatively easier problems to solve once you know the underlying platform you have built does the job it was designed for: allowing you to create views and extend only the information that is required. So this is a blind spot we have often come across: you design APIs, but those APIs send back information that was never intended for sharing and was not part of your agreement with the vendor or supplier.

This next one is personally my favourite. There are a lot of functionalities in your applications that require deep links. For example, many online services have an "I forgot my password" flow: they email you a password reset link containing a token. When you click on that link, it takes you to the password reset page. If you look carefully, there is a code in the URL which is nothing but the password reset token; if the token is valid, the page lets you reset the password. This is how password reset functionality is typically designed: you send an email to your customer, the customer clicks the link and lands on the page, and if the token is valid, you allow them to reset the password. Now, what actually happens at the back — and this is where developers often don't realize the privacy implications — comes down to the browser security model, which simply means that anything you send in the query string (a) will always be visible and (b) will always be logged by the web server.
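To make that concrete, here is a rough sketch of how trivially those reset codes can be harvested by anyone who can read an ordinary access log. The log format and the parameter name (`code`) are assumptions for illustration.

```python
# Sketch: pull password-reset codes out of ordinary web-server access logs.
# Assumes a combined-log-style line and a query parameter named "code".
import re
from urllib.parse import urlsplit, parse_qs

sample_log = """\
203.0.113.7 - - [12/Mar/2021:10:01:22 +0000] "GET /reset-password?code=8f3a2b HTTP/1.1" 200 512
203.0.113.9 - - [12/Mar/2021:10:02:05 +0000] "GET /index.html HTTP/1.1" 200 1042
"""

request_re = re.compile(r'"(?:GET|POST) (\S+) HTTP')

for line in sample_log.splitlines():
    match = request_re.search(line)
    if not match:
        continue
    query = parse_qs(urlsplit(match.group(1)).query)
    if "code" in query:
        print("reset code leaked in access log:", query["code"][0])
```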
So let's say you have two parameters that take a username and a password, and you send them in a GET request: whatever is in the query string of that URL will eventually get recorded as part of your web server logs, which means whoever has access to those logs could also have access to the password reset codes. There are two important problems with this approach. First, when the page loads, the entire link, along with the code, is submitted to any trackers implemented on the page. What does that mean? If you are building a product where you want to track user experience — for example, how many users actually use the password reset functionality, and why — you put a tracker on the page, and every time somebody visits, it collects certain information available in the DOM and submits it to the tracker. That tracker is usually not designed by you; the most common and widely used marketing trackers on the internet are Google Analytics, DoubleClick and Criteo, and there are plenty of other providers. These trackers also have the ability to inject JavaScript into your browser if that's what they've been designed for — another area where you have to be very cautious about what they are capable of injecting back into your pages. The second common oversight is the referrer: on this page, when you click the Save button, whatever is in your browser's URL bar gets submitted as the referrer to the next page. To spell it out: the link on the left lets you reset the password; clicking it takes you to the page on the right; and when you click Save, whatever page you navigate to next receives that reset link as the referrer.

From a privacy point of view, a number of things can go wrong here. Maybe your password reset functionality was not implemented correctly, so the token — the code in this case — is not expired once the user clicks the link. Or maybe your business is so keen on giving users the best experience that the codes live for longer than 24 hours, meaning I can request a password reset and the link stays active for the next 24 hours. Even after the user has clicked the link, that link carries data that is then submitted to a tracker, and anybody with access to the tracker potentially has access to your reset links as well. As a developer, or as an organization, anything that leaves your boundary of systems is out of your control; what happens to that data is very difficult to govern, and as a result it becomes very difficult to implement privacy-by-design controls on such pages. So the blind spot is this: data sent in the query string will always be present in the web server logs, leaked via the referrer, and submitted to any trackers implemented on the page, as long as the request is visible in clear text.
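A hedged sketch of the mitigations just described: issue short-lived, single-use tokens, store only a hash of them on the server, and invalidate them on first use. The expiry window and helper names are illustrative, not a prescription.

```python
# Sketch: short-lived, single-use password-reset tokens.
# Only a hash of the token is stored, so log or database access alone
# is not enough to take over the reset flow.
import hashlib
import secrets
import time
from typing import Optional

RESET_TTL_SECONDS = 15 * 60          # illustrative: 15 minutes, not 24 hours
_pending = {}                        # token_hash -> (user_id, expires_at)

def issue_reset_token(user_id: str) -> str:
    token = secrets.token_urlsafe(32)
    token_hash = hashlib.sha256(token.encode()).hexdigest()
    _pending[token_hash] = (user_id, time.time() + RESET_TTL_SECONDS)
    return token                     # email this to the user

def redeem_reset_token(token: str) -> Optional[str]:
    token_hash = hashlib.sha256(token.encode()).hexdigest()
    entry = _pending.pop(token_hash, None)   # single use: removed on first attempt
    if entry is None:
        return None
    user_id, expires_at = entry
    return user_id if time.time() <= expires_at else None

t = issue_reset_token("user-42")
print(redeem_reset_token(t))   # 'user-42'
print(redeem_reset_token(t))   # None - already used
```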
I'm not sure how many developers here are familiar with systems like AppDynamics, Quantum Metric, FullStory or IBM Tealeaf. These are systems that let you add a tracking snippet to your web page so it can record the entire user session; they convert the session into a stream of HTTP requests and post that information to their backend. From a business point of view, the motivation is that you never want to lose an ounce of business just because a user had an unintended or unexpected experience on a page. A simple example: it's very common in the airline industry, or on any e-commerce platform, that a user puts something into the cart, goes all the way to the payment page, and for whatever reason cannot complete the payment. Your business definitely wants to know this, because they don't want to lose that revenue. When somebody on the business side identifies these events, they will try to tailor the experience so that the next time that user logs in, they can continue the journey from where they stopped. These systems are very useful and very powerful, but at the same time they are a massive privacy nightmare.

I've taken a video of one such tool, FullStory, to give you a feel for what these systems look like. On the left is a simple page with a tool like FullStory implemented, and on the right you can see how the information the user enters on the left gets recorded as an HTTP session. As the user modifies any information, the snippet implemented on the website collects all of it in near real time. I've also included a link in case anyone wants to look at it later. From a privacy point of view, these systems are a big nightmare if not implemented correctly: you can see that they capture not only passwords but also card numbers and a bunch of other things, recorded almost in real time.

Why are we discussing these systems? Because of these incidents. I'll let you read the links after the session, but at a very high level I can tell you what the issues were, and my team at work has tested a similar system and come across some very interesting observations. Both incidents are good examples of how unintended data can get captured by the vendor, or by the owner of the website — Facebook, Flipkart, Amazon, whoever it is — and then the question is what they do with this data at the back. All of these systems have one thing in common: they actively talk about the privacy controls built into their product. But at the end of the day, you can only trust them for what they say.
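These vendors do ship masking and exclusion settings, but one defensive habit is to scrub anything that looks sensitive on your own side before a captured payload is stored or forwarded. A rough sketch under assumed field names — not any vendor's actual API:

```python
# Sketch: redact obviously sensitive values from a captured form payload
# before it is persisted or forwarded to an analytics backend.
import re

SENSITIVE_KEYS = re.compile(r"(password|passwd|card|cvv|ssn|token)", re.I)
CARD_LIKE = re.compile(r"^\d{12,19}$")   # crude "looks like a card number" check

def scrub(payload: dict) -> dict:
    clean = {}
    for key, value in payload.items():
        text = str(value).replace(" ", "")
        if SENSITIVE_KEYS.search(key) or CARD_LIKE.match(text):
            clean[key] = "[REDACTED]"
        else:
            clean[key] = value
    return clean

captured = {
    "email": "asha@example.com",
    "password": "hunter2",
    "card_number": "4111 1111 1111 1111",
    "comments": "please deliver after 6pm",
}
print(scrub(captured))
```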
In my experience, yes, it's good to trust and adopt these kinds of implementations, but it's also very important to verify the claims being made. Verification can be as simple as setting up your app, integrating the tracker, and then doing an assessment or review from both sides: while you actively submit data from the app, you watch what happens on the other side, and then you look for ways to compromise that piece of information or gain unauthorized access to it. That's how you tell how foolproof the technology or solution really is. I would highly encourage you to read these articles, because they give some very interesting insight into how different organizations have implemented this. The first example, if I recall correctly, relates to Air Canada, which had a similar technology implemented in its app; a privacy researcher then identified gaping holes in the way it was implemented. The Wired article talks about how Google and Apple are working closely with their developers to ensure that if any such technology is implemented in an app, it is fully declared, and it follows rules and guidelines that make the user aware of what data is captured from their screens.

Moving on, this is another area where developers are often blindsided. They come up with designs for GET or POST requests, or any RESTful API, but when they send data in the query string they don't realize it is potentially getting logged everywhere SSL is intercepted or offloaded. What that means: you have perimeter technologies — F5, Akamai, or any other web application firewall provider in the market — and you need to offload your SSL onto those devices so they can examine requests for potential anomalies and, if everything looks good, forward the request on to the actual origin server. In doing so, anything potentially sensitive that you pass will end up in the logs of those devices. I've listed some more technologies: very large organizations that give their employees internet access through proxies, for example, often SSL-intercept the traffic, so they know everything going in and out of the network. Again, if you're passing anything sensitive in the query string you want to think twice, because it is going to get logged. And getting logged is only one part of the problem: if any of these implementations has authentication or authorization issues, it becomes a real nightmare, because someone can extract that request from your proxy and replay it on behalf of the user. Similarly, you have layer-7 or application load balancers these days, along with the advent of cloud web application firewalls. These are very interesting devices: if your product, consumed by different people on the internet, comes under any kind of attack,
and you have a web application firewall in front of it, you will always be able to see what the attacker is actually doing. You can go through the packets, understand the attack, and answer questions like when the attack started, which vulnerability was being exploited, what was accessed, what was exfiltrated, and what was touched. Those are the kinds of questions you can always answer with application-firewall-like devices. The flip side, of course, is that these are traffic-capturing devices: if your user or customer base is compromised through phishing campaigns, the credentials they submit pass through them too. And consider an example like the British Airways hack from a couple of months ago, where somebody injected malicious JavaScript into the payment page itself. What that script essentially did was log every single keystroke of the user: you, as a customer, go to the website and enter your details to book a ticket, and while you type, that piece of script embedded on the page collects all of the information and shares it with the attacker behind the scenes. These attacks became well known by the name of the crew behind them, Magecart; they compromised a number of e-commerce websites through vulnerabilities like cross-site scripting, injecting hostile scripts that harvest information as users type it into these websites.

All right, let's take a quick deep dive into masking, tokenizing, anonymizing and encrypting. From a data privacy point of view these are very important terms, and I can illustrate them from experience. Say your organization wants to host a hackathon, and as part of it they want to share some data with the participants. The goal is for participants to come up with a creative app, service or feature on top of your product — something the business can eventually generate revenue from, using the analytics captured from the website. A similar situation: you have a data science team responsible for using machine learning algorithms to give your business teams insights so they can make educated decisions — forecasted sales, forecasted inventory, or the cost benefit of opting for a certain kind of service. These are all genuine use cases. The most common mistake I've seen is that people freely interchange the definitions of masking, tokenizing, anonymizing and encrypting. For example, I often talk to developers for whom encryption seems to be the silver bullet that solves all problems. That is not true. Encryption does not solve all the problems.
More importantly, encryption may not even be the right control for that particular use case; you end up over-engineering something and making it complicated. Similarly, I've often seen people mix up the definitions of masking and tokenizing. I'll give you some examples as we move through the slides. The most common errors I have witnessed with data masking are these. First, consistency. Say you have a website and a mobile app, and one API consumed by both. You decide to mask certain sensitive information — for example, if the website lets you store credit card details, it shows only the last four digits of the card number — but on the mobile app, for whatever reason, you either don't mask at all or you mask the first four characters instead of the last four. These inconsistencies can introduce real problems. Amazon, for example, shows you the last four digits of your credit card number, and Apple back in the day used those same last four digits to verify a customer. So something Amazon did not consider sensitive enough to hide was exactly what Apple relied on to verify identity — and inconsistent masking like this creates privacy issues well beyond a single product. Second, patterns: using different character sets, or masking on the client side, so that a right-click, Inspect Element or View Source reveals the real text. Third, unmasked data in logs: PCI, for example, requires you to log certain kinds of financial transactions for audit purposes, and if that data sits in the logs in clear text, you have another privacy issue. I have included some links we have found useful at my workplace: whatever your technology stack, there will be options for masking, for implementing custom field-level masking, or for encrypting the columns you consider sensitive.

On tokenization: I would always question a business requirement that asks you to capture sensitive information for analytics, but there are genuine requirements for capturing such data, and then the approach should be very simple. When you need to be able to rebuild the data set, tokenize it; when you only care about the structure of the data, anonymize it. Take the hackathon example: you don't need to tokenize that data, and you don't need to encrypt it — that would be pointless — but you can anonymize it, replacing a passport number with a similarly formatted string, or a PAN card number with a string in the same format. Those are the two simple rules we've adopted: if you want to rebuild the data set, tokenize; if you only care about the structure of the attribute, anonymize.
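Here is a minimal sketch of those three operations side by side, using made-up formats and an in-memory token "vault". Note that none of these is encryption, which is reversible only with a key and should come from a vetted library rather than anything hand-rolled.

```python
# Sketch: masking, tokenization and anonymization of the same values.
# Formats and the in-memory "vault" are illustrative only.
import random
import secrets
import string

_vault = {}  # token -> original value; in practice a separately secured store

def mask(card_number: str) -> str:
    """Masking: hide all but the last four digits for display."""
    return "*" * (len(card_number) - 4) + card_number[-4:]

def tokenize(value: str) -> str:
    """Tokenization: replace the value with an opaque token that can be
    mapped back to the original via the vault (data set can be rebuilt)."""
    token = secrets.token_urlsafe(16)
    _vault[token] = value
    return token

def anonymize(passport_number: str) -> str:
    """Anonymization: keep the format (letter + 7 digits here) but lose the
    link to the original person entirely."""
    return random.choice(string.ascii_uppercase) + "".join(
        random.choices(string.digits, k=7)
    )

card = "4111111111111111"
print(mask(card))                 # ************1111
print(tokenize(card))             # opaque token, reversible via the vault
print(anonymize("Z1234567"))      # same shape, not reversible
```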
Last but not least, data encryption — we've touched on this already. Encryption applies to data in transit and data at rest: essentially everything that ends up in backups, data stores, data warehouses, data lakes and so on. The most common mistakes I've seen developers make are these. They encode a piece of data and call it encryption. They choose the wrong type of cryptography, using symmetric where asymmetric is called for and asymmetric where symmetric is called for — and these mistakes do have privacy implications. I've also seen developers write their own encryption routines, because all of this knowledge is available on the internet. That is something I would never encourage any developer to do: encryption algorithms become standards because they meet complex mathematical properties that protect the integrity of the data, and anything you come up with that does not meet the same standard is not worth spending your time on. The other challenge I've often seen from a privacy point of view is that you can't have 100% security and 100% privacy at the same time. It's always a balanced call between the two, and you have to draw a middle line — deciding where you stop, what you normalize and what you leave denormalized.

This is my last slide — nearly. I want you to look at a very high-level process that works well for us. To the left you see an autonomous team with representation from different parts of the organization. These people come together and carry out an activity we call data cataloguing, or data classification: you identify the data that matters from a privacy point of view and then classify it. Eventually you feed all of this back into the data governance framework, and you will notice the arrows go both ways, because this is a bi-directional process. At the top you see continuous assurance, which means that, since all of this is always a moving target, you need a feedback loop — running the right audits and surfacing new blind spots throughout the life of the data itself.

Lessons learned — this really is my last slide. Catalogue all your data; you cannot protect what you don't know about. Discover privacy-related data in your existing systems: this is a very important activity, and there are plenty of tools on the internet, including one I use personally, to identify such data across systems. Don't store data you don't need. Once you've catalogued your data and created a framework, define technical controls to protect sensitive data. And the continuous assurance you saw in the diagram is summarized in one line: build, measure, learn, feed back, and loop. In the privacy world it's very important to trust, but also to verify — and the only way to verify is continuous assurance. All right.
So that's pretty much all I had to share with you today. Thank you.

Thank you for the opportunity to discuss this article on privacy and unstructured data. I think it's a great starting point, and Hasgeek is doing the right thing by reaching out to community members like Scribble Data to comment on it, because it deserves a wider conversation. I want to bring in one dimension that provides context for the whole article, and that is the economics of privacy — because the question is, why should companies care about any of this? Let me first make two or three points about the nature of the problem we are dealing with and why the approaches discussed in the article make sense.

The first point is that the cost of generating data and sticking it in email or S3 is almost zero, while the cost of discovering, managing, cleaning and organizing all of that information is very high. Every organization is now dealing with this cost asymmetry, and the problem is only going to get worse unless you apply ideas, approaches and principles to deal with it. The second point about this problem space is that it is a long tail of problems. Even something as seemingly simple as anonymization is data dependent, context dependent, use-case dependent and skill-level dependent. It is not a simple decision to make, and that is inherent in the nature of the space: a long tail of complicated problems that can absorb any number of FTEs you throw at it. Simple techniques are not going to work; you are looking at investing in tooling, processes, education, applications and so on. The third dimension of the economics is that the cost of cleaning up any piece of data increases dramatically with time. That too is inherent: you start losing context, and a lot of work is required just to rebuild that context so you know what the appropriate action is in any given situation.

So in general we are dealing with very tough economics, and as we go through our organizational processes every day, we tend to externalize a lot of this cost. We export data and send it over email to somebody without realizing that the organization will pay the privacy cost one way or another — in the form of leaked data, or in the form of compliance later on. We have to figure out how to fundamentally alter this economics, the trajectory of the privacy cost, so that it becomes manageable, and there are two or three ways of doing it. The first is providing the right kind of tooling. For example, if a person is exporting data and sending it over email, there is a need for a sharing interface somewhere — maybe constructing a cohort and sharing that, maybe building a small application that eliminates the need to share raw data at all, or at least having structured export interfaces that make sure the data going out is actually kosher. Interestingly, the feature store was initially meant for machine learning applications; increasingly we are finding that there are privacy applications as well, as we have all become more and more privacy conscious.
One of the big reasons to put a feature store between the raw data and the end users is so that the data set being consumed is actually kosher from multiple angles — privacy risk among them. The second point I want to make is that we eventually have to acknowledge and explicitly manage this cost. I used to think of SOC 2, GDPR and the rest as a burden, but the way I see them today is as a forcing function. They force you to ask the tough questions you have always been postponing, and they force individual accountability — ultimately the CEO is signing off, ultimately the CISO is signing off. That means there is now an individual who, any time you have a workflow, will actually ask why you are doing it and whether there are better ways of doing it. It is a way to surface the hidden costs in your day-to-day data processes.

This is a longer conversation. I would love to see the article expanded: there are several threads that could be developed, not just the technical mechanisms like tokenization and API security, but also the other elements like cataloguing, discovery and governance flows. And the big thing I am looking for is this: larger organizations have dedicated teams to deal with all of this, but I am looking for ideas on how smaller companies, with far fewer resources, can cope with this problem space — through clever thinking, through planning ahead of time, and through discipline in their tools as well as their processes. So I am looking forward to more from the author, and from this particular thread within Hasgeek itself. Thanks for the opportunity to discuss the article.