Hi folks, welcome to this presentation on data anonymization in an offline data lake. My name is Pratap, and I work as a platform engineer. One of my colleagues, who is also a platform engineer, will be accompanying me, and together we will walk you through the whole anonymization process and the various good practices you can take away from this presentation.

The agenda is as follows. First, we are going to talk about GDPR-compliant access. We will also give you a glimpse of how big the data ecosystem is, so that you can appreciate the problem at scale. Next, we will introduce some of the tech stacks we have leveraged, which are Dali, DataHub, and of course Apache Gobblin. We will then walk you through the processes around read-time obfuscation and write-time obfuscation, and if time permits, we will talk about differential privacy. We will also cover some of the challenges and some of the monitoring aspects that you can probably leverage in your own projects.

With that, GDPR is a regulation that came into effect around 2018. The General Data Protection Regulation gives more control to users, especially those of EU origin. There are various guidelines and rules imposed by GDPR with respect to data privacy, and any company that fails to meet those privacy standards can be fined up to 4% of its annual turnover. The reason we are emphasizing GDPR is that other compliance regulations, such as CCPA, IDPC, and any intra-country policies that might be applicable to your data, are also covered by this design. GDPR covers a broad set of rules and policies, and other policies like CCPA and IDPC are, in a sense, a subset of GDPR; hence we are emphasizing GDPR here.

The interpretation of PII provided by the GDPR council is, in a nutshell, that any record or attribute that can uniquely pinpoint a user is treated as PII. So there should not be any attribute in your data that can uniquely identify a customer or a user; anything of that sort becomes PII. By that definition it might look small, maybe a five- or six-line definition, but once you start looking at the PII aspects, the scope is quite large. Some of the examples we have laid out here are the city, the state, the region, the device ID, the IP address from which a user logged in, the user's email ID, any sort of logs captured about the user, and the user ID. All of these become PII. At LinkedIn, we have nearly 50+ PII attribute types that classify our data as PII, and as and when we find more such fields, we add more PII attributes. What you are seeing here is just eight of those many PII attributes.

Now, GDPR rightly says that every user's data has to be protected, and the core values of LinkedIn and Microsoft resonate very nicely with that. What we believe in is respecting our members' data privacy; it is important for us to maintain their trust.
Secondly, we also want users to be aware, or cognizant, of the sort of data we are collecting and how we use it. LinkedIn believes very strongly in its mission statement, and if you observe our core values, our members come first and every engineer at LinkedIn acts like an owner. So the GDPR guidelines align well with our core values, and what we are going to see is a reflection of both GDPR and CCPA and our core values.

So why anonymization? On one side, one of the GDPR articles states that pseudonymization or encryption needs to be applied on top of your data. On the other side, we also want to have all the tools and processes in place to protect our users' data, either by encrypting the data at rest, encrypting the data in transit, or always serving only the anonymized or encrypted view to end consumers of the data. That also reduces the probability of any data breach; if an error does occur, it almost nullifies the impact of a breach. As we solve the anonymization problem, and I am sure other companies are attacking the same problem, what we are going to speak about in this presentation are some of the good practices that can be followed, along with the prerequisites for following those practices.

In order for you to understand the diversity of the data, we will give you a glimpse of what the data ecosystem at LinkedIn looks like. Our members' data is everywhere: your profile details are stored in some system, along with your profile picture, your career background, your job change history. There are systems tracking audit events, systems storing information as documents, and key-value stores persisting your data in some way. So several heterogeneous sources of data exist at LinkedIn. What we are going to focus on in this presentation is just the Hadoop angle, which is the offline angle. There could be other talks that cover data anonymization in the source systems, but in this talk we are only going to emphasize anonymization in the Hadoop ecosystem, the offline data lake.

As you see here, we have campaign management systems, key-value stores, document stores, our own front-end services emitting a lot of events, and batch or derived data, all represented in this LinkedIn ecosystem. Eventually all of this data lands on Hadoop; the data lake is built on Hadoop, and its scale is in the order of several thousands of nodes. So that is the scale we are dealing with: the data footprint is huge and the data diversity is very vast.

So what is the problem with such a big ecosystem? One, you have heterogeneous data. We saw structured data, logs data, events data, document and key-value store data. The variety, the volume, and the velocity of the data is what you can clearly see from the previous diagram.
All of this heterogeneous data eventually has to land on Hadoop for analytics, AI, or other data mining needs. Secondly, data access is also different for different datasets. Datasets that contain highly confidential information get essentially zero access. But having said that, how would we determine whether a dataset holds highly confidential information? That is another problem.

Then, for the same dataset, you have different views. What we mean by that is: there is an HDFS dataset; you can have a Hive registration on top of that HDFS dataset, one can have a managed Hive table for that dataset, in Spark you could have an RDD or a Dataset over it, and Presto can also sit on top of HDFS and be used for your computational needs. So there are different views of the same data. The semantic variability is also huge: users can choose Spark for their computation, Pig is another way, Hive is another way, MapReduce is another way. There is no single, hard-and-fast way of accessing your data.

The next thing is customization. There are legal aspects that come into the picture, where for some legal needs you want to retain your data, for some legal needs you want to restrict data access, or you just do not want to remove any data unless and until those legal obligations are met. Those customizations can also be imposed on your data. And needless to say, we have different file formats; teams choose to use Avro, ORC, or Parquet, so there are heterogeneous file formats as well. And of course the sheer volume of data is also huge. All of these parameters have to be considered, and addressed, in order to ensure that we have compliant data.

So this is the representation of the heterogeneity. You have HDFS, and it is at the sole discretion of the user to pick the acceptable tools and utilities. A company may have provisioned Pig, Hive, Spark, Presto, and so on, and the end users are free to use any of these engines. Now, with each of these tech stacks, the way the readers are implemented, the way the data is stored, and the way you access your data varies from system to system. That is one problem. The other problem is that once you provide these systems and users start using them, users may hard-code file formats, they may write their own custom UDFs to access a certain HDFS path or a Hive dataset, and there can be very tight coupling with Avro or ORC or Parquet. Data model evolution is also very hard to handle when you have such things. So it is a hard problem.

Now, how do we really mitigate such a problem? Here is the 10,000-foot view. One way of solving this problem is to have a unified access layer, a unified data access layer, on top of HDFS, so that every application that accesses data from HDFS has to go through this unified data layer.
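To make the fragmentation concrete, here is a small, hypothetical sketch (using Spark's Java API) of the same logical dataset being reached in different ways, each hard-coding a storage detail that a unified access layer like the one described above would hide. The paths, table names, and column names are invented for illustration only.

```java
// Illustration of the fragmentation problem: the SAME logical dataset can be reached
// in several hard-coded ways, each coupling the user to a storage detail
// (path, file format, table name). Paths and names below are made up for this sketch.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FragmentedAccess {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("fragmented-access-demo")
                .enableHiveSupport()
                .getOrCreate();

        // 1. Direct HDFS path access, hard-coded to the Avro format and directory layout
        //    (requires the spark-avro module on the classpath).
        Dataset<Row> viaPath =
                spark.read().format("avro").load("/data/tracking/page_view/daily/2024/01/01");

        // 2. Access through a Hive-registered table over the same files.
        Dataset<Row> viaHive =
                spark.sql("SELECT * FROM tracking.page_view WHERE datepartition = '2024-01-01'");

        // 3. User code that assumes specific column names, which breaks on schema evolution.
        viaPath.select("memberId", "ipAddress").show(10);
        viaHive.select("memberId", "ipAddress").show(10);

        spark.stop();
    }
}
```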
Having such a unified access layer is one of the easier ways to solve the problem. If we can solve unified data access, it is a major milestone, a major hurdle cleared, in the whole compliance story. What helps us create this unified data access layer is Dali, which stands for Data Access at LinkedIn. For more information on Dali, you can go through the LinkedIn engineering blog, which gives more insight into what Dali does.

Essentially, what Dali solves for us is this: we have heterogeneous input formats, different file formats and table formats, so how can we unify all of this? If you look at database management systems, this problem was solved a long time ago. In any DBMS, how the partitioning is done, how the data is laid out on disk, how the statistics are calculated, where the stats are stored, and what data structures are used to access your data are all abstracted away. All the end user is cognizant of is the database and the table; all the low-level details are hidden, and the end user always gets a view of the data in a row-and-column format. That is exactly what Dali is trying to abstract, or unify, for us.

Dali is Dali tooling and Dali readers put together. Say there is a Hive dataset and the end user wants to access it using Spark, or Pig, or Hive, or some other scripting engine. Dali holds the mapping to the underlying Hive table or HDFS dataset; the end user queries the Dali dataset, and Dali tooling queries the underlying HDFS dataset. So the Dali layer plays the role of an abstraction layer. Now, if the organization were to move from one file format to another, let us say from Avro to ORC, the end users would not need to handle that file format change themselves, because Dali has already abstracted it; all they need to do is use the Dali readers, and Dali is intelligent enough to pick the right reader for them. Likewise, if your data were to be re-partitioned or migrated to a new table format, Dali would be enhanced to accommodate the new table format, and the end user would not be exposed to any of those changes happening in the system. That is how Dali solves the problem.

So now there is a Dali layer on top of HDFS, and any user who wants to apply a transformation or read the data has to go through this Dali layer. In a nutshell, all we have to do is inject a library at this Dali layer, and compliance can be achieved just by plugging the compliance library into Dali. When a user queries a given dataset via Dali, the data passes through this compliance library, and the outcome is compliant data.
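As a rough illustration of what such an injected compliance library might do, here is a minimal Spark-based sketch that nulls out, obfuscates, or drops columns based on a per-column action map. The class, the action names, and the choice of a SHA-256 hash are assumptions made for this example, not the actual library used at LinkedIn.

```java
// Hypothetical sketch of a compliance library injected at the access layer: given
// field-level annotations (here a plain map standing in for catalog metadata), it
// nulls out, obfuscates, or drops PII columns before rows ever reach the user.
import static org.apache.spark.sql.functions.*;

import java.util.Map;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public final class ComplianceLibrary {
    public enum Action { NULL_OUT, OBFUSCATE, FILTER_OUT }

    public static Dataset<Row> apply(Dataset<Row> raw, Map<String, Action> piiColumns) {
        Dataset<Row> result = raw;
        for (Map.Entry<String, Action> e : piiColumns.entrySet()) {
            String column = e.getKey();
            switch (e.getValue()) {
                case NULL_OUT:
                    // Replace the value with null but keep the column for schema stability.
                    result = result.withColumn(column,
                            lit(null).cast(result.schema().apply(column).dataType()));
                    break;
                case OBFUSCATE:
                    // One-way hash so joins on the column still work, but the raw value is hidden.
                    result = result.withColumn(column, sha2(col(column).cast("string"), 256));
                    break;
                case FILTER_OUT:
                    // Drop the column entirely.
                    result = result.drop(column);
                    break;
            }
        }
        return result;
    }
}
```

A caller would hand in the raw Dataset plus a map such as {"ipAddress" -> OBFUSCATE, "email" -> NULL_OUT}, and only the returned, compliant Dataset would ever be exposed to the user.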
Compliant here means that all of the attributes associated with one of the 50+ PII types we defined at the beginning automatically get anonymized, and the user only ever gets an encrypted or anonymized version of the data; the raw data stays encrypted at rest and is never exposed to the user. What this compliance library essentially does is either null out, obfuscate, or filter out any of the sensitive fields in your dataset. The intelligence of how that is done is what matters in your compliance library. So this is one strategy for achieving compliance, and Dali helps us solve that problem.

The next piece of the tech stack that helps us solve the compliance, or anonymization, problem is DataHub. DataHub is the v2 of WhereHows, which was open sourced a few years back, pre-GDPR. So even before GDPR we had data lineage in place, open sourced as WhereHows, and when GDPR and the other compliance regulations came in, we were ready to enhance the WhereHows suite with a means of annotating the data. What DataHub essentially does is hold the schema metadata for any given dataset; it unifies all sorts of diverse datasets and their schemas and organizes them into a common format. The level of detail stored in DataHub goes down to the field level. So for any given dataset you know the lineage of the dataset, you know the annotations against the dataset, and you have the compliance type. If a right to erasure needs to be enforced on top of the dataset, that metadata is also available in DataHub. The DataHub post on the LinkedIn engineering blog covers more aspects of DataHub, but in short, the DataHub data catalog provides the metadata we need to determine whether a dataset holds PII or not.

So we have covered two aspects: one, the unification layer, the data access layer that sits on top of HDFS, which is step one of nailing the compliance problem; and two, DataHub, which acts as the input providing the metadata, that is, whether a dataset holds PII information. With these two in conjunction, we can solve the compliance problem. Alongside, we will also see how we leverage Gobblin. Gobblin is an Apache project, currently in the incubation stage. Gobblin acts as a data ingestion pipeline and is effectively the backbone of data ingestion at LinkedIn; it is an extraction, transformation, and load layer that can read from various sources and write to various destinations. We will see how Gobblin, in conjunction with DataHub and Dali, also helps with anonymization.

Now, here is how the compliance transformations look under the hood. This uses a framework that is already provided by Hive, and we are leveraging that framework; this is one of the strategies. There is an operator tree: once a query is executed, the parser generates a query plan and an operator tree as the outcome of the parsing phase. The table scan operator is the first one to read your data, a filter operator filters the data, then the column projection happens and the user sees the data.
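Before getting to the anonymization operator that we plug into this tree, here is a hedged sketch of the kind of field-level compliance metadata a catalog like DataHub can supply for a dataset. The record shapes and field names below are invented for illustration; they are not DataHub's actual data model.

```java
// Hypothetical, simplified view of field-level compliance metadata for one dataset.
// The real DataHub model is much richer (lineage, ownership, schema versions, etc.);
// this sketch only captures what the compliance transformations would need.
import java.util.List;

public class DatasetComplianceMetadata {
    public enum FieldAction { NONE, NULL_OUT, OBFUSCATE, FILTER_OUT }

    /** One annotated field: its name, its PII classification, and what to do with it. */
    public record FieldAnnotation(String fieldName, String piiType, FieldAction action) {}

    public record DatasetAnnotation(String datasetName, List<FieldAnnotation> fields) {
        /** Annotation completeness is a prerequisite: every schema field must carry an annotation. */
        public boolean isComplete(List<String> schemaFields) {
            return schemaFields.stream()
                    .allMatch(f -> fields.stream().anyMatch(a -> a.fieldName().equals(f)));
        }
    }

    public static void main(String[] args) {
        DatasetAnnotation pageView = new DatasetAnnotation("tracking.page_view", List.of(
                new FieldAnnotation("memberId", "MEMBER_ID", FieldAction.OBFUSCATE),
                new FieldAnnotation("ipAddress", "IP_ADDRESS", FieldAction.NULL_OUT),
                new FieldAnnotation("pageKey", "NONE", FieldAction.NONE)));
        System.out.println(pageView.isComplete(List.of("memberId", "ipAddress", "pageKey"))); // true
    }
}
```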
Into this operator tree we add something called the anonymization operator, and this is where our intelligence lies about how to anonymize the data. The same table scan operator scans your data, and even before filters and projections are applied on top of it, the anonymization operator kicks in. It takes the following inputs: the metadata provided by DataHub, the query context, and any other privacy settings that might be configured at the dataset level. What comes out as the output is anonymized data; the data stays encrypted. The recipe, that is, the means of anonymizing the data, is built into this anonymization operator. Also, when the scan operator reads the dataset, it goes via the Dali layer, which takes care of the one-to-one mapping to the underlying data. The anonymization operator is completely abstracted from users; it is deployed onto our clusters, so when a user queries a dataset, the query goes through this route, the data gets anonymized, and the user always gets back anonymized, encrypted data. So that is one of the strategies.

We have also seen that we have diverse datasets and diverse file formats, structured and unstructured data; we also have URNs and composite types at LinkedIn. So here is one of the standardized approaches we follow as well: along with the metadata provided to you via the metadata catalog system, you also carry metadata along with the data itself, embedded in its format. For example, if there were contracts being sourced from various channels, say channel 1000 has contract 100, the identifier itself encodes that. If you consider a retail example of customers buying items, that can be a customer column and an item column, the item bought by that customer. The advantage we get with this is that even without the data catalog layer, the data and its metadata travel together. This is one of the better practices that has been in place for a very long time. We have spoken about the data catalog and about the Dali layer; now DataHub talks to the Dali layer and the Dali layer speaks to the data, so together we have a good harmony there.

Now, what Gobblin does for us: this is how the Gobblin framework works. Gobblin essentially breaks down your big chunk of work into smaller, achievable chunks, and each work unit deals with extracting, transforming, and loading your data into the destination cluster. The reason this slide is important is that one of the strategies in data anonymization is to create a copy of your primary data lake. Your primary data lake has zero access for the user community, and you want the user community to access only the anonymized data. So you can use tooling such as Gobblin to extract your data, apply your transformations at the time of extraction, and then load the processed data into a secondary data lake containing your core datasets. That is one way of achieving anonymization.
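The batch strategy just described might look roughly like the self-contained sketch below: the job is split into small work units, and each unit extracts from the primary lake, anonymizes the records in flight, and loads them into the secondary lake. In production this is what a Gobblin pipeline with a custom converter would do; the WorkUnit and Record types here are invented for illustration and are not the Gobblin API.

```java
// Minimal self-contained sketch of the batch approach: copy data from a restricted
// primary data lake into a secondary, anonymized data lake, transforming records
// while they are in flight. Classes and the obfuscation are placeholders.
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class BatchAnonymizationJob {
    record Record(Map<String, Object> fields) {}

    /** One small, independently schedulable chunk of the overall copy job. */
    record WorkUnit(String sourcePath, String targetPath) {}

    static Record anonymize(Record in, List<String> piiFields) {
        Map<String, Object> out = in.fields().entrySet().stream()
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        e -> piiFields.contains(e.getKey())
                                ? Integer.toHexString(String.valueOf(e.getValue()).hashCode()) // stand-in obfuscation
                                : e.getValue()));
        return new Record(out);
    }

    static void run(WorkUnit unit, List<String> piiFields) {
        List<Record> extracted = extract(unit.sourcePath());     // E: read from the primary lake
        List<Record> transformed = extracted.stream()            // T: anonymize in flight
                .map(r -> anonymize(r, piiFields))
                .toList();
        load(transformed, unit.targetPath());                    // L: write to the secondary lake
    }

    // Stubs standing in for actual HDFS readers/writers.
    static List<Record> extract(String path) { return List.of(); }
    static void load(List<Record> records, String path) {}

    public static void main(String[] args) {
        run(new WorkUnit("/primary/tracking/page_view", "/secondary/tracking/page_view"),
                List.of("memberId", "ipAddress"));
    }
}
```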
The second way of achieving anonymization is real-time anonymization. The downside of the primary, batch-copy approach is that whatever data you have on your primary cluster has to be replicated onto the secondary data lake cluster, which means your data footprint is twice the size; you need the same capacity as your primary cluster for your secondary cluster, at least for your core datasets. So how could we solve that problem through real-time anonymization?

One framework that can be leveraged here, one of the strategies already provided by Hive, is LLAP, a long-lived process provided by the Hive framework itself. What LLAP does for you is cache your data as well as your metadata, and it can cache the encrypted or anonymized data too. Each time a dataset undergoes a transformation, the data and the associated metadata are cached in this layer, because LLAP is a long-running daemon available on all your nodes. With this, you can achieve anonymization at read time without duplicating the data lake.

The second approach is the Dali reader proxy. You can read more details on the Dali reader in our open source communication, the LinkedIn engineering blogs and presentations. With the Dali reader proxy, all direct HDFS accesses are removed; if a user has to read any dataset, all accesses go through the Dali reader proxy route. The proxy acts as a super user towards the storage and, based on your authentication and authorization privileges, gives you either an anonymized view or the raw view. So the control of authentication, authorization, and which view is presented to a user at read time is decided by the Dali reader proxy. That is the second approach we wanted to talk about.

The third approach is to leverage the Dali writer. Just as we achieved unification in reading all our heterogeneous datasets from heterogeneous tech stacks through the Dali reader, the Dali writer is yet another approach where users can achieve anonymization: the logic of anonymization is embedded into the Dali writer itself, so whenever data is persisted, it is automatically anonymized and then written. If needed, the Dali writer can also handle a parallel write of two copies, one anonymized version and one raw version, and the end users can only access the anonymized version, via the read paths we just covered.
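As a small aside before we move on to monitoring, here is a minimal sketch of the reader-proxy decision from the second approach above: a single choke point authenticates the caller and decides whether to serve the anonymized view or the raw view. The class names, the privilege check, and the path layout are all assumptions made for this illustration, not the actual Dali reader proxy.

```java
// Hypothetical sketch of a reader proxy: every read goes through one choke point that
// checks the caller's privileges and resolves either the raw or the anonymized view.
import java.util.Set;

public class ReaderProxy {
    /** Very small stand-in for an authorization service. */
    private final Set<String> rawDataReaders;

    public ReaderProxy(Set<String> rawDataReaders) {
        this.rawDataReaders = rawDataReaders;
    }

    public String resolveView(String user, String dataset) {
        // The proxy acts as a super user towards storage; callers never touch the files directly.
        boolean privileged = rawDataReaders.contains(user);
        // Only compliance/legal workflows would ever be granted the raw view.
        return privileged ? "/secure/raw/" + dataset : "/serving/anonymized/" + dataset;
    }

    public static void main(String[] args) {
        ReaderProxy proxy = new ReaderProxy(Set.of("compliance-svc"));
        System.out.println(proxy.resolveView("alice", "tracking.page_view"));          // anonymized view
        System.out.println(proxy.resolveView("compliance-svc", "tracking.page_view")); // raw view
    }
}
```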
Now there is also a monitoring angle we wanted to cover. We discussed the data space: we have different types of datasets and a variety of dataset formats, and, as we have explained in this presentation, every dataset has to adhere to the Dali format; Dali is the answer to the unification problem. The first prerequisite is annotation completeness: we want all datasets to have 100% annotation completeness. Without annotation completeness, users will not be able to access the dataset at all, because the only supported means of accessing a dataset is through the annotated, Dali-backed path.

Once the annotations are in place, we determine and monitor the health of a dataset based on four KPIs. The first is data staleness. Data staleness means you have a raw copy, but that raw copy has been updated, either because of a backfill or because some annotations have changed for your dataset, and those changes have not yet been applied to your anonymized data lake. Such changes also have to be applied on top of the anonymized copy, so we use this KPI, data staleness, to keep track of how many such datasets are lagging. Our SLA is as short as four hours, to ensure that all user-facing data across all the clusters has its freshness within the promised bounds.

The second is data incompleteness. We understand that, no matter how many rules we impose, there may be legacy datasets that do not fully adhere to our data model. What we do is allow a 1% tolerance: if up to 1% of the records in your dataset have a data type that is incompatible with the PII annotation, we drop those records. But if it is more than 1%, the dataset is never made available in the anonymized data lake.

The next category is the missing annotations KPI. Missing annotations means your dataset is available, but either the users have not provided their annotations, or the annotations do not go hand in hand with the data types. For example, you have an ID which is an integer type, but the annotation says it is a string; these are incompatible. In that case we do not even attempt anonymization for the dataset, which means users will not be able to access it until the annotations are corrected. That is another KPI we use to keep track of such datasets and keep the users informed.

The last one is data locking. We have a stringent data model review committee, with stewards and shepherds who review our data and its metadata; the metadata is always checked in, and we follow a check-in process for all metadata. In spite of that, we have detection algorithms that keep crawling the data to see if there are any dataset annotations that do not agree with the data. When we find such a case, we immediately lock those datasets, restrict access to them, create a ticket for the owners, and involve our security team; only after the necessary annotations have been corrected, or a justification has been provided, is the dataset reopened for user consumption. So this is the monitoring story around anonymization.
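As one concrete example of these KPIs, here is a tiny sketch of the data incompleteness rule described above: at most 1% of records with types incompatible with their PII annotation are dropped, and beyond that the dataset is withheld from the anonymized lake. The 1% threshold comes from the talk; everything else (names, verdicts) is invented for illustration.

```java
// Hedged sketch of the incompleteness check: within the 1% tolerance, incompatible
// records are dropped and the rest is published; above it, the dataset is withheld.
public class IncompletenessCheck {
    public enum Verdict { PUBLISH_WITH_DROPS, WITHHOLD }

    private static final double TOLERANCE = 0.01; // the 1% tolerance mentioned in the talk

    public static Verdict evaluate(long totalRecords, long incompatibleRecords) {
        if (totalRecords == 0) {
            return Verdict.PUBLISH_WITH_DROPS; // nothing to anonymize, nothing to withhold
        }
        double badFraction = (double) incompatibleRecords / totalRecords;
        // Within tolerance: drop the incompatible records and publish the rest.
        // Above tolerance: the dataset never reaches the anonymized data lake.
        return badFraction <= TOLERANCE ? Verdict.PUBLISH_WITH_DROPS : Verdict.WITHHOLD;
    }

    public static void main(String[] args) {
        System.out.println(evaluate(1_000_000, 5_000));  // 0.5% -> PUBLISH_WITH_DROPS
        System.out.println(evaluate(1_000_000, 25_000)); // 2.5% -> WITHHOLD
    }
}
```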
That is pretty much it. In view of time, we will skip the differential privacy portion; how we add noise to our data will be covered as part of a blog post. With that, we are done with the presentation. Thanks, everyone, thanks for your time and for the session.

Okay, so that was a great session, and very informative. We don't have Pratap with us today, but we have one question from Swapnit: is it possible to implement Dali on Hadoop in .NET or application software development? Bhupendra, if you could answer that, please.

So, is it possible to implement Dali on Hadoop in .NET? Yeah, of course it is possible. But then I would say: Dali provides an abstraction today over the Hadoop data lake, and on top of that you can always add a layer which is independent of the language your application is using. So on top of Dali you can have a layer that is language agnostic; you can expose a REST API, or a Thrift protocol, and then your application can be in any language, and Dali just works as is underneath. So is it possible? Yes, it is possible, but I would suggest adding one more layer so that it is independent of whatever your application language is. Yeah.
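To make that answer concrete, here is a minimal sketch of such a language-agnostic layer using the JDK's built-in HttpServer: a .NET (or any other) client would only speak HTTP/JSON to this facade, while the JVM side calls into the Dali-backed read path. The /datasets endpoint and the readAnonymized stub are assumptions made for this illustration.

```java
// Minimal sketch of a language-agnostic facade in front of a Dali-backed read path.
// Any client (e.g. .NET) talks plain HTTP/JSON; the JVM side owns the Dali access.
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class DaliRestFacade {
    // Stub standing in for a call into the Dali/compliance read path on the JVM side.
    static String readAnonymized(String dataset) {
        return "{\"dataset\":\"" + dataset + "\",\"rows\":[]}";
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/datasets", exchange -> {
            // e.g. GET /datasets?name=tracking.page_view
            String query = exchange.getRequestURI().getQuery();
            String name = query == null ? "unknown" : query.replaceFirst("^name=", "");
            byte[] body = readAnonymized(name).getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}
```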