This is Dupendra from LinkedIn, and I'm the technical lead for offline data compliance at LinkedIn Bangalore. Along with me today is my colleague Somia, a senior developer for data deletion in the compliance area, who will join me in today's discussion. In today's presentation we will talk about the various data deletion practices we implemented during the GDPR effort, the reference architecture, and the learnings from it. So let's begin.

Here is a quick agenda for today. First we will define what the right to erasure means, and what it means for LinkedIn. Then we will look at LinkedIn's overall data ecosystem architecture and how data flows from the online world to the offline world and various other places. We will also touch upon the scale of the data. We will see the key technologies in the stack we use for the overall data deletion architecture, and then we will talk about the offline data deletion architecture and how all these technologies together help us solve the right-to-erasure problem. We will touch upon an interesting use case around read-time filtering, and then we will talk about some of the key challenges and the monitoring aspects related to the right to erasure.

So what is the right to erasure? In GDPR terms, the legal definition gives the user the privilege to say "delete all my data": delete all my personal data, without undue delay, when it is no longer necessary or when consent has been withdrawn. That is the legal definition. From an engineering point of view, it means we need the ability to delete a specific subset of the data, or all of the data, when it is no longer needed. For example, at LinkedIn one of the right-to-erasure use cases is that when a member closes their LinkedIn account, all of that member's data has to be deleted within 30 days. That is the promise LinkedIn gives to all its members, and today's presentation will focus on that particular case.

Now let's look at the overall LinkedIn data ecosystem. This reference image shows how data flows across the various online and offline services. A LinkedIn member navigating to the LinkedIn website or the mobile app does all sorts of activities: changing their own personal data, changing preferences, sharing content, liking content, and many other things you can do on the LinkedIn platform. All these activities internally generate a large amount of data, which is captured by the online services. The online services capture this data through two kinds of mechanisms. One is the online stores: we use Espresso, which is a document store, along with MySQL and Oracle, to capture some of the member data. The other is tracking: as the user navigates across pages and performs activities, tracking data is generated in the form of events, and we leverage Kafka here, so all this tracking data is emitted as Kafka events.
Then there is another set of users, LinkedIn employees themselves, who make use of this data through a lot of internal applications and generate other meaningful information, or derived data, on top of it. That is another mechanism, and then there are third-party services which also generate data that comes into the LinkedIn data ecosystem.

Now, in this picture, all of the online data stores get ingested into the offline world, where the elephant denotes the Hadoop world. All the database snapshots are ingested into Hadoop every day, and in terms of storage we leverage HDFS as the on-prem storage technology. So every table's database snapshot is present in the HDFS world, and on top of the base snapshot there are a lot of incremental changes, like a member updating a preference or deleting some records. All of these are also captured as incremental dumps, or change data capture records, and the updates and deletes are applied on top of the already ingested database snapshots. That is one mechanism by which online data comes to the offline world. The other part is that all the activity and tracking data generated as Kafka events is also ingested into the Hadoop world; we use Gobblin as the data ingestion platform to ingest all this data.
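To make the snapshot-plus-increments ingestion just described a little more concrete, here is a minimal, purely illustrative Java sketch of applying change data capture deltas (upserts and deletes) on top of a base snapshot keyed by a primary key. The Delta record, the field names, and the in-memory map are hypothetical; the real pipeline applies these changes to HDFS snapshots at far larger scale, not in memory.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a toy in-memory merge of a base snapshot with CDC deltas.
// The real ingestion applies increments over HDFS partitions, not a local map.
public class SnapshotMerger {

    // Hypothetical CDC record: an upsert carries the new row, a delete carries only the key.
    public record Delta(long primaryKey, Map<String, Object> row, boolean isDelete) {}

    /**
     * Applies deltas, in order, on top of a base snapshot keyed by primary key.
     * Upserts replace (or add) the row; deletes remove it.
     */
    public static Map<Long, Map<String, Object>> apply(Map<Long, Map<String, Object>> baseSnapshot,
                                                       List<Delta> deltas) {
        Map<Long, Map<String, Object>> merged = new HashMap<>(baseSnapshot);
        for (Delta delta : deltas) {
            if (delta.isDelete()) {
                merged.remove(delta.primaryKey());
            } else {
                merged.put(delta.primaryKey(), delta.row());
            }
        }
        return merged;
    }
}
```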
Apart from this, in the offline world we also have data from various other sources, as I mentioned, like third parties, and there is a lot of derived data. When I say derived data, consider the example of LinkedIn recommending jobs, or "people you may know". All these recommendations are produced by the various analytics jobs running on top of this Hadoop data. When all these jobs run, they produce meaningful data which is again stored in various other data stores, some of them being Pinot, search indexes, the graph store, and MySQL databases, and that is how, when a LinkedIn member navigates to LinkedIn.com, the member finally sees the job recommendations and other data.

Now, from the data deletion perspective, when a LinkedIn member closes their account, the data residing in the online stores, the data in motion or in nearline pipelines, the data resting in the Hadoop world, and even the derived data built on top of all of that, has to be physically deleted within 30 days. That is the problem statement we are addressing today. In the offline world we have HDFS as our primary storage, and we are also moving to Azure, so Azure storage in the form of ADLS will be another place where data is stored and has to be deleted.

Now let's go to the next slide and look at the data scale. LinkedIn already has 740 million members and is growing very fast; as I am talking, many more members will already have onboarded to LinkedIn.com. With this many members, we are already processing data at the exabyte scale. At any point of time we estimate around 2.3 trillion messages are in motion, meaning that many data records are being processed at any given moment. We already have 10+ Hadoop clusters, which includes on-prem Hadoop clusters across various data centers, and we also now have Azure clusters where we are landing some of the data. So as more and more members onboard, the data keeps growing very fast, and that makes the overall data deletion problem quite complex.

Now let's look at the technology stack, the key technologies we use to realize data deletion. We mostly use key open source technologies: Gobblin provides the ETL infrastructure, DataHub provides the metadata discovery part, Dali is the data access layer, and Azkaban plays the key role of managing and scheduling the flows. Let's deep dive into each of these technologies.

First, Gobblin. Gobblin recently graduated to a top-level Apache project. It is a distributed data ingestion, or data integration, platform which provides the capability to ingest data from various heterogeneous data sources and load it into various heterogeneous destinations as well. If you look at the Gobblin architecture and the constructs it provides, these are the key constructs depicted in this picture. At the very beginning, the source is the main construct, which defines where the data has to be read from. Your source can be a REST service, an online database, another Hadoop cluster, or Azure. After the source is defined, Gobblin provides a way to split the overall source into various tasks. When I say task, consider an example: you need to process data spread across millions of files, and you may choose to process every file as a separate task. That is what a task means in Gobblin, and it is defined by a work unit. So now we have a source, and we have various work units which are executed in the form of tasks in a distributed environment on top of YARN, which provides the resource and compute needs, and all these tasks execute in parallel.

Within a task, Gobblin provides four main constructs. The first is the extractor: the work unit defines the part of the entire source we are reading in this particular task, and the extractor is responsible for reading that part. Next comes the converter, which provides transformation capabilities. Gobblin is a fully pluggable model, so you can plug in everything: extractor, converter, quality checker, writer. With the converter, the user has the flexibility to plug in their own transformation logic, filtering logic, even schema conversion; many other conversions are possible with the converter construct. Next come the quality constructs, because in a data ingestion job, data quality plays a very significant role.
A lot of data validation checks can be done at this step, and it is again pluggable: users can write their own data validation or schema validation mechanisms, and then decide either to write the data or to filter it out at the quality step itself. The final step is the writer, which finally loads the data to the destination; the destination can be Hadoop, Azure, or any other database. So from the source, all the tasks get executed in parallel, and finally we reach the data publish step. This is where Gobblin provides a way to do state management: at the end of the job you may have some watermark or other state-related information, so that the next time you run the same ETL job on the same source you can resume from where you left off last time. The watermark can be in the form of a timestamp or something similar.

We leverage this overall distributed ETL architecture of Gobblin for our data deletion practices. If you look at how we extend it for purge: the converter step, as I mentioned, is pluggable, so we can write a custom converter here, and we have full flexibility to either filter out a data record or change a data record and then do further processing. In the subsequent slides we will talk about how we leverage this capability in our overall data deletion architecture; a rough sketch of such a converter follows.
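As a rough sketch of this purge extension point, below is an illustrative custom converter that drops Avro records belonging to deleted members. It is written against the Apache Gobblin Converter API as I understand it (exact package names and method signatures may vary by Gobblin version), and the deleted-ID lookup, the `memberId` field name, and the `loadDeletedIds` helper are hypothetical; this is not LinkedIn's actual purge converter.

```java
import java.util.Collections;
import java.util.Set;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.converter.Converter;
import org.apache.gobblin.converter.DataConversionException;
import org.apache.gobblin.converter.SchemaConversionException;

/**
 * Illustrative purge converter: filters out Avro records belonging to deleted members.
 * In a real flow the lookup set would be loaded from the lookup table built from delete requests.
 */
public class MemberPurgeConverter extends Converter<Schema, Schema, GenericRecord, GenericRecord> {

    // Hypothetical: populated from the deleted-member lookup table, not hard-coded.
    private Set<Long> deletedMemberIds;

    @Override
    public Converter<Schema, Schema, GenericRecord, GenericRecord> init(WorkUnitState workUnit) {
        this.deletedMemberIds = loadDeletedIds(workUnit);  // hypothetical helper
        return this;
    }

    @Override
    public Schema convertSchema(Schema inputSchema, WorkUnitState workUnit)
            throws SchemaConversionException {
        return inputSchema;  // schema is unchanged; we only drop rows
    }

    @Override
    public Iterable<GenericRecord> convertRecord(Schema outputSchema, GenericRecord inputRecord,
            WorkUnitState workUnit) throws DataConversionException {
        Object memberId = inputRecord.get("memberId");  // the purge key for this data set
        if (memberId != null && deletedMemberIds.contains(((Number) memberId).longValue())) {
            return Collections.emptyList();  // filter the record out
        }
        return Collections.singletonList(inputRecord);  // keep the record
    }

    private Set<Long> loadDeletedIds(WorkUnitState workUnit) {
        // Hypothetical: read the lookup table location from the job config and load the IDs.
        return Collections.emptySet();
    }
}
```

In the real pipeline, the purge key and the lookup table location would come from the DataHub compliance annotations and the job configuration rather than being hard-coded in the converter.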
The next key technology is DataHub. What is DataHub? DataHub is the tool which provides metadata discovery. At LinkedIn, the metadata of every data set, whether it resides in the online world, the nearline processing world, or the offline world, is present in DataHub. It is the one place where a user can go and do data discovery, find where their data is and the related data sets. Apart from that, DataHub also keeps track of relationships: it knows how the data flows, so as data lands from the online world onto an offline cluster, moves from one cluster to another, and goes through various transformations, DataHub keeps track of the complete data set lineage, how the transformations happen, and how one particular column gets transformed into various other columns and in which form. Not only that, DataHub also provides compliance annotations. Because we need to do GDPR-related data deletion, compliance annotations play a key role. For every data set, DataHub has the PII annotations, meaning for every column whether it can be identified as PII data or not. It also keeps the purge policy, in terms of how the data set has to be purged, and the purge key, meaning, for a particular data set, the key based on which a row can be deleted. So the compliance annotations in DataHub are the key information enabling us to do data deletion with respect to GDPR.

Next comes Dali. Dali stands for the data access layer at LinkedIn. As we know, today there are various file formats, storage technologies, and table formats available. For example, we have Avro, ORC, and Parquet as file formats, we also have data in CSV, text, XML, and various other formats; on top of that we have the Hive table format, Iceberg as a table format, and various other table formats; and, as I said, we have the federated HDFS layer and we also have data stored in Azure. So there are various storage layers, file formats, and table formats, and for a user or a job reading and writing data it becomes a hard problem if the job has to understand all of these formats. Dali plays a very key role there: it provides an abstraction for all data reads and writes, so a user need not worry about the underlying file format or the table format in which the data is stored. We leverage that Dali abstraction in our overall architecture.

Then there is Azkaban, again an open source technology, which we leverage to manage our overall workflows. When we say data deletion, there are various kinds of jobs running: some jobs run sequentially, some jobs depend on other jobs, and many flows together make the overall data deletion complete. For scheduling these flows and getting them executed in a distributed way, Azkaban plays a key role. We use Azkaban for flow scheduling, flow monitoring, flow alerts, the SLA part, and overall job management. Azkaban also provides capabilities for real-time monitoring: if a flow fails or a particular data set fails, various alerts and events are available, so at runtime the user has the flexibility to either resume the job or redo the entire data deletion. All this flexibility comes naturally with Azkaban, and it is a key piece we are using.

This picture shows the overall offline deletion architecture and how we leverage all these technologies together. I will let Somia talk about this. Over to you, Somia.

Hello everyone. We will now go through the offline data deletion high-level design. Let's talk about all the inputs needed for the offline purger to work. The first one is the delete request: whenever a member closes their account, we get a delete request in the form of Kafka events, which are then ingested into the offline data lake by a Gobblin job, and a Dali view is created on top of them. As we saw in the earlier slide, a Dali view is nothing but an abstraction, so that users cannot access the offline data lake data directly. This Dali view acts as an input, and from it we create a lookup table. The lookup table is nothing but the deleted IDs, which we can access quickly given a set of input keys. So that is one set of inputs for the offline purger. The next input is the actual data that needs to be scanned and made compliant. We have data in Azure and data in HDFS, so we have crawlers running over all the offline data lakes, and every day we have a data set summary with the latest partitions available in each of them. These are the data sets the offline purger actually works on. Once the offline purger has all these data sets, it goes to DataHub and gets all the compliance annotations which we talked about earlier.
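Before going into those annotations in detail, here is a hypothetical Java sketch of how a purger could dispatch each data set on its purge policy. The policy names mirror the three options Somia describes next (data restatement, limited retention, manual purge), but all of the types, helper methods, and the 30-day window below are illustrative, not LinkedIn's actual code.

```java
import java.time.Duration;

// Illustrative dispatch on the purge policy fetched from the metadata service for each data set.
public class PurgeDispatcher {

    // Hypothetical mirror of the compliance annotation's purge policy.
    public enum PurgePolicy { DATA_RESTATEMENT, LIMITED_RETENTION, MANUALLY_PURGED }

    // Hypothetical data set metadata as it might come back from the metadata service.
    public record DatasetInfo(String path, PurgePolicy policy, String purgeKey) {}

    public void purge(DatasetInfo dataset) {
        switch (dataset.policy()) {
            case DATA_RESTATEMENT ->
                // Rewrite the data set, dropping rows (or masking columns) whose purge key
                // matches an ID in the deleted-member lookup table.
                restate(dataset.path(), dataset.purgeKey());
            case LIMITED_RETENTION ->
                // No lookup table needed: drop partitions older than the retention window.
                dropPartitionsOlderThan(dataset.path(), Duration.ofDays(30));
            case MANUALLY_PURGED ->
                // The owner purges it themselves; we only audit that it actually happened.
                auditManualPurge(dataset.path());
        }
    }

    // Hypothetical helpers; in reality these would be Gobblin/Azkaban jobs, not local methods.
    private void restate(String path, String purgeKey) { /* ... */ }
    private void dropPartitionsOlderThan(String path, Duration retention) { /* ... */ }
    private void auditManualPurge(String path) { /* ... */ }
}
```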
Let's talk about the compliance annotations in a little more detail. With compliance annotations, data set owners tell us how they want their data to be purged. For example, they can mark their purge policy as data restatement, which means we will actually filter out the rows, or do a column transformation and take out the confidential part. Or, let's say it is test data, or they don't need it after some days of creation; then it might be easier for them to have the whole data set deleted after 30 days, so they mark it as limited retention. In some specific cases they might want to do it themselves, so they mark it as manually purged, but then that is also audited and made compliant. This is how owners tell us how they want their data to be purged.

Let's say they select data restatement. Then they have to tell us what each column contains, so they provide PII annotations for each column, which tell us what ID, or set of IDs, each column holds. Then there is a third thing, the purge key. Here they tell us whether they want a column transformation for the PII field, or whether they are okay with the whole row being deleted. Once we have these things, we take the lookup table we created earlier, compare the IDs in the lookup table against each of the records, and then remove the records or do column transformations as needed. But there are a few cases where the default logic will not work, for example when the confidential ID is nested very deep inside a column, or when one column's value depends on another column's value. In those specific cases, we help data set owners write custom logic, which is then plugged into the offline purger and runs as part of the offline purger itself for that data set.

Now let's say the owner doesn't want data restatement and goes with limited retention. In that case we don't need the lookup table at all; it's very easy, we just go ahead and delete everything that is older than 30 days. Besides both of these pipelines there are also a few more manual purger pipelines. All of these pipelines emit Kafka events for monitoring purposes, and a real-time dashboard shows the status of each of the data sets at all times. If there is a failure, or a leak is detected, appropriate action is taken.

So this was the hard deletion we talked about. Now let's talk about an interesting use case where hard deletion is not necessary. LinkedIn provides an option for members to not have their data used in ads targeting. In this case we won't actually be deleting their data, but the ads targeting pipeline must not be able to access it. So what we do is, at run time, we inject our compliance library into the Dali reader. The Dali reader comes in handy here again because it acts as an abstraction layer and we can also use it to filter out data. The compliance library architecture here is exactly the same as the previous one, but instead of running it as a batch job we run it dynamically, in real time, for that data set. We filter out the non-compliant rows, or do column transformations based on the requirement, and give the user the results. So this was one interesting use case.
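As a hedged illustration of the read-time filtering idea, here is a small Java sketch of a record-reader wrapper that applies a compliance predicate while data is being read. The iterator-based reader and the predicate are hypothetical stand-ins for the Dali reader and the injected compliance library, not their real APIs.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Illustrative only: a reader decorator that filters out non-compliant rows at read time,
// standing in for the compliance library injected into the Dali reader.
public class CompliantReader<T> implements Iterator<T> {

    private final Iterator<T> delegate;       // the underlying (hypothetical) data set reader
    private final Predicate<T> isCompliant;   // e.g. "member has not opted out of ads targeting"
    private T next;

    public CompliantReader(Iterator<T> delegate, Predicate<T> isCompliant) {
        this.delegate = delegate;
        this.isCompliant = isCompliant;
        advance();
    }

    private void advance() {
        next = null;
        while (delegate.hasNext()) {
            T candidate = delegate.next();
            if (isCompliant.test(candidate)) {   // silently skip rows the caller must not see
                next = candidate;
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public T next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        T result = next;
        advance();
        return result;
    }
}
```

A consumer such as the ads targeting pipeline would iterate over this wrapper exactly as it would over the plain reader, and would simply never see the opted-out members' rows.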
Next, let's talk about some of the challenges we have faced in the past. The first one, which is very important, is comprehensive coverage of the data. Offline data is a bit tricky because we have semi-structured data, unstructured data, many file formats, and many table formats like Iceberg and Hive to manage. It's very important that all of the data residing there is compliant in some way: either it is deleted after 30 days, or the restatement is done, or manual purging is done. All of this is collated in a real-time dashboard which shows the status of all the data sets, and if we find a leak, maybe because an annotation is missing or there is a problem in one of the pipelines, appropriate action is taken and tickets are raised immediately.

The second challenge is data scale. Currently we are already at the exabyte level, and we are doubling each year, so it's very difficult to maintain the GDPR SLAs with ever-growing data. What we have done is build an auto-scaling model for resources; it comes with a big compute cost, but that is what we are doing, auto-scaling our resources. We have also split jobs at times in the past, but auto-scaling has worked well for us.

The third thing is purging of data sets without a schema. With a schema it is easier, as we have seen: we can identify the metadata in DataHub and it takes care of a lot of things. Without a schema it is very difficult, because we have to parse entire files and then figure out the confidential data, but we do it for specific data sets so that they are compliant.

The fourth one is violations. Even with all of this in place we can have violations, for example if a confidential ID is nested very deep inside a column that is marked as, let's say, non-confidential, and sometimes even the data set owner is not aware of it, so they have not annotated it properly. For that we have a few pipelines, like sampling and classifiers, which randomly pick up data sets and run classifiers on them to check whether there is any confidential data in them. These are some of the challenges we have faced; there are more, but let's move on.

These are some references for the tech stack we talked about, Gobblin and DataHub, which we use in our pipelines. They are open source, so anybody can use them to build a similar data deletion pipeline architecture. Thank you all, that's it from my side, thanks for joining.

That was a really great session, thank you Dupendra and Somia. We have one question from Swapnil. He is asking: when you talk about LinkedIn data that should be deleted within 30 days, which user data are you talking about? He is asking for some examples of the type of user data.

Can you hear me? Yeah. Okay, so this particular use case example was: when a LinkedIn member closes the account, all of his or her data. In your profile you would have given some data, and then as you explore and do searches on the LinkedIn platform, there might be some more data collected by the LinkedIn platform as part of the tracking events. So when we say the data has to be deleted within 30 days, that means all of this data has to be deleted from all the storage. In today's talk we focused only on a subset of the offline storage, but it has to be deleted from all the places, the online stores, the nearline pipelines, or wherever this data is; so it's all of that data.