Good evening everyone. Thank you for inviting me to the session again, and thank you for taking the time to attend. Today I'm going to take you through the blueprint of how Zeotap achieved GDPR compliance. Compliance has been one of the core values of Zeotap, given we handle telecom data, and when the GDPR regulations kicked in we saw it as a big opportunity to showcase evidence of that. We took a product approach to the whole problem, one that treats compliance as a first-class citizen across all its layers, and my talk today leans more towards the product and tech aspects of things, not the cyber security controls or the compliance process aspects. So with that, let me start sharing my screen.

Let me start this time with a formal introduction, because last week I didn't do that much. My name is Satish KS. I head engineering here at Zeotap, and my responsibilities include engineering delivery and strategy as well as aspects of technical architecture. I come with around 18 years of experience in IT and have been fortunate to spend the last eight years in the big data stacks, and I'm passionate about tech evolution from both the business and the core tech perspective. Another thing I thought I would mention in this forum is that I'm also passionate about my art, Tai Chi. That is my LinkedIn profile, and a quick introduction to Zeotap.

Zeotap is a Berlin-based startup. The company was set up in 2014, but operations started in 2015. We are a customer intelligence platform which enables brands to better understand their customers. We do this by providing three things: a 360-degree view of the customer data, identity resolution, and — in AdTech there is a jargon called activation — we also enable activation. It is a SaaS plus DaaS offering. We were originally a data-as-a-service company because we were only dealing with third-party data, but given the tech stack we built, we repurposed it for a SaaS style of offering where brands can directly manage their first-party data and optionally use the third-party data for further enrichment. Every data company has a theme: we are people-centric data collectors, and the data can be categorized broadly into identity data assets and profile data assets. Another style of categorization is deterministic — what we actually collect based on deterministic identifiers — versus probabilistic: based on our patented algorithms we do some inferences and derivations, and these we stamp as probabilistic data, which we supply to our customers as well. And as I said, we are fully privacy and GDPR compliant. We cater to close to 150 data partners, both on the ingress and the egress side, and the main stacks we integrate with in any company are the AdTech layer and the MarTech layer.

So that is Zeotap. GDPR, by now, is no longer tribal knowledge — given the PDP bill and everything, everybody has a decent level of awareness — but I still jotted down a couple of points based on our early memories, because this is a two-year-old law. It is an EU data protection and privacy regulation, and it came into force on May 25th, 2018. When it came, it had an effect across the whole organization: the business teams, the data sourcing team, product and engineering, legal, as well as the security and IT teams. Every team had some impact because of this law coming into play in Europe.
A couple of broad things it talks about. One is personal data: data collected about any person which is identifiable to him, where he is an EU citizen — that is called personal data. Second is user rights: what rights does the EU citizen have over the data which is collected by any organization? Then access rights, both from the user perspective and from the internal organization perspective: what access rights and security controls have to be in place? It also gives some broad-level recommendations. It doesn't go as far as saying this is the encryption algorithm you have to use or this is the technology you have to use, but it broadly recommends how you should handle your data, how you should transport it, how you should store it — those kinds of recommendations are given. And of course there are audit requirements; with any compliance, any law and any regulation, there are always audit requirements. And the scariest of all: it carries heavy penalties, which could be make-or-break for a startup. If you're not compliant, you can as well shut your shop and go away — that is the kind of penalties it had.

It defined two categories of companies handling data, and Zeotap fell into both: we were both a data processor and a data controller. We became a data controller because internally we were stitching the device IDs of the user with the profile. It may not be directly stitched together, but we had the inference mechanism — that is one of our patents, how the telecom identities are stitched together to the profile of a person. So by that means we also became a data controller, and we had to be compliant from both angles.

Moving on, as I told you, we took a product-ish approach. Whenever these kinds of regulations come in, the first thing you will get from your security and compliance team is a huge Excel sheet which asks you to categorize all the data sets, put everything into the picture and so on. Don't fall into that trap — building a product is sometimes much faster than running an operation like that, and that's exactly what we did at Zeotap. When we read through the GDPR from the legal lens, the business lens and the product lens, we distilled it into a bunch of product requirements, and that is what I have listed here. One of the prime requirements of GDPR is how you handle sensitive data: data about a person's ethnicity, data about his health, and his actual PII, which I have categorized separately. How do we manage that? These come under sensitive data management. Second, it says that by default the consent is opt-out; you have to be explicit about the consent you are collecting. It shouldn't be an implicit consent where you take a single consent for everything and treat it as the user's consent — instead it talks about explicit consent management. Third is how you manage PIIs, the personally identifiable information like IP address, name, email and phone numbers. Fourth is how you manage the user's information: what product are you going to give out for managing his information?
These are a bunch of rights the user has, like the right to be forgotten, the right to erasure, the right to portability, as well as the right to understand what data we hold about him and what processing we are doing. Then access management — I'm not going to cover that in much detail in this presentation, because it was moved entirely to the IT and infra folks in terms of putting access management in place as per the least-privilege access policy, wherever applicable. Then the auditing requirements came in as product requirements, and a couple of the remaining requirements are sometimes very specific to the company. For example, Zeotap has a use case where we create cohorts from the profile data — a kind of segmentation or audience creation. Here we have some additional requirements to protect ourselves, say that a cohort should be of a minimum 10k size; below that, don't export the cohort anywhere outside. And what TTLs apply: if it's a cookie, what TTL do I apply so that I don't have a very stale cookie still in my system, and if it's a MAID, what TTL do I apply. Some companies will always have these kinds of custom requirements, and we had them too, so they also came into the product. Then of course PIA management, and from the security perspective they'll say: if your data is at rest, or if you're transporting data, use these levels of security — you need layer-four security, you need transport-level security, and you also need data encryption, and if it is encryption, these are the standards you have to use. Those kinds of rules come from the security teams as well. And of course, when you're becoming compliant, you need to scan through all your currently existing data sets. This one-time requirement should apply to pretty much any company, as a one-time bootstrap or data cleanup, and it also became a product requirement, handled in a fairly operational manner. After the product was built, we initially ran it on all the data sets, and it became like an internal testbed, if you think about it, for us to figure out whether what we built was working fine or not.

So Satish, I didn't understand cohort. These are some derived attributes for every individual — what was the particular issue here?

Well, the thing is, you have these collections created out of the data sets, right? If you have multiple collections, that is a way you can come back to the original identifier, even if you have pseudonymized or de-identified the person, isn't it? So for cohorts we put up a threshold. Say I'm creating an audience segment and I have a set of filter dimensions — we actually have close to 500 to 600 dimensions. Suppose a person is able to run a query with all these filters and extract 10 records, or even one record. What happens is that privacy is gone: he's able to get to the exact person and the identifier, right? In SQL terms, it's a filter query with a WHERE clause on the columns. So we want to ensure the cohort that comes out of it is of a minimum size, so that it cannot be misused for identification purposes by cross-combining with other data. But of course this is not a machine-learned or intelligent threshold we have kept — a rough sketch of the idea is below.
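As a minimal sketch of that minimum-cohort-size gate — assuming a simple in-memory list of profiles and filter callables; the function names and the constant are illustrative, not Zeotap's actual code:

```python
# Minimal sketch of a minimum-cohort-size gate before export (illustrative names).
MIN_COHORT_SIZE = 10_000

def export_cohort(profiles, filters):
    """Apply the segment filters and export the cohort only if it is large enough."""
    cohort = [p for p in profiles if all(f(p) for f in filters)]
    if len(cohort) < MIN_COHORT_SIZE:
        # Too small: such a segment could be cross-combined with other data to re-identify people.
        raise ValueError(
            f"Cohort of size {len(cohort)} is below the minimum of {MIN_COHORT_SIZE}; export blocked."
        )
    return cohort
```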
We have just kept a plain threshold — what we deemed should be sufficient for us. Does that make sense? Yes, yes, understood.

So Satish, the number of filters, the minimum you are putting, or the different — ? I'm sorry, your voice didn't come out clearly. Are you able to hear me now? Yes, Satish, okay. So I'm asking, is it a minimum size or a maximum size that you're looking at? No, no, it is a minimum size, because when I run the query I'll get some number of records, right — 100 records, or 1,000, or 10,000, or 1 million. If the record count is below that minimum, we won't allow it. Oh, okay. So it's mostly about anonymization — preventing, or at least making it harder, to reverse engineer. If I know someone falls in cohorts 1, 3 and 25, I shouldn't be able to apply that filter in SQL and match it to two records; if it is 10,000 records, the larger the group, the safer you are as far as identifying an individual is concerned. Yeah, that's exactly the principle: the larger the group, the more difficult it is to reverse engineer back to the original identity of the person. Okay.

So moving on. If we split the product into a conceptual model — compliance always acts on a couple of entities, right? Those entities have to be given first-class citizenship across the architecture, and that's exactly what we did. The primary entities here are the user and the data sets themselves. In terms of processing, you need some kind of compliance processing which can permeate across all the layers of processing in your system, across all your products as well. Then everything we talked about on the previous slide — if I want to block sensitive data, if I want to validate a sensitive-data drop rate, or if I have to keep hashing or encryption and validate the hash length — all of that becomes a policy or a rule set, so that becomes another first-class entity in your product. Another thing, based on the earlier requirement: if consent is opt-out, you need deletion, and deletion is not simple in really large data sets, especially when you're storing across multiple data layers — you might have data in GCS buckets, in BigQuery, in some other database. So the whole deletion processor workflow itself needs focused handling, I would say, and the same applies to TTL as well. And the top four I have mentioned are consent, audit, user and the data set. So I'll just layer it as: what are the logical entities here, what are the processing entities, and what are the rule-based entities. Based on this, we moved ahead and started creating the tech architecture.

The way I'm going to present the tech architecture is more of a bottom-up approach: we'll see the various pieces we actually put together and finally how they combine to achieve the necessary use cases. The first thing you need — and without which you should be really fearful to call yourself compliant — is a clean data inventory, a data catalog, and a lineage system as well. Lineage is something a bit specific to Zeotap; we had a use case even before compliance to have lineage, and I'll explain it.
So you need to know who is giving you data: which partner is giving it, what region it is coming from, what the categories of data are — whether he's giving me only identity information, or apps data, or interest-based data, or URL browsing data. All of that categorization has to be there. Then you also need to be aware of what the data contains: what is the registered schema of a data partner, what are the field types — whether it's numeric or text or string, or whether it has to match some regex. Then what is the cardinality: a zip code could be a very high-cardinality item, whereas age can be bucketed into a four- or five-cardinality item, and gender could be just three cardinalities. And what are the expected values or expected regexes, and so on. That is another primary thing you need. Then, how do you describe the data: whether it is a raw data set, whether you have inferred or derived it yourself, and if derived, how it is derived. Next is the version. When data is flowing from system to system, a downstream system might be acting on a completely different version from what the upstream system is currently handling. The whole data flow should be aware of which version it is acting upon at that point in time, so the version and timestamp of the data set are very important.

The last point is the lineage. Why did we need lineage? As I mentioned, we collect data from multiple third-party sources. Suppose for a user, his interest data is coming from one data partner and his income data is coming from another data partner — how do we attribute that? I need to know, at each attribute level, which data partner has contributed to the knowledge of that attribute. This is something we have, and it came in handy when we had to do the opt-out management downstream as well. Then the second important thing is resolution of conflicts. I can have partner A saying a particular entity is of age 20 whereas another data partner says the same entity is 17. 20 is compliant; 17 is a minor, and I cannot technically hold the data. So I use this to give the benefit of the doubt to compliance and say, okay, I won't even process this entity, let me drop the whole record because I'm in doubt. This has other implications as well: there is a whole layer of data quality and, I would say, a patented rule set which helps me resolve the conflict, but it is kind of a priority queue — compliance takes the highest priority, then quality takes the next priority, and so on.

So these are a couple of pieces we had to build — some were already built and some we were building at that time — to achieve this. It is built over an RDBMS and Elasticsearch stack and it has microservice and library support; the library support is mainly for the Spark processing aspects. You will see these two points, the microservice and the library support, repeating again and again. The catalog is updated during onboarding — say when a data partner comes in and his data sets are getting onboarded — is there any problem with my voice? No? Okay. It is updated during onboarding as well as during processing: when the data processing is happening across the layers, you have updates happening to the catalog as well. A rough sketch of what such a catalog entry might hold is below.
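Here is a small sketch of the kind of information described above as one catalog entry; the class and field names are illustrative assumptions, not Zeotap's actual schema:

```python
# Rough sketch of a data-catalog entry: partner, region, categories, registered schema,
# derivation, version/timestamp and attribute-level lineage. Names are illustrative.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FieldSpec:
    name: str                    # e.g. "age", "zip_code", "gender"
    dtype: str                   # numeric / text / string / regex-constrained
    cardinality: Optional[int]   # e.g. 3 for gender, 4-5 after bucketing age
    expected_regex: Optional[str] = None

@dataclass
class CatalogEntry:
    partner_id: str              # who is giving the data
    region: str                  # where it comes from (EU / US / IN ...)
    categories: list[str]        # identity, apps, interest, URL browsing ...
    schema: list[FieldSpec]      # registered schema of the data partner
    derivation: str              # "raw", "inferred" or "derived", and how
    version: str                 # version the downstream systems act upon
    timestamp: str               # when this version of the data set was produced
    lineage: dict[str, str] = field(default_factory=dict)  # attribute -> contributing partner
```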
The classic example is the knowledge attribution per data point, which is an outcome of the processing. Now, when we were building this — I'm talking about end of 2017 and early 2018 — there were no cloud-native tools, or tools like Apache Atlas, which could provide this capability pretty much out of the box. We are now also looking at adopting one of the specialist tool sets and migrating to it, so that the management and downstream involvement become much easier on all the usage aspects. And this has evolved to accommodate more items as well, like the quality stats and verifications. So this is the first bit of technology in the whole architecture that was built. Any questions there? Okay, we'll do one thing — we'll take the questions at the end. Let me run through the slides. Does that make sense? Okay, we'll take the questions at the end.

The second entity is the policy management itself. A policy is nothing but a rule. Say, for this data partner, this is the registered schema; if more fields are coming than the registered schema, raise an alert. If the email is not coming in hashed form, or the hash length doesn't match what a SHA hash should be, then drop that record or raise an alert to the data sourcing team. Those kinds of things. So one part is the actual rules and another is the actions. For actions, we have an out-of-the-box set as well as extensibility to create custom actions: drop action, alert action, null action, and so on. I'll take you through some examples in the next slides to help understand this. It also has hierarchical support for policies — if this policy applies, then this other one also has to be applied. And again, it comes with the same kind of tech stack I mentioned on the previous slide. One additional thing: the policies and actions are not defined by the engineers, they are defined by the domain experts. So we need to give CRUD — create, read, update and delete — via API to power users. That could be your legal team, your account management team, or even your product team. This is one additional thing over and above the catalog; the catalog itself is more or less a self-import thing. I have just put a flattened sample of how the policy table looks in the RDBMS/Elasticsearch; I couldn't include the whole thing because it came out bigger than I expected.

The next thing — you need to understand this concept a bit more; it has two use cases. One is that I am separating the actual runtime parameters and thresholds for the rules from the rule itself. This means that tomorrow, for CCPA, if I have to use a different set of runtime parameters, I can put them against the CCPA laws; if for PDP tomorrow the thresholds and runtime parameters are something else, I can change them. In database parlance this is normalization, which gives more flexibility and evolutionary control as well. So this is the classic equation: the function of the policy plus the function of the parameters which the compliance catalog provides gives you the action. Let's take two examples. Suppose the policy is a schema policy and it is talking about age, and the parameter — the age threshold — is 18. The action executed is the drop action: drop the record. A rough sketch of this policy-plus-parameters-gives-action idea is below, and the second example follows after it.
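A minimal sketch of that equation, assuming the runtime parameters come from the compliance catalog; the function and parameter names are hypothetical, not the actual rule engine:

```python
# Sketch of "f(policy) + f(parameters from the compliance catalog) = action".
def evaluate_age_policy(record, params):
    """Schema/value-level policy on age: drop the record if the entity is below the threshold."""
    threshold = params["age_threshold"]          # e.g. 18, supplied by the compliance catalog
    if record.get("age") is not None and record["age"] < threshold:
        return ("drop", None)                    # drop action: remove the whole record
    return ("keep", record)

# The policy stays the same; only the runtime parameters change per regulation (GDPR, CCPA, PDP ...).
gdpr_params = {"age_threshold": 18}
action, result = evaluate_age_policy({"id": "abc", "age": 17}, gdpr_params)   # -> ("drop", None)
```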
The second example is a device IP address policy. Whether it is present for this data partner is the parameter — it's just a Boolean, present or absent — and present means replace it with null: you take the null action on top of it. Another major flexibility we got from separating out this catalog: the data layer has pretty many encoded parameters. By encoded parameters I mean things like interest — since we are catering to ad tech, many ad tech data partners give interest in the IAB-style coding, which is IAB_1, IAB_2, IAB_3. Like this, you can have custom jargon coming into your data platform, but it may not make sense for your backend or for the UI and user-facing teams, so you need some translation. This catalog helps there — though I have named it the compliance catalog, it is a little broader than that. It holds the blacklists, whitelists and all these translations so that the backend and UI can consume them as well. Yeah, makes sense.

From these three slides, let's see how a Spark processing pipeline works. If you look at it, whatever I talked about in those three slides all falls, in some way, under your data governance layer. I have a couple of other services here, and I'll take a couple of seconds to explain them. The config service is nothing but the actual Spark job properties and Spark config itself — each workflow, each item being processed, needs a separate set of properties, and those are all maintained by the config service. Then the data catalog, which we just talked about, and the policy store, which we just talked about. The path catalog is something interesting: it is for our trigger-based mechanisms. Just to give you the problem statement, all my data partners simply put data into various GCS paths, the cloud buckets, and this path catalog tracks them and helps us trigger the workflows at the right times. Once a workflow is triggered, it also registers the path which becomes available so that the next workflow knows to trigger. So it is a kind of metadata on the pipeline itself. The last one is the compliance catalog, which again we talked about.

What is happening here? Say you have a data partner and a data set of that partner coming in. First, using the governance mechanism, the job figures out the schema-level policies, does the processing, takes the actions, and writes the relevant audit logs. Then it goes into a loop where it looks at the value-level policies. Value-level policies, if you think about it, have to be applied at the record level, the row level. So for each record you look at the value-level policies and take the actions, and after that is done you get a compliant data set. That is how all three of these together get actuated into an actual workflow. We have multiple pipelines like this; I'm just showing a sample of how one Spark process works here.

Just one detailed question here: do you pull the Spark logs from the containers and process them? Yes, into YARN. They go into the YARN logs, and there are more nuances there — there is a kind of pre-process, core process and post-process as well.
And we have achieved at-least-once collection of the logs. That is the way it is done: from Spark it goes into the YARN logs, we route it there, and from that there are agents running which put it into Kafka.

Okay, next comes privacy opt-out and consent. Opt-out means the user is saying, just take me out of your system, and there is nuanced opt-out as well as nuanced consent. You can opt out, and there is the question of which purpose you're opting out from: it could be a blanket opt-out or a purpose-level opt-out. The same applies for giving consent — blanket consent or purpose-driven consent. Zeotap has three modes of collecting consent. Given we are a data controller as well, we are obliged to provide a privacy website as well as an app; that is what I have marked as the Zeotap consent, which flows into my API, into my backend system, and comes into the pipelines. Second is the data partner — by partner I mean whoever is giving me data into the system. This is again a multi-fold mechanism, because some data partners give us cloud transfer and some give us SFTP, so it is similar to ingestion in how I get the opt-out and consent data; there is also a real-time API which they hit when they are doing some syncing as well. All these mechanisms are available for the partners as well as the consumers.

And based on how the consent comes into the system, I have three modes of handling it. If it is Zeotap consent, it is always global, because that is the user directly saying, opt me out of your system, opt me out of Zeotap. That means I have to go and nuke that entity across my data sets, and wherever I have given data downstream, I have to notify them to take him out of their systems as well, because I have given out the data point. Whereas at the data partner level — remember I talked about lineage — I will only nuke the knowledge that data partner has specifically given me, be it identities or profile data. If he has given me an email and some profile data, I take off that email and those data sets; if he has given me an email and a couple of other identifiers alone, I take just those out of my system. The third is the consumer level, which is even more interesting. By consumer I mean our channel partners — Google, Facebook, Instagram and all these folks. Once the user opts out of Google, Google comes and tells me this person has opted out of their system. When I am sending the processed data to them, I have to filter him out at that point. I can still send this user to the other channels, because he hasn't opted out of those, but he has opted out of Google, so I have to filter him from the Google feed alone. That is why we have three kinds of handling: global, partner level and consumer level. All of this is again run by various processes — it is very difficult to put all the process slides here, so I am just giving you a verbal explanation; I hope that explains things, and if you have any doubts ask me after the session. What happens from this consent, I will show you in the next slide: there is a consent object, which is a construct that holds the identities on which the consent has to be applied and the purpose of the consent. And the latest development over the past year is that we also became TCF compliant.
TCF is the Transparency and Consent Framework, which came from the IAB, the Interactive Advertising Bureau consortium. It helps in managing consent at a blanket level across cookies and MAIDs, the mobile identifiers, because those are the identities Zeotap primarily works on: browser cookies and mobile identifiers.

Next — yes? How big is the data set? Are we talking billions, are we talking millions? Any data set I am talking about, anything I am operating on, is in the billions. Just to give you a gist of the data: I have close to 30 billion identifiers in my system and 8 to 9 billion profiles, and for each profile, as I told you, the number of columns can go up to 500 to 600, with nested column support internally. We started as a data company, so we have grown, I would say, a heavy data set and we have to manage that. I'll briefly cover a couple of the tech stacks we are using further down the line.

So this is how the consent pipeline looks. As I told you, I'm getting consent from data partner A — the second one I should probably have labeled data consumer B — and from Zeotap itself. And what is meant by this ID enrichment? If Zeotap is the source, we can get consent from an email ID, from a cookie, or from a MAID. Since I have to do a blanket nuke, I have to figure out all the other IDs linked to that particular email. First I have to hash the email ID — he is submitting his email on the privacy site, and I have to convert it to a hash. Then, based on the hash, I enrich all the identifiers that are linked to it, and then convert everything to the standard format. The same applies for a data partner: if he gives an ID, I have to figure out all the IDs that the data partner has given me in the past which are linked to it, link them all together, and create the consent object and the consent bags. Bags are nothing but the purposes we were talking about. You have a yes/no, and if it's a yes, what are the purposes. Currently, for a no, we are not supporting granular purposes — for a no we just go ahead and nuke it; we don't do any further processing on it. That is why I mentioned deletion as one of the activities. The other is processing: whether processing is allowed or not is delivered by the consent process as a Boolean to any downstream processing. So that is the consent data flow — a sample data flow, though it's not really a sample; this is more or less how production looks as well.

Next comes user management. As I said, since we are a controller, there is the privacy website and the privacy app, and this is how we manage them. The primary keys here, as I have been alluding to for the past 15-20 minutes, are mainly the MAIDs — the mobile advertising IDs, which is the IDFA for Apple and the advertising ID for Google. Then you have the browser-based cookies, the email hash, and the phone hash. These are the four key identifiers Zeotap collects. We don't collect IP addresses as of today, we don't collect actual names of the person, we don't collect SSNs and other such things — those are all on our blacklist. For Zeotap, these are the identifiers we work with.
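Tying the consent object and these identifiers together, a rough sketch of how such an object might be built from an email-based opt-out; the hashing, the linked-ID lookup and the purpose "bags" shown here are illustrative assumptions:

```python
# Sketch of building a consent object: hash the email, enrich with linked identifiers,
# then attach the yes/no decision and the purpose bags. Names are hypothetical.
import hashlib

def build_consent_object(email, consented, purposes, lookup_linked_ids):
    email_hash = hashlib.sha256(email.strip().lower().encode()).hexdigest()
    # Enrich: find every identifier (cookies, MAIDs, phone hash ...) linked to this email hash.
    identities = [email_hash] + lookup_linked_ids(email_hash)
    return {
        "identities": identities,                     # identities the consent applies to
        "consented": consented,                       # yes / no
        "purposes": purposes if consented else [],    # the "consent bags"; a no means a blanket nuke
    }
```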
We had to create a mobile app and the website, and they interact with the backend API, which is based on the Java Play framework — something like Dropwizard and those kinds of frameworks. Given the size of our data set, we heavily use Bloom filters so that identifiers can be quickly checked for presence or absence, and all the identifiers are currently stored in a DB called Aerospike, which is, I would say, a fast transactional OLTP database with fast reads. That is used for all my identifiers for this purpose as well.

Okay, the next item. We covered a bunch of the items in the product slide I showed as well as the consent slides. The next one, which is again very important, is audit. The decision we took is that we'll treat the logging itself as the audit. We take all the logs and store them in buckets so that they can be loaded into an OLAP database of your choice — it could be Redshift or BigQuery or Snowflake, or plain Hive and Presto — and you do your analytics and analysis there. The only key thing is that we didn't want to keep it unstructured; we wanted a common log across the organization, especially for GDPR compliance. So we extracted it out as the compliance logger — a library as well as a microservice. The log grammar includes the items I've listed there: the violation type, which product is giving me that log, what the data flow stage is, what action was taken, the timestamp of when the action was taken, the timestamp of when the violation actually occurred, and a couple of other pieces of metadata around the log. This service is used pretty much across our layers — across Spark as well as the backend layers — and we aggregate the logs into a single place. It is stored in, I think, month-wise buckets, because the compliance logger is not that heavy compared to, say, application logs. This can be loaded and you can analyze what actions happened, how many consents have come in, and so on — you can do some basic analysis and forecasting.

Okay, putting it all together. As I said, compliance is a first-class citizen. This is a complete compliance service: it provides capabilities for blacklist management, sensitive data management, the user data services, and audit management. The compliance workflow is what triggers the Spark workflows and determines which layer has to apply which policies and rule sets. These are my ingestion pipelines on the left-hand side and these are the egress pipelines. This is the only slide I had from before — I used it in our outside-facing deck — so I just reused it; it's a pretty old slide and I didn't rework it. On the top layer are the actual apps Zeotap provides for the opt-out, and there is the consent API layer and the user API layer. The admin layer is what I talked about for the policies and the catalog management; it is not exposed to this web app — it is used only by the power users, via the APIs. And below is a representation of the various stores we have talked about until now that achieve the whole set of use cases. So that is how the architecture looks for the whole product, from a 50,000-foot view.
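Going back to the compliance logger for a moment, a minimal sketch of a common log grammar of the shape described above; the field names and values here are illustrative, not the actual logger's schema:

```python
# Sketch of one structured compliance-log line: violation type, product, data-flow stage,
# action taken and the two timestamps, emitted as JSON for the audit buckets.
import json
import time

def compliance_log(violation_type, product, data_flow_stage, action_taken, violation_ts):
    entry = {
        "violation_type": violation_type,    # e.g. "sensitive_data", "schema_mismatch"
        "product": product,                  # which product emitted the log
        "data_flow_stage": data_flow_stage,  # ingestion / processing / egress
        "action_taken": action_taken,        # drop / null / alert
        "violation_ts": violation_ts,        # when the violation actually occurred
        "action_ts": int(time.time()),       # when the action was taken
    }
    return json.dumps(entry)                 # one structured line, loadable into any OLAP store
```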
Next — with any architecture, you will have a couple of non-functional requirements: how it scales, how resilient it is, whether it is extensible, and so on. From our own experience: when we first deployed this system we had close to 80 workflows; now we have more than 400 workflows, and we are still scaling on a day-to-day basis. Every workflow follows a certain process for how it is initiated in the Spark library and how it takes care of things. Second, we also did an AWS to GCP migration, and it was a fairly simple lift and shift for us, for this whole compliance angle and process as well. Then, when CCPA came into play, we could tweak the policies accordingly for the US, specifically for the California region, to extend and accommodate those aspects as well. Of course there were some changes — I don't remember everything off the top of my head — but as far as I remember we could accommodate everything without engineering effort; just with testing and sanity effort we were able to cover it. So this is largely the summary of how the pipelines have been deployed.

The next item is infra and security validations; I thought I'd briefly touch on this. At Zeotap, all the infra is split region-wise. We are very, very careful about data sovereignty: whatever data is in the EU always stays in the EU, for data storage as well as processing, and the same applies for the US and for India. On access rights controls, we have a couple of certifications which I'll touch on. As I mentioned earlier in the talk, access is based on least-privilege across all the data sets, and we have a chief data officer as well as a security person who do a quarterly audit on this, figure out if there are any exceptions, and pass on recommendations which are taken up by the infra team. Then, we don't mix the ID and the profile except during runtime processing. The ID and the profile are always pseudonymized, in the sense that during ingestion itself, if I get an email ID or a cookie ID, a pseudonymized ID — which we internally call a ZUID, or Zeotap Unique ID — is assigned to each user profile entity. At any point in time, if you want to do analysis — say a general age-bucket analysis, or a general interest-bucket analysis, or how much fill rate I have — you operate on this identifier, which doesn't give away the actual identifiers behind those sets. Access to the actual identifiers is allowed only for the runtime applications.

Another security recommendation is data-at-rest encryption. Pretty much all the data at rest, be it in GCS buckets or anywhere else, is encrypted; we also use BigQuery, which again encrypts data at rest. And the emails and phone numbers are hashed. This has some nuances: we support all three hashes — if the data partner allows it, we use MD5, SHA-1 and SHA-256, and the same applies for phone numbers — and we use upper case, lower case and ignore case. So across the hash combinations there are, in all, nine plus six, fifteen total combinations available in our systems. The fill rate may not always be the same because of the contracts we enter into with the data partners. These are a couple of other items I thought would be useful for this forum, so I presented them as well. And the hashes are all validated — a small sketch of that kind of validation is below.
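A rough sketch of validating hashed identifiers by expected length and PII fields by regex; the length table and the regexes here are illustrative assumptions, not the production rules:

```python
# Sketch: hashes are validated by hex-digest length, emails and phone numbers by regex.
import re

HASH_LENGTHS = {"md5": 32, "sha1": 40, "sha256": 64}   # hex-digest lengths
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_RE = re.compile(r"^\+?\d{7,15}$")

def is_valid_hash(value, algo):
    """A hash must have the exact length for its algorithm and contain only hex characters."""
    return len(value) == HASH_LENGTHS[algo] and re.fullmatch(r"[0-9a-fA-F]+", value) is not None

def is_valid_email(value):
    return EMAIL_RE.match(value) is not None

def is_valid_phone(value):
    return PHONE_RE.match(value) is not None
```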
Every hash has a length parameter, so you can validate it easily, and for the email and phone number there are regexes against which you can validate. So these are all IDs you can validate, unlike a cookie, which can be any random UUID. For these identifiers, which are PIIs, you have some levels of validation on top as well. GDPR doesn't prescribe any certification out of the box, but if you have these certifications it enhances your stance in saying, okay, we can believe this company is compliant. We have gone through ISO 27001 certification for the past three years and did a re-certification this year as well. The same applies for the CSA STAR certificate from BSI — I don't remember exactly, it's the British Standards Institution or something like that. Then the ePrivacy Seal, which I think comes from one of the EU forums — we have that as well. I don't know what will be applicable for the PDP down the line, whether there is going to be some other ISO stack or whatever, but given all the stacks we have built, we are very confident that we could sail through the PDP bill as well.

As for current additional developments: if you look back through the slides, there were a couple of manual processes which can induce human error — for instance, when a data partner is onboarded, there could be manual errors in stating that the schema has such-and-such items while additional items come in. There are some validations present there, of course, but human errors can still creep in. What we are looking at now is NLP and ML based scanning of the data sets and its application. There are major cost challenges in productionizing this, but this is active development going on for both the data quality and the compliance aspects, across the data layers in the organization. And yeah, that ends my slideshow; we can move on to the question and answer session. Hopefully there were some useful pointers and takeaways for everybody. I'll just stop sharing unless anybody asks for anything. Thank you.

Thanks, Satish. There are actually quite a few questions, so let me try to club some of them together. Let me start with the most recent one, from Tosif. He asks: since you collect data from user browsers directly, how do you deal with unstructured data like password reset tokens, which also get collected and can be used to uniquely identify a customer?

Yeah, so currently we are not collecting passwords, from the third-party perspective. That falls under the first-party purview, which is under development currently — the authenticated traffic solution. Once that comes in — we don't have a solution for it today, to be honest; we are looking at various aspects of how to manage that as well. This space is divided between, say, first-party data collectors — for example, any of our recon providers are first-party data providers — and third-party aggregators. Being a third-party aggregator, we have some contractual clauses saying that I won't take this data from you at all, and that is what protects us. But of course, once we venture into first-party data, these are challenges we are going to face.
Tosif, if you have a further question, you could either put it in the chat or feel free to unmute yourself and ask. I'll move on to the next question, which is likely related, from Naushad: what does GDPR say about data which is scraped from social media from publicly available posts? Naushad, do you want to ask your question directly? It might be open to interpretation also. Yeah, so an organization collects data by scraping social media posts — what is the GDPR stand on that?

Okay, I'm not sure of the specific GDPR clause, but if you look at it, at the end of the day you are acting as a third-party data collector there. From the GDPR perspective, if you ask my opinion, you have to register yourself as a data processor, and whatever rules are applicable for a data processor, you need to adhere to them — that is my current take, but it is better answered by your legal team if you're doing that kind of processing. From a first look, you become a data processor there and you have to adhere to all the clauses pertinent to data processing. Another key question to ask here is how the scraping is happening: is it within the knowledge of the social media app or without it? Because the moment it is without their knowledge, you are anyway getting into security territory. If it is with their knowledge, then you would have some agreement with the social media app to be able to scrape that data, and there are some logistics around it — I think YouTube, Facebook, everybody has that kind of licensing agreement available for you to be able to scrape.

So it has to be done under license from the data controller? Yes. Even though it's publicly available, one still has to go through the licensing. We had to look at this question recently for one of our customers, and when we looked at LinkedIn, there was a lawsuit about scraping LinkedIn without having a formal licensing agreement from LinkedIn. It has survived through the US courts, and today it is legal to scrape publicly available information from LinkedIn — but they make it extremely hard to do. They're almost quasi-licensing it by blocking you very aggressively and things like that. But it's legal. Naushad, does that answer your question? Yeah — so there could be a challenge in identifying whether the data subjects are EU residents or outside. Yeah, that is definitely a challenge, Naushad. That is the first exercise, actually: you have, I would say, a large exercise in identifying your data assets and where they come from, and these are all questions you need to ask your tech and product team, or your specialized data sourcing team if you have one. From a security angle this is a valid question, and from a compliance team angle as well, but without the answers to the previous questions — the data inventory questions — it's very difficult to arrive at what action you need to take on those. Thanks.

Okay, so Sameer has a question. Sameer, can you ask your question directly? Yeah, hi Satish. So I was looking at this — there is a lot of talk going on around attaching metadata tags to the core data to identify it as PII, or building it into the system so that, through the tags, the data is in some way pre-identified through configuration.
So while you were describing the architecture, it did not come out clearly: when you say you can identify or use the data catalog across the system, are you using a configuration for each data type, or are you using metadata along with the data? So we use configuration as well as metadata. Let me give you an example. If the lineage is coming in, that is a data-partner-level tag which travels with the data itself. If it's a derived attribute, the derivation goes as additional columns in the same data. And the other thing is the metadata — to give you an example, say for a data partner, this is the schema which is registered; that comes in as metadata and that is useful for doing the validation. I'm not sure if I got that across. Yes, yes, I got it. Cool, thanks.

Now, actually, you can model all of this in a graph database. Atlas gives that abstraction on top of JanusGraph directly, so you can manage both these tag-based use cases and the database use cases by using it intelligently. Yeah, that's there. But then what happens is, if you have not tagged all the backdated data — because you would have started the configuration at some point in time — the challenge a lot of product platforms are struggling with today is the overhaul in the design required to ensure that is enabled before the graph can come in and do something. Yeah — or you build the outside config layer, which can then be used as a schema to map across wherever you find the data. Correct, correct. So this is, I would say, a one-time activity; it is not even a reusable activity. Yes, yes. Unfortunately, we have to go through that one-time activity at least once. And in a way, if you look at it, you build all these stacks, you take your data set and run it through them, and it becomes a test as well for that one-time activity. Absolutely. But that's great when it comes to structured data; unstructured is still a different challenge altogether, and data that is already outside the system is sort of becoming a challenge for everybody now. Correct, correct. Unstructured is always a challenge. The easiest way to tame it is to have some versioning, and for stale data, just take it out of your system or put it in Glacier — don't use it for any active processing. Yeah. One of the things somebody was telling us is that maybe it'll be easier to simply replace all the machines, not give them any access to prior data, and see what they really require — only generate new data. That could be an overhaul. But I don't think something like that is feasible; it was probably just a joke that was going around. Yeah. Okay, cool.

So we have two questions I'm going to club together. One is from Rajat — he says you partially answered the question, but he still wants to check: did you consider open source projects, or are you looking at them now? There are a few popular ones, and he's interested in your inputs on these projects. And then there's one from Naman: why did Zeotap move from AWS to GCP — was it more for cost or compliance? Well, it has nothing to do with compliance per se; it was other business reasons. On Rajat's question: Rajat, we did a POC on the various catalogs available, and through our lens we saw Atlas to be one of the best out there.
Because beyond the basic data catalog — if you remember, I had given some use cases around the path cataloging and other items as well — in Zeotap, given the number of pipelines (we have 400-plus), that itself becomes a management issue. So to manage metadata around things beyond the data's own metadata, we found Atlas much more extensible. Amundsen is a very good product too; you have to fit it to your use case, and I would say run a time-boxed POC. Even the cloud providers have started coming out with data catalogs — Google's just came out of beta some three or four months back, and that is also a decent catalog. Look at your use cases and compare. One thing with all these big data systems: always run some kind of use case through your POC to assess whether it works for you, because everybody will claim "I do everything", and it doesn't work that way. Rajat, any follow-up question, or does this answer your query? I think Rajat may be having bandwidth issues. Venkat, I think you have the rest of the questions, so maybe it's just better if you ask them.

Yeah, I got some of them answered already. Just one thing I wanted to ask: is the lineage at a column level? Yes, lineage is at a column level — no, no, at the attribute level, sorry. Attribute, yeah. Now, when you have new policies — when the policies change — what happens to all the previously computed data sets? So if you look at it, Venkat, since I have a separation of the identity and the profile, my identity is more or less transactional and my profile is more or less snapshots. The profile is processed, and whatever ZUID is there at that point in time is attached to it. So if I nuke the identifiers in a transactional manner, that particular ZUID and its profile become more or less stale for me, and in the subsequent processing, whenever it happens, it automatically dies a natural death. So even though I have that data active in my system for some time, it dies away in the subsequent processing. Okay — essentially, by creating that one level of indirection, you are removing the mapping. Yes, exactly. The mapped IDs are transactional; the aggregated and profile pieces, which have the actual data, are snapshots. That's the way it works.
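To make that level of indirection concrete, here is a rough sketch under the assumption of two simple in-memory stores; the identifiers and structures are hypothetical, not the actual Zeotap stores:

```python
# Sketch: the identity-to-ZUID mapping is transactional, the profile keyed by ZUID is a snapshot.
id_to_zuid = {"cookie:abc": "zuid-123", "email_hash:9f2c": "zuid-123"}    # transactional store
profiles = {"zuid-123": {"age_bucket": "25-34", "interests": ["IAB_1"]}}  # snapshot store

def opt_out(identifier):
    """Nuke the identifier mapping; the orphaned profile dies in the next snapshot processing."""
    zuid = id_to_zuid.pop(identifier, None)
    return zuid   # the profile for this ZUID is now stale and is dropped on the next rebuild
```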