I am Satish KS, and I head engineering at a company called Zeotap. Prior to Zeotap I worked at Ola Cabs, where I took care of the data platform and the fraud prevention platform. I have 18-plus years of experience, and I have spent the last eight of them working in data and big data across various domains: currently ad tech and martech, by virtue of my current company; before that e-commerce; and before that a couple of IoT domains. Today I have a small presentation around my personal thoughts on the NPD draft, the non-personal data bill which the government, through MeitY, is proposing, and what I perceive as its impact on data businesses. A disclaimer here: these are not the views of my company or my organization; they are my personal views.

So with that, a quick intro to the NPD draft; I know a bunch of you in the audience will already have gone through it. First and foremost, non-personal data means a data set that has been stripped of its personal information, or personally identifiable information; that is how the draft under consideration defines it. If you read the draft, it has implications for the whole organization, and when I say the whole organization I mean across the value chain: your engineering, product, infra and security teams, your legal teams, and potentially the teams working on pricing and external business contracts as well.

The second thing that is very striking about the bill is that it defines a threshold, which does not have a number attached to it as of today, and any business collecting data beyond that threshold automatically has to register itself as a data business. Now, with the explosion of data that any online business or tech company sees today, within one or two quarters every business would effectively become a data business in its own right. That is another impact of the bill. And if businesses have to cater to the asks of democratizing and commoditizing data, which may not be part of their core business at all, that requires additional investments across the organization. That is another aspect.

One of the foundational thoughts in the NPD bill is that it strives to create a level playing field between small businesses, medium businesses, and big tech companies. As we go through the presentation, and when we conclude, we will give some thought to whether this bill is really in the interest of, say, a startup or a small business, and whether it has adequate protections to help them. So that is the intro to the non-personal data bill.

Now, a primer on data processing in a data business. Whether you are running data as a service, or you are in some other domain and simply have data processing pipelines, what exactly happens there? There are three pillars to this whole data aspect. The first pillar is sourcing. Let me define first party: first-party data is the data a company collects about its own customers.
A classic example could be Ola collecting customer info through rides, registrations, and so on. The sources of this data could be web SDKs or app SDKs, as well as walk-ins, discount coupons, and subscriptions; so there are a bunch of online as well as offline sources for first-party data. From a third-party business angle there is a complete flip: you have specialist sourcing teams that scout around various aspects of the data, whether it is the right data, whether it is in the right format, what integrations are required, and what the contractual requirements are, and based on all of these they source the data. This is data that some company B is collecting from some company A; it is company A's first-party data, and once it comes across it is termed third-party data. That is the sourcing aspect of the problem.

The second pillar is refining. There is enough literature, and plenty of data engineering talks, around what refinement of data means. In refinement you standardize, in the sense that you standardize into a common taxonomy; you do mapping, transformation, and cleansing of the data to weed out the wrong data points and get clean data into your system. Then you apply a bunch of enrichments on top of it, where you can add heuristic- or AI/ML-driven intelligence. Then come the quality makeovers, whatever is needed: statistical quality checks, anomaly detection, and similar quality requirements. Finally, you add something called temporality. Temporality is nothing but the time dimension, because almost all the data you collect, I wouldn't say all of it, but easily 90%, loses value over a period of time, so you have to attach the time dimension to the data. From there you move on to creating consumable data sets that can be delivered to the end consumer.

Now delivery, if you think about it, is more or less the reverse of your sourcing problem: you have the same kinds of challenges in terms of formats and integrations. You may have push-based or pull-based mechanisms, batch mode, streaming, API-based mechanisms, or some customer might say, "I need a self-serve, discoverable tool on top of your data set where I can define my own criteria, create my own data sets, and download them for my consumption." Those are the various modes of delivery. Then the non-functional aspects come in, namely security and reliability. When I say reliability, it is about classic SLA management: for example, you could have a contract saying that in the first week of every month you will ensure a data dump is available, so what happens if you don't honor that SLA? These are the non-functional aspects of delivery. If you think about it, this whole structure is very similar to what happens in the oil industry, and it is intended to be so.
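To make the refine stage concrete, here is a minimal Python sketch of that flow, standardize, cleanse, enrich, then stamp temporality, assuming simple dict-shaped records. The taxonomy table, the cleansing rule, and the visits-based segment heuristic are all hypothetical illustrations, not anything the draft or any particular product prescribes.

```python
from datetime import datetime, timezone

# Illustrative taxonomy mapping: source-specific field names -> common schema.
TAXONOMY = {"email_addr": "email", "mail": "email", "cell": "phone"}

def standardize(record: dict) -> dict:
    """Map source-specific keys onto a common taxonomy."""
    return {TAXONOMY.get(k, k): v for k, v in record.items()}

def cleanse(record: dict):
    """Weed out obviously bad data points (hypothetical rule)."""
    if not record.get("email") and not record.get("phone"):
        return None  # no usable signal at all; drop the record
    return record

def enrich(record: dict) -> dict:
    """Add heuristic/ML-driven attributes (stubbed with a toy rule)."""
    record["segment"] = "high_intent" if record.get("visits", 0) > 5 else "casual"
    return record

def add_temporality(record: dict) -> dict:
    """Stamp the time dimension so downstream consumers can decay value."""
    record["observed_at"] = datetime.now(timezone.utc).isoformat()
    return record

def refine(raw_records):
    for raw in raw_records:
        rec = cleanse(standardize(raw))
        if rec is not None:
            yield add_temporality(enrich(rec))
```

A real pipeline would of course run each of these stages as distributed jobs rather than per-record functions, but the ordering, standardize before cleanse before enrich, is the part that carries over.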
So data business more or less rests on these three major pillars: sourcing, then refining the data, that is, curating the data sets, and finally delivery. With that, if you are already a data business, you will have pipelines doing exactly this; and if you are in a different domain, say ad tech or fintech, you might be internally running similar pipelines to solve internal growth, revenue, or fraud detection use cases, so these same structures and constructs will exist within your company.

Let's start with the data catalog, or the metadata derivation problem. You might have a hundred sources, internal or external, and each source will have its own metadata around how it collects the data and what value types it feeds into your system, so you need a catalog of all of these. That is the source catalog perspective, which is what I am calling the raw data catalog. Once you refine the data you might add ten or eleven columns to it, saying "I have refined it, I have enriched it, and these are the additional things I have figured out about this data set"; that creates a whole new body of metadata, the refined data catalog. Then, when you are curating data sets, there may be additional views, inferences, and derivatives on top of the curated data which you serve to various end customers. Suppose you have ten customers: each may not consume the same cut of data, so each consumes its own data set, which has its own catalog implications, with tags around who is consuming it and what its source is. You may not run into a major catalog issue unless you attempt a complete denormalization, that is, a union-all of every dimension of the data you collect across all sources into a single denormalized table for consumption, which is practically not possible.
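To illustrate, here is a sketch of what one entry in such a layered catalog might carry, assuming a simple dataclass-based inventory; all the field names are my own invention for illustration, not mandated by the draft or by any cataloging tool.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    dataset: str                      # e.g. "rides_raw", "rides_refined"
    layer: str                        # "raw" | "refined" | "curated"
    source: str                       # upstream system, provider, or parent dataset
    version: str                      # current published version
    logical_path: str                 # where the data logically lives
    physical_path: str                # where it physically resides / flows through
    columns: list = field(default_factory=list)
    tags: dict = field(default_factory=dict)   # compliance, cardinality, consumption...
    consumers: list = field(default_factory=list)  # who is consuming this cut

# A curated cut for one customer, tagged with its lineage and consumer:
entry = CatalogEntry(
    dataset="rides_monthly_agg",
    layer="curated",
    source="rides_refined:v12",
    version="v3",
    logical_path="warehouse.curated.rides_monthly_agg",
    physical_path="s3://bucket/curated/rides_monthly_agg/v3/",
    columns=["city", "month", "ride_count"],
    tags={"compliance": "npd-candidate", "cardinality": "low"},
    consumers=["customer_7"],
)
```

One entry like this per source, per refined set, and per curated cut is already a lot of metadata, which is exactly the scaling problem discussed next.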
So from the NPD angle, what are the challenges here? The draft does not clearly mention which catalogs I need to make available to my end user: whether I should expose everything, right from my source catalog through the refined catalogs to my consuming data catalogs, or whether I should invest exclusively in creating a new data set, with its own metadata, for NPD consumption alone. This is, I would say, a gap in the NPD draft: it does not clearly define what catalogs I need to maintain, so that becomes a question a data architect or a data engineering team has to answer down the line.

And there are other common challenges in data cataloging. You have cardinality management: you could have high-cardinality data and low-cardinality data. You will have tags for compliance, for sources, for consumption. A couple of other unique things are the paths where the data resides: there are logical paths and physical paths where it sits and flows through. And there are versions: what is my current version of the data, and which versions have I sent out for consumption? These are all pieces of metadata you need to manage around the data (the sketch above shows where such fields would sit), and it all gets compounded if you have to manage it for the NPD scenario as well.

Another challenge in cataloging is the toolset. As we speak it is maturing fast; we have a bunch of cloud-native as well as open-source tools available for data cataloging, so toolset maturity is a concern, but one that is being addressed rapidly. The third thing is the cultural aspect. Cataloging, or creating an asset inventory of your whole data estate, is typically on the back burner in any organization. Being a tech person, the similarity I would draw is with unit testing taking a back seat during development: development is at 100% of the code while the tests are at 40%, then they catch up to 50 or 60%, and so on. Similar gaps appear in cataloging. So if you have to adhere to the PDP or NPD or drafts of this kind, cataloging has to be given first-class citizenship in your architecture and treatment.

Now think about data processing. Let's look at the life cycle of how a data processing pipeline is created. Typically the business team comes and puts forth a bunch of asks; it goes to the product team, which curates those asks and decides, based on existing and new data sets, whether we need to create a new processing pipeline, whether we need to invest in it, what the life cycle of the pipeline is, what the frequency is, and everything else you can think of from a productization perspective. Now, given that the NPD states I have to make all my data publicly available, meaning I am making it discoverable via a data catalog, think about the scenario: there could be ten, twenty, thirty, forty asks landing on you.
One wants this slice of your data, another wants that item on top of it: they can ask for a subset, an aggregate, or a view of your data, so all of these asks could come in. Now who curates them, and how do they get curated? That is one of the major questions. You could run into a scenario where you, as a business, never needed to run any aggregation or tertiary pipeline at all, and it now becomes a requirement purely because of an external, NPD-based discovery request. And sometimes you may not even be protected from it: if the request runs into a conflict and you go to the ombudsman, and they say you have to make this available, that means you actually have to invest in that new pipeline processing as well.

The next thing: when you are processing these new pipelines, new requirements for each additional consumer, your metadata automatically increases. It becomes a cycle: you had a bunch of metadata, based on that metadata new asks come in, serving the new consumer creates new metadata, and that can trigger another cycle. So it creates a never-ending loop of the metadata derivation problem as well.

The third thing, in terms of processing, is running the proper algorithms. The NPD draft itself talks about a bunch of algorithms: k-anonymity, differential privacy, homomorphic encryption, and so on, which may not be required for your core business at all. Homomorphic encryption may not be needed for anything you do, but by virtue of an NPD-based ask from an external consumer, you may be forced to run that algorithm. Say you want to run a differential privacy algorithm on a million-row data set: that might be feasible. But suppose we are talking about a one- or two-billion-row data set: there is a real possibility you simply cannot run that algorithm. In such cases, where there is a genuine issue and you as a data provider cannot honor the ask at all, what is your protection, and whom do you go to? This is a gap I see that is not addressed in the NPD: can the provider say, "No, it is not possible for me to provide this, whatever the price; it is not within my reach to cater to this ask"? Even if someone has figured out from my metadata that a data set looks possible, it may not be possible for me to actually create it and hand it over. That is what I have captured in the challenges: based on the metadata, do we have any control over what the ask might be?

And if you look at any data processing pipeline, whatever you put into your system, be it first-party or third-party, comes with a couple of pieces of non-functional baggage. You might be running reporting on top of it, whether quality reporting or plain number reporting, and you could have alerting and monitoring on top of it. All these additional pipelines create back pressure in those supporting systems as well. That is another aspect.
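To ground what even the cheapest of those checks involves, here is a minimal sketch of the textbook k-anonymity property in Python, with illustrative quasi-identifier columns and k. Note that even this single counting pass has to touch every row, which is exactly why such guarantees become hard to verify, let alone enforce, at a billion-row scale.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Check that every combination of quasi-identifier values
    appears in at least k rows (the textbook k-anonymity property)."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(c >= k for c in counts.values())

rows = [
    {"age_band": "30-40", "pincode": "5600xx", "rides": 12},
    {"age_band": "30-40", "pincode": "5600xx", "rides": 3},
    {"age_band": "20-30", "pincode": "5601xx", "rides": 7},
]
print(is_k_anonymous(rows, ["age_band", "pincode"], k=2))  # False: one group of size 1
```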
Another thing you have to think about is that any pipeline will have some failures, and the failure rate more or less remains constant. Because the rate remains constant, the absolute number of failures scales with the number of pipelines: if ten pipelines at a 1% failure rate meant one failure to manage, a hundred pipelines at the same rate mean ten. So catering to these new requirements also throws up a challenge for the implementer in terms of how to manage the whole processing life cycle. That is another area that needs thought: when this bill comes to fruition and reaches a full-fledged stage, there should be protections around everything we have been discussing, and we need to figure out how all these additional burdens are handled.

Quality is an interesting aspect, in the sense that every data set has a quality dimension. The quality can be in terms of the absolute quantity of the data, the dimensions of the data, or the value set of the data; all of these contribute. If you are running the data within your own system and your own consumer ecosystem, whether that is one, ten, or twenty consumers, then for any quality issue the blast radius, as it is called in classic security terms, is contained within those systems. Now with NPD, what happens? You give out a data set, that data set can be federated to another party, which can pass it to yet another, and now you are at layer three. Say, in your ingestion, a provider who was supposed to deliver data on a snapshot basis for some reason dumps the whole data set on one day; that nukes a couple of your versioning and quality capabilities, the bad data propagates one or two layers down the consumption chain, somebody's algorithm breaks, and it creates havoc in their system. Who takes liability for these issues? What are the protection criteria, and what legal measures apply? This is something to be thought about.

And when you do have a quality issue: you will have seen on a classic e-commerce website that there is a status page saying "there is a quality issue here, we are rectifying it, and here is the RCA." That is the classic broadcast model: you are a hub connected to a bunch of spokes, and you can clearly broadcast. Now remove the hub; in a freely federated model, how do you broadcast the issue, and how do you stop the downstream layers from doing any harmful processing of the data? That is the challenge I am talking about. The problems it can create, as listed on the left-hand side of the slide, are unnecessary costs across the overall ecosystem, and again it all comes back to the question of who is liable at that point. That is one aspect: propagation of incoming quality issues having a systemic effect on the entire ecosystem.

The second aspect is that anybody consuming the data has to invest in quality gates, in the veracity of the data. Suppose you are collecting a particular attribute from three different data sources; you can get conflicting values. How do you figure out which of the three sources has given you the right attribute? That by itself is a challenge, and any startup trying to solve it will have a bunch of ways to approach it (one naive approach is sketched below).
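As a minimal sketch of such a veracity gate: pick the value whose reporting sources carry the most cumulative trust. The per-source trust scores here are hypothetical placeholders; deriving and maintaining them is where the real investment goes.

```python
def resolve_attribute(observations, trust):
    """Pick a value for one entity attribute that is reported differently
    by multiple sources, by summing a per-source trust weight per value."""
    scores = {}
    for source, value in observations:
        scores[value] = scores.get(value, 0.0) + trust.get(source, 0.1)
    return max(scores, key=scores.get)

# Same entity, three sources, conflicting "city" attribute:
obs = [("source_a", "bangalore"), ("source_b", "mumbai"), ("source_c", "bangalore")]
trust = {"source_a": 0.9, "source_b": 0.5, "source_c": 0.4}
print(resolve_attribute(obs, trust))  # "bangalore" (0.9 + 0.4 outweighs 0.5)
```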
This is not a simple investment; it is a huge investment in terms of how you create an incoming gate and how you protect yourself when you are receiving data. So it is a two-pronged problem: what the producer side needs to invest in, as well as the consumption side. I have put two challenges there, and the veracity one I just described: the same entity arriving from different sources with conflicting data attributes, and what happens then. That is the other pillar we need to think about when trying to implement anything towards these drafts.

On security, the NPD says that security is governed by each vertical, and that if your non-personal data is derived from personal data, whatever laws exist in the PDP automatically apply to the NPD as well. But there are a bunch of problems with how that can work. One is the cascading effect: say one person opts out; how are you going to audit that the opt-out is honored across all the downstream, federated data? There are contractual clauses, but the real question is how you prevent a breach in the first place. Think of it from the customer's point of view: all these bills are trying to increase customer trust, and if something happens three or four layers into your federated systems, that trust is broken anyway. Even though you have a contractual clause protecting your business, saying that whoever leaks is liable, when people trace a leakage back to the source, the trust in the entire ecosystem takes a hit. So if that chain is not working well, this is, I would say, a technical challenge more than a legal one.

Then there is ownership: how is security ownership transferred at each hop, and who is the data owner at each point in time? That becomes an additional cataloging problem: who the owner is at every point, who has granted access permissions, and who has audit rights over whether the security mechanisms are actually working. And in terms of the actual exchange, in a classical data-as-a-service business the specific business team sits one-on-one with the consumption team and jots it all down: "I need three layers of security, or two; I need IPsec; I need a VPN on top of it; you need to encrypt my data with my public key, or with a shared secret key." All these nuances get captured. In the NPD world, who is going to capture those nuances, how will that happen, and how many such contracts are we going to handle? And then there is even key management: take a simple scenario, are we going to create one key per consumer? Any key you create will have to adhere to a key update and rotation policy. How does all of that pan out across the entire federated ecosystem? This is another consideration when you are implementing these drafts or thinking about how they will play out in your ecosystem; it is going to have a major impact on the data security area.
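To illustrate the one-key-per-consumer question, here is a toy sketch of a consumer key registry with a fixed rotation policy. The 90-day period is an arbitrary assumption, and a real implementation would keep the key material in a KMS or HSM rather than in process memory.

```python
import secrets
from datetime import datetime, timedelta, timezone

ROTATION_PERIOD = timedelta(days=90)  # hypothetical rotation policy

class ConsumerKeyRegistry:
    """One data-encryption key per consumer, rotated on a fixed schedule."""

    def __init__(self):
        self._keys = {}  # consumer_id -> (key_bytes, issued_at)

    def key_for(self, consumer_id: str) -> bytes:
        now = datetime.now(timezone.utc)
        entry = self._keys.get(consumer_id)
        if entry is None or now - entry[1] > ROTATION_PERIOD:
            # First use, or rotation due: issue a fresh 256-bit key.
            entry = (secrets.token_bytes(32), now)
            self._keys[consumer_id] = entry
        return entry[0]

registry = ConsumerKeyRegistry()
k1 = registry.key_for("customer_7")          # issued on first use
assert k1 == registry.key_for("customer_7")  # stable until rotation is due
```

Even in this toy form you can see the scaling issue: every new NPD consumer adds a key, an issuance date, and an audit obligation to the registry.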
This is probably a very small point. We saw the three pillars: sourcing, refining, and the delivery or distribution system. You as a business could have IP in the sourcing itself; a classic example could be Yahoo Finance or YCharts or Bloomberg, whose unique IP is access to very high-quality finance data which they are able to sell to their customers. Second, you might provide really good insights on top of your inferences, so the IP could live in your refinement layer. Or the IP could be in your delivery: you are able to deliver to, say, 200-plus or 300-plus channels, and that delivery capability itself is the IP. Or you can have a combination of all of these. From the NPD perspective, this IP more or less becomes commoditized, in which case what would be your USP in running the business? The whole survival of data as a business comes into question, especially if you are a small player. And another thing: when you created that USP, you obviously put some patents around it, so what is the protection around the patent? That is another open area.

Operational effort comes in two flavors. One is in-house pipeline life-cycle monitoring. The other is that you will have multiple support systems for your customers: you might have invested in Zendesk, another team runs on Salesforce, another on Jira, another on Kissflow, any system that does ticket management, support SLA management, prioritization, and so on. So one issue is the support load as the number of consumers increases, and the second is how you federate across these systems. Is there any thought process on how the support process has to work? If there is none, the organization has to start investing in what support it will be able to provide to the additional consumers this bill throws into the picture.

Then there is the very interesting area of pricing. Data pricing is a very, very nuanced and complicated problem; there are specialists who put in day and night figuring out the right pricing. Generally the value perception of data is on a one-to-one basis: actor A looking at your data and your data sets will perceive their value completely differently from actor B, so pricing generally happens one-on-one. It has other nuances too: whether you go with a pay-as-you-go model or a subscription model, and if subscription, what the tenure and the commitments are. These are all classic pricing problems. Now, what the NPD says is that if the pricing is not agreed upon, there will be a person who arbitrates between the two parties to figure it out. That is a complete gray area: how will that work, and how specialized can this person be to price data across multiple verticals, fintech, ad tech, plain product data, weather data, whatever? How will they arrive at the pricing, what are the criteria, and is there going to be any annexure or blueprint that helps with this whole pricing exercise?
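One reason the pricing is so slippery is the temporality we discussed earlier: the same data set is worth less as it ages, and any pricing blueprint would have to codify that. A toy sketch, assuming, purely for illustration, an exponential decay with a 90-day half-life:

```python
def decayed_value(base_value: float, age_days: float, half_life_days: float = 90.0) -> float:
    """One plausible way to price temporality: the data set's value halves
    every half_life_days. Both numbers are illustrative assumptions."""
    return base_value * 0.5 ** (age_days / half_life_days)

print(round(decayed_value(100.0, 0), 2))    # 100.0 at delivery
print(round(decayed_value(100.0, 90), 2))   # 50.0 after one half-life
print(round(decayed_value(100.0, 180), 2))  # 25.0 after two
```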
And there are other things in the pricing context. Typically your SLA agreements carry service credits in case the SLA has a problem. Then you have your own contracts with your incoming data providers, saying they get a share of the revenue, so the revenue you generate from these NPD asks has to flow back into those contracts; those are additional aspects you need to think through. And second, when the receiver goes on to resell your data, what is the pricing there? That runs into what I would call a dy/dx situation, a second-order arbitration challenge about how that pricing works. So that is another thing I want to bring to the table: data pricing is a very, very complex exercise, and to be honest many data businesses, or data-as-a-service businesses, run on low margins, so this is going to be a really sensitive area when the bill comes into force if it is not corrected up front.

Then there is cost across the organization, across the value chain. The items I have listed are the ones we have already talked about: the data processing cost, the cost of creating additional pipelines, the cost of additional aggregations for inferences or derivations. Then you have the transfer cost: whenever you move data out of your system there is a data transfer cost, and if you are bound by an SLA to send data daily, that becomes a recurring cost as well. Additional algorithm runs have a cost too: any algorithm over a large data set, beyond a certain scale, is going to incur a lot of cloud computing cost, or on-prem computing cost, however you operate. So if I have to bear all these additional costs, what are my incentives? I have to give out this data, and giving it costs me money, so am I going to receive a meaningful incentive? Part of that should be addressed by the pricing problem, because the value can also change down the line: the data is worth so much after one month and something else three months later. So how is the incentivization going to work over time, and are you, the providing company, going to be a participant in it? This is something to be thought about for the broader ecosystem as a whole.

And the other thing is that startups are already running into multiple challenges. We talked in the intro slide about one of the foundational principles of the NPD draft: it wants to create a level playing field where everybody can take data and achieve scale as soon as possible. For taking data in, we have seen there are a bunch of challenges: creating new pipelines, verifying quality, putting up the gates, and handling the contractual side. For giving data out, we saw a bunch of processing costs, new aggregations, and security considerations. Now think about it: a startup might be in a completely different business, and within three months it achieves a scale of data at which this draft forces it to register as a data business. Now it has to spend money and resources, time and people as well as actual rupees, to get this done. Does it have those resources?
Big tech does have the resources. In my opinion the bigger companies are better placed to either build these new pipelines or to procure new data and process it faster, so if you think about it, the equation is a bit swayed towards whoever has the deeper pockets rather than towards the startup. In terms of ideation the startup probably has the advantage, but in terms of actual execution they definitely do not stand to gain from this whole thing.

With that, let me conclude my thought process. The topmost item that needs consideration from the NPD angle is the right pricing, and some blueprint or framework for what right pricing should be, with considerations across verticals and across data domains. There are two axes there: the data domain is the horizontal aspect, and the vertical is things like finance data, people-centric data, weather data, or car-centric data. How do you achieve that? I would say it is a hard problem to solve, and it is something that needs to be thought about.

The second, as I said, is the protection for the data provider itself. Of course there are a myriad of laws around, say, finance data. Now suppose a company holds user-centric data as well as finance data, and the NPD forces it to give out an aggregation that blends the two, a dimension spanning both the user and the finance data sets. What liability protection is available to that provider, the source of the data?

Then, if you think about the PDP, there is a usage right: you need to get explicit consent from a customer for a given usage of their data. Would you be able to enforce that usage across the whole ecosystem? We talked about the federated tree and how data flows outward; what if some algorithm down the line runs and tweaks the usage? What happens then? And who carries the onus of audit: is it going to be a government authority, or should I spend my own rupees going out and auditing, every time, whether the whole ecosystem is using the data the right way? That is another concern that has to be addressed.

And anonymization itself is a technical challenge. If you look at the latest technical literature, a truly anonymous data set is very hard to achieve. It is achievable within a contained ecosystem, within your company; but say a person collects aggregate sets from your company and ten other companies and adds some intelligence across the sum of the parts: they may not recover the exact name and identity of an individual, but they still end up with the digital persona of that person (the toy example below shows the shape of the problem). So there is a problem in anonymization itself, and the whole "non-personal" concept, combined with data democratization in the sense of all data sets from all companies being available outside, sounds a bit oxymoronic to me in terms of achieving both together.
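Here is a toy example of that re-identification risk: two data sets, each "anonymized" by a different company, joined on shared quasi-identifiers. The attribute values are invented, but the mechanics are the standard linkage-attack pattern.

```python
# Two data sets, each "anonymized" (names stripped) by different companies.
rides = [
    {"age_band": "30-40", "pincode": "5600xx", "home_area": "indiranagar"},
    {"age_band": "30-40", "pincode": "5600xx", "home_area": "koramangala"},
]
purchases = [
    {"age_band": "30-40", "pincode": "5600xx", "home_area": "indiranagar",
     "category": "premium_fitness"},
]

# Joining on the shared quasi-identifiers narrows two candidates to one:
linked = [
    {**r, **p} for r in rides for p in purchases
    if all(r[k] == p[k] for k in ("age_band", "pincode", "home_area"))
]
print(linked)  # a single, fairly specific digital persona, no name needed
```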
And the last thing we talked about is IP rights. If data, or its derivative, or its inference, is the USP or the IP of your organization, I pretty much believe that this commoditization the NPD brings about, and the scrutiny by the public domain of all the assets you hold, can pretty much nuke that IP or USP of yours. So that is another thing to be addressed: whether data as a business remains viable, especially from a startup perspective. A big tech company with trillions of data points and petabytes of data can probably still survive, go ahead, and create new things; but if you are a data-oriented startup, whether you have a viable business at all comes into question with this particular aspect.