Okay, so there's one comment from Pingali Venkat on YouTube. He's saying one challenge he sees is that data goes through non-SQL channels, the environment tends to be complex, so we need more mechanisms. Do you want to talk about that non-structured data bit?

Yes, this is definitely a challenge, right? And I think Venkat is probably more experienced than I am in this stuff, but machine learning pipelines are pretty new, relatively, compared to data warehouses and databases. They are a different beast, and we definitely have to figure out how we can answer these same questions in machine learning pipelines, and in the lifecycle that data goes through in those pipelines as well. So this is a pretty new field; we definitely need better solutions there, and we definitely should discuss it. I personally don't have experience with this stuff; I'm not an expert in machine learning pipelines. I guess we'll get Venkat to share his views next time.

Yeah, exactly. We should just get him over here and have him talk about it. Anyone else who wants to bring anything up? Devakna, do you want to step in and talk about some of the tools you mentioned while we were discussing how to host these meetups?

Sure, Srinivas. Just before I go there, I'll address the question that was posed just now on tracking unstructured data. My role is largely in that space, being a data scientist, and one of the things we have started looking at is that even when we are working with unstructured data, it is not that there is no cadence around what data we are using for, say, model building. At any given point in time, you are creating a certain snapshot of data. Now, as Rajat was mentioning earlier, data lineage becomes quite important in order to trace what kind of lifecycle your data is going through, what changes have been made, who has been using it, and so on. So, applying a similar pattern to unstructured databases or datasets, what one could do is look at what snapshots in time you are creating for a part of that dataset. For instance, if you're training a model, you are perhaps not using the entire dataset but rather a sample of it, and that sample becomes a snapshot of your data at a given point in time. So in terms of maintaining data lineage, you would version that sample somewhere for future reference. Model versioning, code versioning, and data versioning are still things a lot of machine learning engineers are dealing with these days, and it's not a very mature field. But in some of the projects I'm involved in, we are definitely looking at how to bring in data lineage with respect to not just individual data points, but the datasets being used for model training and other purposes. I don't know if that answers the question that was posed on YouTube.

So I think another way to say it is that the questions are similar, but the way we find the answers is going to be different between unstructured and structured data.

Exactly, yeah. The foundations of it remain the same. If you're talking about it in the context of data lineage, the foundation of data lineage still remains the same, and the goals of data lineage still remain the same. It's rather the implementation: how would you implement, say, data lineage for unstructured, non-SQL databases? Right.
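As a concrete illustration of the snapshot-and-version idea described above, here is a minimal sketch in Python, assuming pandas and a Parquet engine such as pyarrow are installed. The file layout, manifest fields, and fixed seed are illustrative assumptions, not anything prescribed in the discussion.

```python
import hashlib
import json
from datetime import datetime, timezone

import pandas as pd


def snapshot_training_sample(df: pd.DataFrame, sample_frac: float, out_dir: str) -> dict:
    """Draw a reproducible sample for model training and record enough
    metadata to trace a model back to this exact slice of data."""
    seed = 42  # fixing the seed makes the sample reproducible
    sample = df.sample(frac=sample_frac, random_state=seed)

    # A content hash identifies this snapshot independent of its path.
    content_hash = hashlib.sha256(
        pd.util.hash_pandas_object(sample, index=True).values.tobytes()
    ).hexdigest()

    path = f"{out_dir}/sample_{content_hash[:12]}.parquet"
    sample.to_parquet(path)

    manifest = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "source_rows": len(df),
        "sample_rows": len(sample),
        "sample_frac": sample_frac,
        "seed": seed,
        "sha256": content_hash,
        "path": path,
    }
    with open(f"{out_dir}/manifest_{content_hash[:12]}.json", "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest
```

The manifest is the lineage hook: a model registry entry can point at the manifest's hash, so "which data trained this model" stays answerable later.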
I think Venkat says that's a good enough answer. Anyone else here want to raise anything, or bring up your own stories of what you're doing within your setup? It's open for anyone to come up. This is a meetup-style thing and we're just trying to learn from each other. This is the first session, so feel free to chip in with any questions or any stories you have.

So something else that I've been involved in, but didn't get into during the talk, is data obfuscation. You have PII data, and you want to give people access to some of the columns or attributes of those datasets, but not to the sensitive columns. How do you do that? There are a couple of ways you can hide the data. If you have a pretty sophisticated system, you can hide the columns that have sensitive data from people who are not supposed to see them. Or you can show gibberish: you can hash the values so they don't see what the actual data is, and so on. If you're not sophisticated, then you pretty much have to make a complete copy of each table, remove all the sensitive data from the copy, and do that for every table you need. It is surprising how many people take the latter approach. They use a big hammer: "I'm going to make a copy of all my tables and just remove the sensitive data." And this has so much operational overhead, and it also increases costs, because you suddenly have another copy of petabytes of data. So that's another big space where we need to figure out good solutions.
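A minimal sketch of the hashing approach mentioned above, assuming the data sits in a pandas DataFrame. The column list, salt handling, and privilege check are all illustrative assumptions, not any specific product's behavior.

```python
import hashlib

import pandas as pd

SENSITIVE_COLUMNS = {"email", "phone", "national_id"}  # illustrative list


def mask_value(value: str, salt: str) -> str:
    """One-way hash so analysts can still join and count on the column
    without seeing the raw value. The salt raises the cost of reversing
    low-entropy fields like phone numbers by brute force."""
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()[:16]


def masked_view(df: pd.DataFrame, user_is_privileged: bool, salt: str) -> pd.DataFrame:
    """Return the table as-is for privileged users; otherwise hash the
    sensitive columns in place of maintaining a second physical copy."""
    if user_is_privileged:
        return df
    out = df.copy()
    for col in SENSITIVE_COLUMNS & set(out.columns):
        out[col] = out[col].map(lambda v: mask_value(v, salt))
    return out
```

The point of the sketch is the design choice: one logical table with per-reader masking, instead of the "big hammer" of duplicating every table without its sensitive columns.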
Rajat, Venkat raised an interesting point on YouTube: how do we figure out the economics of doing this, considering it might require a lot of skilling, and who's going to pay for it? How can organizations adopt this? You worked with some of these organizations where data governance policies were set up; how was this implemented on the ground?

Yeah, I can give my experience, but it would also be nice to hear from the other people out there who have worked on these projects. I think this whole area is considered a cost center. This is not something that creates value for you, but something that saves money for you. And the way it saves money is by not breaking laws, or by not having a data breach. There are studies out there that put a number of about $3 million on a data breach. So let's say a bank has a data breach; the cost of that is about a few million dollars, and there's a whole process for how they figure that out. The other big cost you save is by not falling afoul of the laws and regulations in your region, be it the EU or the US, and maybe in the near future India as well. So there, you are trying to not spend money; it's considered a cost center. You need the fear of laws, or the fear of all the repercussions of a data breach, to put these systems in place.

That's at a high level. I would love to hear what others have to say about this as well, whether their experience is similar or not. I think this kind of came up in a closed discussion we had a couple of days ago.

I would just add to that. I think that's a great point, Rajat. In my experience too, I've always seen this looked at as a cost center rather than anything else, and hence it always becomes an afterthought, or rather a second priority. However, I would say the blame is perhaps on us, the data community, to some extent, because concrete examples of why this is important, and shouldn't be looked at as just a cost center, are missing from the whole discourse. For instance, the implications of poor data governance feed into something like your data quality, and data quality has a direct implication on the products built on that data and the services the organization provides on top of it. That's a real implication in terms of, say, dollar values, which is perhaps always the first consideration. But that narrative, of how poor data governance feeds into poor data quality and poor downstream systems, is not complete; there are not enough examples. And hence, every time the conversation happens, it becomes a matter of adhering to responsible data practices as an individual effort, rather than something the organization treats as a first-class citizen.

Right. This is just like tech debt, where until your engineering team hits a wall, you don't realize how bad your tech debt is, because it creeps up on you. I think data quality is similar: until you start seeing bad results from your machine learning algorithms or from your reports, that's when you wake up and say, hey, how did we go so wrong? And then it's a rush to get things fixed.

I have Mohan on YouTube asking: when we talk about compliance requirements of GDPR and the right to delete my data, it's very difficult to implement, especially in large enterprises, where you really don't know where data is being stored. Any pointers, recommendations, or tools that one can use, even in the commercial domain? Who wants to take this question? Yeah, I can talk from my experience as well, but I'm assuming others have more experience. If any of you have worked on GDPR compliance inside your firms, please step in.

I can talk about it, though I personally haven't worked on the right to be forgotten. I have been part of discussions just because I have deep experience with the internals of databases and with deleting data from a Hadoop data lake. I think the first step, to be frank, is to get to know where your data is, and address that topic. Let's assume someone's data is stored in different data systems. The hard part here is that some databases allow you to delete, and some databases don't even have the capability to delete. If you have data sitting in MySQL, and you know which tables the data is stored in, it's not that hard to go run a delete command and get it done. The problem occurs when you're storing data in your data lake, either in HDFS or something like cloud storage, which don't really give you good operations for deleting a single row. It's pretty hard. There are a few techniques I have seen or discussed. What it comes down to is that you shouldn't process these people's rows in subsequent marketing campaigns, or include their data in your subsequent reports, analytics, machine learning, and so on. So people obfuscate the data: you put a delete marker there saying this data should not be used for processing anymore; that's one way to do it. And then eventually, when you get an opportunity, you rewrite the whole dataset to actually delete the rows, say once a month. So, stepping back a bit: there are easy solutions for some databases, hard solutions for others, and sometimes you have to use the hammer of a rewrite to be able to delete. But some companies are struggling even at the prior step, where you don't even know where the data is, and you have to use techniques like scanning and data lineage to figure that out.
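Here is a minimal sketch of the delete-marker-plus-rewrite pattern just described, assuming Parquet files read through pandas. The tombstone set, the `user_id` column name, and the compaction cadence are illustrative assumptions.

```python
import pandas as pd

# Tombstone store: user_ids that requested deletion. In practice this
# might live in a small transactional database; here it is just a set.
tombstones: set[str] = set()


def request_erasure(user_id: str) -> None:
    """Record a right-to-erasure request without touching the immutable
    files yet (HDFS / object stores cannot delete a single row)."""
    tombstones.add(user_id)


def read_for_processing(path: str) -> pd.DataFrame:
    """Every downstream job reads through this filter, so marked users
    never appear in reports, campaigns, or model training data."""
    df = pd.read_parquet(path)
    return df[~df["user_id"].isin(tombstones)]


def compact(path: str) -> None:
    """Periodic rewrite (e.g. monthly): physically drop the marked rows
    by rewriting the whole file; afterwards the tombstones for this
    dataset can be retired."""
    df = pd.read_parquet(path)
    df[~df["user_id"].isin(tombstones)].to_parquet(path)
```

The split mirrors the two steps in the discussion: the marker makes the data logically deleted immediately, and the expensive physical rewrite happens later on a schedule.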
Yeah, just adding to that as well. The other question would be: should we have such tools? Because it's a regulation, and there are certain criteria to adhere to for these regulations and compliance, and everyone's data is different. Every organization's data, the way they're storing it, the way they're using it, the way they are managing it, is quite different. So I haven't come across a single-shot solution for GDPR, even in the projects that I or some of my colleagues have been part of. But what definitely helps is to break down what kind of compliance you are looking at. And that would largely also depend on whether you are looking at a greenfield project or a legacy system. Implementing GDPR compliance in legacy systems is much, much harder, at least in our experience, and it is a multi-year journey. A lot of organizations working on making legacy systems GDPR-compliant are mostly looking at proving intent rather than a complete solution, because it is that hard. It is much easier to set up the right checklist and the right cadence when initiating greenfield projects. So what I've typically seen helping is to look at everything that GDPR covers, break that down into primary, secondary, and tertiary concerns, and then look at what tools, technologies, and methodologies exist for tackling each of those concerns. The open source world is quite rich in addressing these specific concerns under GDPR and similar regulations. For instance, if data lineage is one of the concerns, the open source ecosystem provides plenty of tools and technologies for data lineage, data cataloging, and so on.
So it does require taking that first effort of creating the blueprint, and then breaking it down into what you need solved and what is available for it right now.

Also, it becomes easier to identify datasets when we have a single data source. But in bigger companies, as you mentioned, the data leaves one source and percolates down to many different microservices, each with its own ecosystem where the data gets used. And beyond that, if it goes outside the boundary of the organization, say in the fintech world where we have to work with a lot of fintech partners outside the org, maintaining the lineage of that data becomes extremely difficult. For that, I believe one of the major things a lot of companies apply is user consent: the user has to give consent to the organization and be educated that his data will be going outside that boundary as well. That becomes paramount. Otherwise, the user may not even know if data going outside that boundary has leaked somewhere else; if my credentials or my PII data are leaked, I may not know about it. So giving that consent, and that education to the user, also becomes very important.

So a lot of this compliance requirement kicks in only when there are actually codified requirements. And because, say, India's data protection law is just at a draft stage, do you think some of these practices can be codified anyway? For example, we may or may not get a right to be forgotten; that's to be decided by the parliament. But classification of data, for example, is something which can be done right away, right? Because you know what is sensitive data and what's not: sensitive data is mostly personal data, or business data for the industry. Do you think there are any guides one could adopt while doing this, at least in terms of classifying data?

At least in my experience, I've seen a few folks who started late in the journey, and it becomes very difficult to do data discovery. I've been part of one such journey in my company, where it was a year-long program management exercise to figure out the placement of data: where exactly the data is lying, and what kind of sensitive data is in which places. The security principles and the boundary differ in every source, so getting the entire discovery done was extremely difficult; it became a year-long exercise for us. And after that, to arrive at a world where we are compliant with a right-to-forget kind of requirement, where with the click of a button we have to delete a user's data from all these places, it becomes extremely important to restructure a lot of the architectural decisions we made in the past, and to commit to a place where one system understands the whole inventory of these data sources and can send the signals to all of them. So it becomes a huge exercise, and the sooner the better is what I believe.
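As a rough sketch of that "one system sends the signals to all data sources" idea, here is a minimal fan-out registry in Python. The store names and handlers are hypothetical stubs; in reality each team owning a store would register how an erasure signal is carried out there.

```python
from typing import Callable, Dict

# Registry mapping each inventoried data store to its deletion handler.
erasure_handlers: Dict[str, Callable[[str], None]] = {}


def register_store(name: str, handler: Callable[[str], None]) -> None:
    """Each data-owning team registers its own erasure procedure."""
    erasure_handlers[name] = handler


def erase_everywhere(user_id: str) -> Dict[str, str]:
    """Fan a single right-to-forget request out to every registered
    data source, recording per-store outcomes for the audit trail."""
    results = {}
    for store, handler in erasure_handlers.items():
        try:
            handler(user_id)
            results[store] = "ok"
        except Exception as exc:  # one store failing must not block the rest
            results[store] = f"failed: {exc}"
    return results


# Example registrations (print stubs standing in for real systems):
register_store("mysql_users", lambda uid: print(f"DELETE FROM users WHERE id = {uid!r}"))
register_store("datalake_events", lambda uid: print(f"tombstone {uid} in events"))
```

The per-store result map is the important part: proving intent, as mentioned above, means being able to show which sources acknowledged the signal and which still need follow-up.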
Just to add on top of that: as you asked, data classification can be codified, but I think it depends on the kind of business you're running, and the only thing we have is guidelines. Some of it is very common, like PII; around the world the definition differs slightly because of government regulations, but most parts are common. The real challenge is identifying sensitive business data. Somebody has to do that; a central tool cannot codify it, because even for identifying PII, most of the tools fail on Indian or South Asian data, since they fail to recognize local phone number structures and things like that. I think that is going to be a challenge, in my opinion.

So there's that, and then there's the other aspect, which is around what we classify as sensitive: whether something is sensitive on the face of it, or whether it could actually be de-anonymized by joining up different datasets. I think Rajat was sharing a great example earlier around the cab services in the Bay Area. One of the things that I have seen working out, and it has been a very rare situation where it has happened, is when we look beyond technical teams to help classify data as sensitive, private, or perfectly fair to share. What is often missing is the whole user research activity around it. For instance, we could look at direct PII data, like phone numbers, addresses, and names, and classify that. However, different cultures have different contexts of sensitivity. If we look at North America, perhaps not just the exact address but the locality itself could give away which economic background you belong to, or which racial background you belong to. The same cultural context will not apply in India, but you might have other considerations here. For instance, surnames will not give away your actual identity, but they might become a very strong proxy for something like religion, which is sensitive. So there is a user research angle and a cultural research angle that often go missing, and it becomes a purely technical exercise, which actually ends up making it harder as well. In one instance, I have actually seen that bringing in that aspect really helps make this exercise much better and much more efficient.

I think we should do one of those exercises sometime in one of the meetups.

Yeah, that's a great idea. I think this is really a new area, around anonymization; not many companies follow it, and I think it would ease the classification exercise for a specific business.
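To illustrate the locale-aware scanning gap raised above, where off-the-shelf tools miss Indian phone formats, here is a minimal classification sketch. The regexes and the 80% match threshold are illustrative assumptions, not a production-grade scanner, and they cover only pattern-matchable PII, not the join-based de-anonymization risks just discussed.

```python
import re

import pandas as pd

# Illustrative, locale-aware patterns. Indian mobile numbers are ten
# digits starting with 6-9, optionally prefixed with +91 or 0; many
# off-the-shelf scanners only know North American formats.
PATTERNS = {
    "phone_in": re.compile(r"^(?:\+91[\s-]?|0)?[6-9]\d{9}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "pan_in": re.compile(r"^[A-Z]{5}\d{4}[A-Z]$"),  # Indian PAN card number
}


def classify_column(series: pd.Series, threshold: float = 0.8) -> list[str]:
    """Flag a column as a PII type if most non-null values match
    that type's pattern."""
    values = series.dropna().astype(str).str.strip()
    if values.empty:
        return []
    hits = []
    for label, pattern in PATTERNS.items():
        ratio = values.map(lambda v: bool(pattern.match(v))).mean()
        if ratio >= threshold:
            hits.append(label)
    return hits


def classify_table(df: pd.DataFrame) -> dict:
    """Scan every column and report only those that look like PII."""
    report = {}
    for col in df.columns:
        labels = classify_column(df[col])
        if labels:
            report[col] = labels
    return report
```

Extending the `PATTERNS` dictionary per region is the codifiable part; the cultural-proxy cases, like surnames indicating religion, still need the human review described above.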
So at some level, a lot of these design decisions need to be included when you are initiating the project cycle itself. Most of the people who have been advocating this were primarily the security folks; wherever you had a security team, they always considered this part of the lifecycle. I think now you additionally need a compliance team that looks at all of this from the start of the design cycle of any new product. But I don't see them emerging until the compliance requirement is made mandatory by a regulator. Let's just hope that companies realize this and work on it.

It's better to do it from the start than to retrofit it after your product is up, because that increases your costs.

Yes. And actually there's another problem with that. In my experience, wherever I've seen data governance committees and stewards being appointed, they are given responsibilities, but what is often missing is what kind of accountability they have. If something were to go wrong, who would be accountable? Because I think dealing with the money-loss question is probably the easiest one. We have seen enough examples where there were major data breaches, but those organizations are still sustaining and thriving; maybe it was a couple of million dollars in losses, but it didn't shut down the entire organization. However, the one who gets affected is the user. If something were to be leaked, they are the ones who get impacted, but there's no accountability per se for who was responsible. So I think one of the needs of the hour is to look at data governance not just as a convenience, but to anchor it in almost a legal structure, so that if something were to go wrong, the people responsible for ensuring data governance within the organization have very clear responsibilities and very clear accountability defined around their role.

We are almost at the end of our time. If anyone has any closing comments, let's do that and we can end the session. I think the hope is that we have more of these; that's my closing comment.

There was one last question from Venkat on what timelines we are looking at for the PDP bill. I think it's going to come somewhere around the end of the year, maybe, if the parliament sits, but with COVID there is no clarity on that front. We're looking at something that will come out or get operationalized in 2021, at least the later part of 2021. So companies have probably got a year or two to look at this.

Okay. Thank you all for joining us today. We are going to continue these conversations through a Telegram group. We'll send you an invite through the emails you registered with. You can join the group and we can continue this after the meetup as well. We will have some focused discussions at every meetup; I don't know if we're going to do it once every month or bi-weekly, but it will depend on the activity and how much people want to participate. I think Rajat can help with that.

Yeah, there's also a poll, right? We'd love to know what other people want to hear as well. So we have a poll on how the event went; let me just launch that. So yes, the poll is up. Just let us know what you think of this, so that we can improve these sessions and ensure they are focused and suited to your requirements. And send us feedback on Twitter or LinkedIn, for those who don't have access to the poll, I'm guessing people on YouTube and so on. Also let us know in the Telegram group what you feel is missing in this space, or what needs to be covered.