Hey everyone, I'm Athef. I'm a data consultant at ThoughtWorks with more than 10 years of experience building distributed systems, and I've been working in the big data space for the last five years. Outside of work, I have an interest in security, open source and DevOps. This talk is primarily about my experience in the industry with data governance: the strategies that a lot of big organizations take up, how they fare, and the challenges they face. Before we start, I'd like to call out that the data we had in the 1990s and early 2000s was very different. You would typically have a MySQL database or an Oracle database or something similar, with tables on those databases, and the role of governance at that time was to be aware of all of the tables that existed in one or more data warehouses, along with all of their fields and types. This used to be enough for the majority of organizations. As long as you had ACLs and role-based access control on your databases, and some definition of what the columns meant and what the tables were for, most organizations were fairly comfortable with that as their entire data governance strategy. However, if we look at the data landscape now, it has exploded. There is a wide variety of tools and technologies, and a lot of organizations are not using just one of them but several. So this is what a typical landscape looks like, and most organizations are using at least one tool in every category.
So you probably have some BI tool, some tool for data processing, a different tool for data ingestion, and multiple different tools for data science, visualization and so on. In addition to that, we also have the complexity of many organizations now moving to the cloud. That is a huge challenge, particularly in terms of security and infrastructure. Once they move to the cloud, many organizations quickly realize that the complexity explodes, because now they have to manage all of the N different services that connect together to form what was earlier a single hosted platform. And in addition to all of this, there is also complexity in the data itself. Earlier, we mostly had data sitting in tables, databases or data warehouses that dealt with only one particular type of data. You now have data types that are vastly different. You have very different structured data formats, and then you also have semi-structured and unstructured data. It covers the entire gamut of data that you can imagine: log files, text files, images, video, metadata and so on. Given all of this complexity, a lot of organizations have in the past failed to come up with a good data governance strategy, and that has led to a lot of side effects. Things like not being aware of the data that sits on their systems, or of who the owner of the data is. Sure, data quality is a problem. But even when it comes to security, many organizations are not aware of what customer data they hold, how that data is being used, how it is being curated and stored, who is responsible for it, and which laws and compliance requirements go with it.
So over the years, to tackle some of these challenges, many laws have come up to tighten security and data privacy in response to the various breaches and leaks that have happened. You have PDP, GDPR, HIPAA, PCI DSS, CCPA, and all of these different laws and standards that have come into place. And I'd say that even though organizations are not very mature, these laws have been around for the last five or six years, so you would expect some sort of maturity and you would expect these organizations to fare slightly better. We really can't look into the internals of these organizations to assess them on things like quality and some of the other aspects of governance, but security is something that puts their data governance frameworks and strategies in very public view. If you look at the metrics from last year, most organizations are still prone to data breaches. They have suffered damages averaging three to four million US dollars. The time to identify a data breach, and this is one of the major ones, is on average 279 days, and the worst average by industry is one year, in healthcare. What this means is not only that these organizations may not have good security. It also goes to show that a lot of them don't have a clear data governance framework in place to even comprehend the policies they should put in place, whether they are meeting their compliance requirements, and whether there are security lapses or issues in their ecosystem. So why is that? What is it that these organizations are missing when it comes to having a data governance framework? Well, it goes beyond the security aspect.
Without a good data governance framework, most of these organizations are missing out on multiple other capabilities that can improve the overall functioning of the organization and drive growth. They miss out on, of course, being able to track data sets across the organization. They tend to fail at creating ownership and accountability for the data they generate. There is also often an issue of data quality. Most organizations are realizing now that even if they ingest all of their data, if it's not of good quality it will often give them the wrong insights and lead to the wrong business decisions. In addition, they fail to improve productivity for their own teams, and they fail to reduce friction between teams. Without a good framework, teams won't come to know about the data that sits on the system. Most of the time they end up building redundant processes, and they also lack a process for getting access to data even within the organization itself. A lot of large enterprises face the majority of these issues. There are more, but these are some of the major ones that I've seen. In essence, the lack of these capabilities has a very direct impact on how these organizations can scale and grow as they become larger, both in terms of data and in terms of organizational growth, maturity, and the people and skills available in the organization. So let's look at the challenges to solve when it comes to typical data governance. One of the very clear things that a lot of organizations are starting to realize now is the need for cataloging.
There is also a need for lineage, which from what I've seen not a lot of organizations have in place at the moment. Lineage gives you the ability to track how your data is being curated, and it is especially useful if you have very large ETL pipelines where data flows between multiple hops and multiple nodes in the system. Then there is tagging and classification. A lot of organizations are doing this to some degree, trying to make sense of the data sets that they're curating, but the tagging and classification frameworks are not really mature in my opinion. Typically, you want to classify things like the domain of the data that you're curating, who the owner of the data is, what the sensitivity level of the data is, and so on. Besides these three, there is also the concern of enforcing security and policy control over all of your data sets, and of being able to validate through audit that you're in compliance. By this I mean that you should ideally be at a place where you have semi-automated controls that give you this ability out of the box and let you do it in a repeatable fashion. A lot of organizations use GDPR and these various other laws as a driver for their data governance framework, but there isn't a very clear understanding of what it means to be compliant with these laws. And then there are these second or third level concerns, which in my opinion are very important, but which I've rarely seen being given importance. Things like a business glossary, which is the ability to give a description to all of your data sets, reason about them further and build on top of what you already have, and also the ability to have custom metatypes. A custom metatype models the various different ways in which an organization thinks about its data.
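To make the tagging and classification idea above a little more concrete, here is a minimal sketch in Python. It only illustrates the three attributes mentioned in the talk (domain, owner, sensitivity level). The class names, the four sensitivity levels and the example data sets are all assumptions made up for illustration, not something any particular catalog tool prescribes.

```python
from dataclasses import dataclass
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical sensitivity scale; real organizations define their own levels."""
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


@dataclass
class DatasetTags:
    """Hypothetical minimal tag set attached to one data set in a catalog."""
    name: str
    domain: str               # business domain the data belongs to
    owner: str                # team or person accountable for the data
    sensitivity: Sensitivity  # how carefully the data must be handled


def needs_restricted_access(tags, threshold=Sensitivity.CONFIDENTIAL):
    """Return True when the data set is at or above the given sensitivity level."""
    return tags.sensitivity >= threshold


# Illustrative data sets, not taken from the talk.
orders = DatasetTags("orders_raw", "sales", "sales-data-team", Sensitivity.CONFIDENTIAL)
clicks = DatasetTags("web_clicks", "marketing", "web-analytics-team", Sensitivity.INTERNAL)
```

With tags like these in place, a policy layer can ask simple questions such as `needs_restricted_access(orders)` instead of relying on tribal knowledge about which tables are sensitive.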
So for example, if you're following a data product paradigm where different teams are creating data products, you would want to model a custom metatype that gives you the language to describe that paradigm and spread that understanding throughout the organization. And of course, there is the ability to visualize your very large AI/ML and ETL pipelines. It is often the case that an organization has a lot of complex AI/ML and ETL pipelines and jobs that nobody understands to the full degree. Having them in a catalog gives you the ability to visualize them, see them in front of you and understand them better. It also lets you see how your jobs are set up and how they're triggered, and optimize the overall flow, which may sit across multiple teams. So given that there are these challenges to solve, this is the typical architecture or setup that you will see in a lot of organizations. A key takeaway here is that there is no one tool that does every job. You will see a lot of different tools being used for a lot of different functions, and even for each of these functions there are multiple alternatives. The images I showed you earlier covered one set of tools, but there are many, many more open source tools that tackle the various layers of data governance. So there is this notion of complexity when it comes to the tools and the architecture. But what a lot of people don't realize is that there is a lot of complexity in the organizational structure as well. With a framework like this, if you have a large data organization, you will typically have all of these various different roles.
And it is crucial for all of these different roles to function in the best way possible, with minimum friction, and to be aligned with and aware of the data governance framework that you put together for them. Ultimately, they are the customers you are serving, and they are the key people who are going to be using, and partaking in the creation of, your data governance strategy. So far we have talked about the technical challenges and the technical complexities. But let's look at how organizations fare right now and at the business reasons why the industry hasn't really caught on. If you look at where most organizations are today, there is very low maturity, except when it comes to very large organizations like Netflix, Amazon and Google. The industry is definitely catching on, and most enterprises have started to realize this need. But right now there is also a lot of hype around data governance, and most of it comes from a security and GDPR perspective. Additionally, a lot of large enterprises have completed or initiated their moves to the cloud, but what they have failed to realize is that they do not have the right governance in place to tackle that complexity. These organizations often don't understand the different sensitivity levels of their data, how they should be curating it, how they should be securing it, and what they need in order to comprehend these data systems. In addition to that, from what I've seen, a lot of organizations are trying to drive anonymization frameworks where they want to segregate a particular type of data and just be able to say, hey, this data is sensitive and this data is not sensitive.
But the anonymization frameworks are limited to very simple things like, hey, let me put this in a separate S3 bucket, for example. People don't really think about what it means to truly anonymize data and to secure that anonymized data. They think about access to the anonymized data, but not about the infrastructure, not about the people who are using that data, and so on. So why is this the case? Why are organizations not able to fare well, and why do they have these problems? In my opinion, these problems are a result of three broad categories of failures that I've seen. Most often, at least one, if not all, of these categories apply to an organization, and I've tried to cover only the key issues. If we talk about strategy, one of the key issues I've seen is that people still hold on to a very old way of thinking about data. The first slide, where I spoke about how data warehouses were doing data governance, is still the reality. When a lot of orgs think about data governance, they think of it as being able to slap some role-based access controls on a table somewhere and call it a day. But that is not enough in today's world. In addition, they often end up thinking about the tools and tech stack that they need, but they fail to think about how they're going to change the organization and the culture within it. They also fail to articulate the business value of these systems. Often they don't understand these systems internally themselves, and even if they do, they fail to articulate the value to other business stakeholders and to get proper funding to move forward with a data governance strategy. It's often very hard, because most organizations are only able to justify things when it comes to data security.
But what they don't realize is that all of the other capabilities and features of a good data governance framework also define how quickly they can build and how quickly they can mature as an organization. It's a very abstract thought in my opinion, and not a lot of people have that aha moment where they realize this is something that will help them go a long way. And lastly, most organizations are looking for a one-stop solution to this, which in my opinion is not going to happen. The other issue that I've seen is, of course, making it all about compliance: letting all of the governance be driven purely from a security perspective, thinking just about GDPR and security, and not focusing on the other aspects of data governance that I spoke about, such as data quality and the things that make it easier to reason about the data in the organization. Also, when they start out, because the tooling is not very mature, there is a large number of tools that they have to deal with. They often end up spreading the security mechanisms across a multitude of tools and not having a single holistic view of security and auditability. Then there is getting a vendor to fix your compliance problems, which is the most common thing I've seen. While it's okay to bring in someone who has the expertise, what organizations fail to understand is that nobody is going to be able to give them a customized solution unless they are also involved in building the data governance strategy. You need to work together with your vendor, not expect them to do the job on their own and have something good come out of it. And lastly, when it comes to the implementation and tech challenges, we've already spoken a lot about them.
But one of the other key issues that I see is in onboarding the different teams and helping them understand why data governance is important. A lot of times people just mandate it. That makes these teams lose interest and creates a lot of friction, and ultimately the business decides not to go ahead with a governance strategy. There is also a lack of skill maturity in the industry when it comes to data governance frameworks. A lot of people are starting to talk more about it, but I think it still has a long way to go before it matures. So given all of these challenges and issues, what is a good way to go about building a data governance strategy? If you ask me, there is no one perfect data governance strategy. What is important to understand is that you won't achieve good data governance from day one, or within a year or two, which is often the target in these large organizations. They will put out a certain budget, expect you to get a data governance strategy done within that budget, and then call it a day. But that's not how data governance works. It is a journey, and you should build it out incrementally, both to avoid as much business loss as possible while you build the tooling and to better understand what the right strategy is for you. On day one, you're not going to be able to understand the right strategy. You may have some requirements in place, but from what I've seen those requirements often don't hold true once you're six or eight months into building frameworks like these. So in my opinion, a good way to start is to put a very lean framework together in the first few phases of your data governance strategy.
So what I would recommend is to have a discovery phase, which is the very initial phase when you start out, and what you should aim for there is to introduce governance as a by-product of tooling. You should build tools and give them to teams, and as a by-product of using those tools you should get the metadata that you need. For example, if you're using Spark, then maybe try to build a Spark connector that helps the different teams write data to whatever location they want to write to, but that also registers the data in whatever catalog you're using. So instead of mandating, what you need to do in the first phase is not get in the way of the different teams trying to achieve their business goals, but give them a very easy way of integrating with whatever solution you're trying to build. From what I've seen, it is also useful to follow a pull-based model and just discover these data sets. If, for example, you know the various S3 buckets where people put their data, you can crawl these S3 buckets every day and collect the metadata of all the data sets. That will give you a good understanding of your data landscape before you try to build the next level of systems. Once you have a sense of your data landscape, it is often useful to start introducing compliance and audit. Typically you should have a separate data governance organization that tackles the problem of understanding the basic minimum criteria for security, security audit, data quality and similar things that should be enforced across the organization. Have them formulate these rules, and while they're formulating them, you continue to build your governance framework. Ideally, a central catalog is what works best in this phase.
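The pull-based discovery mentioned above (crawling known storage locations every day and collecting metadata) can be sketched roughly as follows. This is only an illustration under loud assumptions: it walks a local directory as a stand-in for an S3 bucket so that it stays self-contained, it treats each top-level folder as one data set, and the `DatasetRecord` fields are hypothetical. A real crawler would list objects through the storage provider's API and write the records to whatever catalog you use.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class DatasetRecord:
    """Hypothetical metadata record that a crawler would push to a catalog."""
    location: str
    file_count: int
    total_bytes: int
    formats: tuple  # file extensions seen under this data set


def crawl(root):
    """Discover data sets under `root` (a local stand-in for a bucket).

    Assumption: every top-level directory is one data set.
    """
    records = []
    dataset_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for dataset_dir in dataset_dirs:
        files = [f for f in dataset_dir.rglob("*") if f.is_file()]
        formats = tuple(sorted({f.suffix.lstrip(".") or "unknown" for f in files}))
        records.append(
            DatasetRecord(
                location=str(dataset_dir),
                file_count=len(files),
                total_bytes=sum(f.stat().st_size for f in files),
                formats=formats,
            )
        )
    return records
```

Running something like this on a schedule gives a first inventory of what exists and where, without asking any team to change how they work, which is the point of the discovery phase.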
And it is in this phase, in my opinion, that you should start moving towards a push-based approach, where the producers of data curate all of this information into your governance framework. The people who are producing the data should be pushing things like the data quality of the data sets they're curating and the sensitivity levels of those data sets, and you should start mandating this across the organization. As an outcome, you would expect some enforced basic governance and security to be in place. Additionally, if you can, it is useful to build an automated audit system, something that runs in a CD pipeline, for example, where you get automated reports on your compliance, security and infrastructure in a single location. Once you're done with these stepping stones, that is when you should focus on the last and final phase of building this data catalog, and this is a never-ending phase. Once you have a good understanding of your landscape, and once you've identified the policies that should apply and the minimum security level, you should start federating this responsibility out to different teams. The compliance and audit team should ideally spread out and become part of the different distributed data teams, which are responsible for their own data. So there's a data steward and a data governance individual in each of these teams, driving these requirements at a team level and not as a central enforcement body.
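As a rough sketch of the automated audit idea above, the following Python checks a set of catalog entries against a minimal rule set. Everything here is an assumption for illustration: the catalog is modelled as a plain dictionary, the required fields (`owner`, `sensitivity`, `quality_score`) and the `0.8` quality threshold are made-up examples of the "basic minimum criteria" a governance group might define.

```python
# Hypothetical minimum metadata every producer must push for a data set.
REQUIRED_FIELDS = ("owner", "sensitivity", "quality_score")


def audit(catalog, min_quality=0.8):
    """Return a list of (dataset_name, problem) findings.

    `catalog` maps a data set name to a dict of metadata pushed by its producer.
    An empty list means every data set meets the minimum criteria.
    """
    findings = []
    for name, meta in sorted(catalog.items()):
        # Rule 1: all required metadata fields must be present and non-empty.
        for field_name in REQUIRED_FIELDS:
            if meta.get(field_name) in (None, ""):
                findings.append((name, f"missing {field_name}"))
        # Rule 2: a reported quality score must meet the minimum threshold.
        score = meta.get("quality_score")
        if score is not None and score < min_quality:
            findings.append((name, f"quality_score {score} below {min_quality}"))
    return findings
```

In a CD pipeline, a step could call `audit(...)` on the current catalog contents, publish the findings as a report, and fail the build when the list is non-empty, so the check is repeated on every run rather than done by hand.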
Why this is important is that when you start scaling beyond two or three different teams, it becomes very hard to reason about the different tech stacks that these teams are using, to understand how different teams are using these data sets, and to understand the context of the data, like the domain of the data itself. That is where it's useful for these individuals to be working directly with the teams rather than chalking things out as a separate entity altogether. And ideally, if possible, what we should also start doing in this phase is distributing the catalog itself. You still have a central catalog, but you distribute the data ingestion and curation ability to multiple teams, and their catalogs curate the data further into the central catalog. In my opinion, that is the ideal state given the current complexities in the data world. Now in terms of tooling, the tooling is not necessarily mature when it comes to achieving some of these ideas. However, there are open source frameworks that have come into play in the last one or two years. Egeria is a really good example of one of these frameworks, where people have started talking about this and realizing it is a key idea for enabling really quick growth within organizations when it comes to data and teams. Right. So this was it. Thanks a lot for attending the talk. I'll leave you with some of the useful links that I used to put together the material in this talk, where you can read up more about these tools and about the open source frameworks that are coming up.
Thank you, Athef. That was really great. Do you have any updates you would like to share with us on data governance strategies, any comments?
Thanks, Anvesha. So yeah, definitely.
Ever since the talk happened for the first time, one of the things that I've heard a lot about is these aspects of data governance and federation. These ideas have caught on in the last one to one and a half years. Personally, I see a lot of people trying to do it but failing, for the reasons that we spoke about. There has been some progress in terms of tooling, and a lot of frameworks have come up. The challenge still remains, however, that they're very specific to a platform. For example, Databricks has come out with Delta Sharing, and AWS also has Lakehouse, which is really good if all of your tooling is on AWS or all of your tooling is on Databricks. However, what I've generally seen is that people use a mix of clouds, tooling and technologies, and all of the current tooling fails to enable that shared, global application of standards and policies. So that's been an interesting bit of progress in this space. The other thing that has caught on in the last few months is the idea of a data mesh, which I've invited my colleagues Samedha and Vanya to discuss alongside me. Essentially, it is the idea of how you bake governance into your tooling and how you enable teams to be self-service. How do you enable them to federate and be independent, and not rely on a central body that approves all of the data and everything they do with that data? So those are some of the updates that have happened in this space.