The first thing I want to do is introduce this series of data governance meetups: why we thought of starting it and what we hope to achieve. The main motivation behind the meetup is to create a community, a forum, where we can talk about the technical aspects and technical challenges of data governance. I have a long history of working on open source big data projects like Hive, Spark and Presto, and one of the nice things about these projects is that there's a very strong community, both online and local in places like the Bay Area or Bangalore, that we can lean on to learn how to get started as well as to talk about advanced topics.

A couple of years ago, I started working on data governance projects, both in my previous company and as a freelancer, and I helped a bunch of teams get started on data governance from zero, when they had no processes at all. One of the issues was that we couldn't find a community or a library that we could lean on to learn the best practices in data governance. You do hear about a lot of successful open source projects, like DataHub from LinkedIn or Amundsen from Lyft, but these are the final state of successful projects. What is hard to find is information about the journey: the missteps, the failures and the challenges they faced while they built these systems up. That kind of information is really useful for someone else who is starting off on the journey. So we wanted to create a forum and a community where we could ask these questions of people who are much further ahead of us on the journey.

A couple of steps that I took, and let me share my screen as well. To see whether a community is needed, whether there are other people like me looking for information on data governance technology, I open sourced a bunch of projects that I had built as part of my freelancing gigs and hosted them under Tokern; they're also on GitHub. And I started a newsletter where I put in my research on data governance: what are the options when you want to build a data catalog, what are the options when you want to generate data lineage, and so on. As I did this research, I posted the topics in the newsletter that you see on the screen. The feedback from these two activities gave pretty strong signals that there are other people like me who want to learn more about the technology behind data governance.

The next logical step was to get in touch with Zainab and Hasgeek to see if we could build a much bigger community than what I could do with the open source projects and the newsletter. After a bunch of brainstorming, and after talking to some of the experts in data governance in Bangalore, we decided to have this series of meetups, and hopefully we'll have a few more initiatives to foster the community. There's obviously a lot of selfish interest here: the goal of this community is to bring in people who are experts, who have more experience than I do, so I can ask them questions. Hopefully it is useful to a bigger crowd as well.

So in this first meetup, I will spend about 20 minutes just talking through the experience that I went through. I started off from scratch: even though I have a background in data, I didn't pay much attention to data governance until a couple of years ago, so I had to learn about this field from scratch.
What I'll do is spend about 20 minutes talking about the journey that I went through; it's kind of semi-autobiographical. Then in the second part, we'll have a bunch of further sessions where we discuss a few topics with people who know a lot more than I do. So let's get started.

In the first half of this meetup, in this talk, I will cover what I learned data governance is, why data governance is hard, or why I found it to be hard, and examples of automation to accomplish data governance tasks. Then we'll summarize and move on to the next session.

So what is data governance, and why is it hard? Data governance means different things to different people. There are a bunch of definitions out there, and it's very hard to come to an agreement. The way I found it easiest to understand data governance is to define the outcomes that data governance tasks should deliver. In my case, in the projects that I worked on, the outcomes of data governance were compliance: you want to understand the life cycle of data and whether its usage is in accordance with laws and regulations. Then privacy: data governance tasks should help protect data as per regulations and user expectations. And finally security: data governance tasks should enhance security, or prove that the security of data and data infrastructure is adequate. So data governance is pretty much any task, capturing and managing the metadata of data, users, workflows, or any other information, that you require to achieve the outcomes you want in compliance, privacy and security. With this definition, we could make some reasonable progress in the projects that we worked on.

Why is data governance hard? Let me try to explain this with a story. There are paparazzi everywhere in New York City, and it's very common for them to go after celebrities. Since taxis are a common mode of transportation, you see photographs of celebrities with taxis, and the taxis' medallion numbers and registration numbers are visible in these photos. If you search for photos of celebrities in New York City, you'll find a pretty huge data set going back many years. Many of these photos, taken with the latest camera phones and DSLRs, are also geo-tagged and timestamped.

Completely independently, the Taxi Commission of New York City released a bunch of reports on taxi usage in the city. A researcher got curious about the data set and used the Freedom of Information Act in the U.S. to get the raw data that the Taxi Commission had. The Taxi Commission anonymized the data, or thought they had anonymized the data, and released a whole year's worth of records, and they continued to do this. If you search for NYC taxicab data, you will find data sets for every single year, all the way to, I think, 2019. Researchers used this data set to study all kinds of transportation details: how people get from one point to another, the popularity of taxis, which times are popular, and so on.
However, someone figured out that they could take these two data sets, the first being the geo-tagged and timestamped photos available on media websites and the second being the taxi usage data set, combine them, and drill down to the records of the taxi rides that celebrities took. They could figure out the tipping habits of these celebrities: how much they tipped based on the fare. This is an example of an insight that no one expected, at least not the original owners of the data sets, the people who generated them. The media and the paparazzi didn't expect that taking photographs would end up exposing the tipping habits of celebrities, and neither did the Taxi Commission or the researcher who initially requested the data set.

This is an example of a link attack, and there are multiple data governance failures here. First of all, there's too much data, and there's the ability to link data sets that weren't supposed to be linked together. There was information that people shouldn't have had, so the access policies weren't good. And there was no way to monitor how these data sets were being used and for what kind of insights, or whether they were being used for the purpose of the original request. So let's get into a little more detail about all these problems and see how we can get started on solving some of the issues that we saw in the story.

The first real problem is that there's too much data. On this slide you see an example of the data sharing agreements that a company like PayPal has. PayPal exchanges data with a bunch of entities, and a couple of them are pretty interesting. If you look down the list, marketers, publicists and operational service providers are an interesting set of companies that have access to PayPal transactions. You wonder why someone like Amazon Web Services, not even Amazon the e-commerce company, but Amazon Web Services, has access to the transactions that you made on PayPal. If you have a lot of data and the ability to join different data sets, you don't know what kind of insights are in there. And this trend has continued to grow: more and more data is being generated and more and more data is being shared, not just between us as private citizens and companies, but even among companies, and you don't know what kind of insights can be gained from it, or whether they're harmful or not.

The second problem is complexity. The trend in the industry is to have a product for every single niche. In the old days there were a couple of database companies, like Oracle or SQL Server, which satisfied most of your infrastructure needs, and you had a couple of instances that you could protect, both the infrastructure and the data. Now you have many, many different infrastructure pieces that you put together, and each of them solves a niche. If you just look at the data infrastructure industry, there are approximately 1,500 data technologies, both commercial and open source. I did a quick survey among the companies that I know and work with.
There are typically about 8 to 10 different data infrastructure components that they use, from production databases like MySQL and Postgres, to cloud storage like S3, maybe a data warehouse, and data lakes or Hadoop components. You have to protect every single one of these components to the same level, and that's pretty hard, because different projects and different commercial products have different capabilities when it comes to compliance, privacy and security. A similar trend exists when you look at marketing technology or sales technology companies: these are, in the end, data processing and data storage technologies underneath. So with increasing complexity, security is a huge problem, and if you don't have enough security, then you are compromising on privacy and compliance.

The third problem is that there's no context for data usage. Fundamentally, analytics, data science and AI have competing objectives when it comes to compliance, privacy and security. For analytics, data science and AI, you want as much access to data as possible so that there's serendipity: people try out different things, find different insights, and that can drive the business. On the other hand, the best compliance, privacy and security is to not give access to anyone. However, that's not practical. Both extremes, giving complete access to data or giving no access at all, are impractical. You need a much more nuanced approach to providing access, as well as to monitoring and auditing usage of this data. However, there's not enough information. The information you need is who is using the data, for what purpose, and when. You need this trail of information to make intelligent decisions about whether the data is being used appropriately or not.

So these are the three problems that I saw when I was helping companies meet their goals in security, compliance and privacy. Next, I will go through a few examples of how we went about solving this. Just to set context, these were companies that didn't have any data governance or any data security, compliance and privacy processes in place. For various reasons, either regulations or security breaches and so on, they were compelled to add these controls. So we pretty much started from zero, and we started by trying to answer basic questions: questions that we could answer, and questions for which we could automate getting the answers. Instead of doing manual tasks to control access, how could we automate? The three questions that helped me and the teams I worked with get started on data governance are: where is my data? Who has access to my data? And how is the data being used?

But stepping back a bit, you probably don't really care about all the data that you have. You probably care about the sensitive data, and the definition of sensitive differs between industries and companies. Financial data is sensitive. PII data is sensitive. If you're a health industry related company, then you might be storing information about the health of your customers, and that is sensitive, and so on. So what you really care about is: where is my sensitive data?
Who has access to the sensitive data, and how is the sensitive data being used? To be able to understand where sensitive data is, you need three capabilities. You need a data catalog where you can store metadata about your data sets: what the tables are, what columns the tables have, what the data types are, what kind of data they store, and so on. You need the ability to scan your databases and recognize sensitive data, so that you can tag the data catalog and say, hey, this is a sensitive data set and this is not, and focus your attention on the sensitive data sets. As I said on the previous slide, the definition of sensitivity changes from company to company and industry to industry. There are some common definitions, for example PII and financial data, but not for others, so you do find scanners for common patterns online, but you also have to build your own scanners for sensitive data that is specific to you.

Scanning every data set is not practical, because you typically have huge data sets and you don't want to scan terabytes and terabytes of data. So you also need the capability to capture data lineage in order to tag every data set in your catalog. The practical way for this to work is that you scan your base data sets, the data sets that you generated or got from third parties, and find out where the sensitive data is. Then you use data lineage to track how the data moves through derived data sets and through your data infrastructure, from your production database to your data warehouse to, let's say, data lakes on S3. Once you have this data lineage, you can automate the tagging of sensitive data in your derived data sets.

To give an example, I built an open source project called PIICatcher. PIICatcher can scan data in databases like MySQL, or even data in AWS S3 accessible through Presto. It looks at column names and a few rows to figure out which of the columns have PII data, and then it stores the result in the catalog. The example on the left is a data catalog with schemas, tables and columns, and one of the columns records whether it has PII or not. The example on the right is from a Glue catalog; AWS Glue is a data catalog that AWS provides as a service. You can scan data in S3 and tag tables and columns within the Glue catalog to specify which of them are PII columns. In the example on the right, the zone column has PII data, as does the supply address.

So the first step is to create a catalog and scan your data sets to tag them. The next step is to use data lineage. Again, to give an example, I built a simple data lineage library in Python, which looks at query history, at statements that copy data from one table to another, and creates a graph of how the data flows from one table to another. You can visualize this graph and notice patterns, or, since the graph sits in memory and is built using standard graph libraries, you have access to graph algorithms, so you can automate searching for patterns as well as tagging columns in derived data sets as PII. So you have a whole bunch of options. Below are small sketches of both the scanning and the lineage ideas.
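First, a minimal sketch of the scanning idea: look at column names and a few sample values and flag columns that look like PII. This is not PIICatcher's actual API, just an illustration of the kind of check a scanner performs; the regexes, column names and sample data are hypothetical, and a real scanner has many more detectors and writes its findings back into the catalog.

```python
import re

# Hypothetical detectors: column-name hints and value patterns for common PII.
NAME_HINTS = re.compile(r"email|phone|ssn|address|name", re.IGNORECASE)
VALUE_PATTERNS = [
    re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),   # email-like values
    re.compile(r"^\+?[\d\-\s()]{7,15}$"),        # phone-like values
]

def looks_like_pii(column_name, sample_values):
    """Flag a column as PII if its name or a sample of its values matches a detector."""
    if NAME_HINTS.search(column_name):
        return True
    return any(p.match(str(v)) for v in sample_values for p in VALUE_PATTERNS)

# A few sample rows pulled from the database (hypothetical table and columns).
samples = {
    "user_email": ["alice@example.com", "bob@example.com"],
    "page_views": [12, 873],
}
pii_columns = [col for col, rows in samples.items() if looks_like_pii(col, rows)]
print(pii_columns)  # ['user_email'] -> tag these columns as PII in the catalog
```

In practice you would pull the sample rows with a LIMIT query per table and record the resulting tags in the catalog alongside the schema metadata.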
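Next, a rough sketch of the lineage idea: parse the statements in query history that copy data from one table to another, build a directed graph, and propagate PII tags from base tables to derived tables. The regex here only understands simple INSERT INTO ... SELECT ... FROM statements, and the table names are made up; the actual data lineage library handles far more cases, but the propagation step is the same idea.

```python
import re
import networkx as nx

# Hypothetical query history: statements that copy data between tables.
query_history = [
    "INSERT INTO analytics.users_daily SELECT * FROM prod.users",
    "INSERT INTO reports.signups SELECT id, country FROM analytics.users_daily",
]

# Base tables already tagged as PII by the scanner.
pii_tables = {"prod.users"}

graph = nx.DiGraph()
pattern = re.compile(r"INSERT INTO\s+(\S+)\s+SELECT .* FROM\s+(\S+)", re.IGNORECASE)
for query in query_history:
    match = pattern.search(query)
    if match:
        target, source = match.group(1), match.group(2)
        graph.add_edge(source, target)  # data flows from source to target

# Propagate tags: every table reachable from a PII table is also treated as PII.
derived_pii = set()
for table in pii_tables:
    if table in graph:
        derived_pii |= nx.descendants(graph, table)
print(derived_pii)  # {'analytics.users_daily', 'reports.signups'}
```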
Once you've built this graph in memory, you can visualize it, or you can use some kind of automation to go and tag the rest of the columns in your data sets. Those are a couple of examples of how you can answer the question: where is my sensitive data? With a combination of scanning and data lineage, you should have a pretty reasonable idea of where all the sensitive data is and which tables and columns you need to pay attention to.

The next question is: who has access to my sensitive data? Once you have a data catalog with all the PII data tagged, the next step is to find out who has access to it. Most data warehouses and databases have a table which lists out the privileges, the access control, that all the users of the database have. You can see that in the image tagged with number two: this is from the AWS Glue catalog, where there's a table called table_privileges, which tells you which principal, which user, has access to which tables. The data catalog in image one tells you which tables and columns have PII data. If you join these two tables, you get a list of users, or principals, who have access to sensitive data. So there's a very simple way to get a list of people who have access to sensitive data, and you can audit it regularly. Once you have this list, you have enough information to decide whether the list is okay, whether it is too big, or whether important people are missing from it.

The third question is: how is my sensitive data being used? You know who has access to it, but how are they using it? For this, you have to log usage across all databases; you need to know the query history, the workloads that are running on your databases and your data lakes. Most databases and data warehouses store query history in what they call the information schema. Even big data technologies like Presto and Spark have hooks where you can capture the queries, the workloads and the Spark code that runs on them. You can log and store this usage history somewhere, and once you have it, you can start looking for patterns to see if there is any misuse of data.

Production databases are different. You typically don't want to capture query history on production databases like MySQL or Postgres, because there's so much CPU and IO going on that if you try to capture query history as well, you will affect the performance of these databases. So what you typically want is to put a proxy in front and give access to your operations team through that. I'll show a couple of examples. The first one is capturing query history in Snowflake. There is an open source project called snowflake_spend by GitLab, which you can run through dbt; the package is available on dbt Hub. You can download dbt and this package, there are pretty straightforward instructions to get going, and then you can schedule it through dbt to copy the query history from the information schema in Snowflake into a table that you choose. Then once in a while, monthly, weekly, whatever is required, you can go and look at the query text for specific patterns to make sure that there's no misuse of the data. Below are small sketches of the access audit and this kind of query review.
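First, a sketch of the access audit join, assuming a hypothetical catalog table called data_catalog that carries the PII tags, and the standard information_schema.table_privileges view that most warehouses expose. The exact table and column names depend on the catalog and warehouse you use.

```python
# A sketch of the audit query: which principals can read columns tagged as PII?
# `data_catalog` and its columns are hypothetical; information_schema.table_privileges
# is the standard privileges view in most SQL warehouses.
AUDIT_QUERY = """
SELECT DISTINCT p.grantee        AS principal,
                c.table_schema,
                c.table_name,
                c.column_name
FROM data_catalog AS c
JOIN information_schema.table_privileges AS p
  ON  c.table_schema = p.table_schema
  AND c.table_name   = p.table_name
WHERE c.is_pii = TRUE
ORDER BY principal, c.table_name
"""

# Run it with whatever DB-API connection your warehouse provides, for example:
# rows = connection.cursor().execute(AUDIT_QUERY).fetchall()
# Review the result regularly: is this list of principals still appropriate?
```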
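Next, a sketch of the query review step. It assumes the query history has already been copied into rows of user and query text, whether by the snowflake_spend package or by a proxy log, and it simply flags full-table reads of PII-tagged tables. The table names, pattern and history rows are hypothetical, and a real review would cover many more query shapes, such as unloads and dumps.

```python
import re

# Tables already tagged as sensitive in the data catalog (hypothetical names).
PII_TABLES = {"prod.users", "analytics.users_daily"}

# A full-table read usually means bulk export rather than normal analytics.
FULL_TABLE_READ = re.compile(r"SELECT\s+\*\s+FROM\s+([\w.]+)", re.IGNORECASE)

def flag_bulk_read(query_text):
    """Return the table name if the query looks like a full read of a sensitive table."""
    match = FULL_TABLE_READ.search(query_text)
    if match and match.group(1) in PII_TABLES:
        return match.group(1)
    return None

# Hypothetical rows copied out of the warehouse's query history or a proxy log.
history = [
    ("analyst_1", "SELECT country, count(*) FROM prod.users GROUP BY 1"),
    ("support_2", "SELECT * FROM prod.users"),
]
for user, query in history:
    table = flag_bulk_read(query)
    if table:
        print(f"review: {user} ran a full read of {table}")
```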
Another example is capturing query history from something like MySQL. There are very popular open source proxies, like ProxySQL for MySQL or Pgpool for Postgres, which you can install as a bastion in front of your actual MySQL instance. All your application workload goes to your production database directly, but any access that your operations or customer support team needs can go through this proxy, and whatever activity happens through the proxy can be logged; I show examples of the logs in the bottom half of the slide. Again, once you have all this information, you can start looking for patterns: who accessed certain tables, and what kind of queries did they run? Did they try to do a dump of a table, or were they only running aggregates? Typically running aggregates is fine. Doing a specific lookup is probably okay if it is associated with customer support, but a select star or an unload is probably not expected behavior.

To summarize, data compliance, privacy and security is a journey. At least if you don't have any systems in place, you have to start with simple questions, and the three simple questions that I found useful are: where is the sensitive data, who has access to the sensitive data, and how is the sensitive data being used? You have to think about automation right from the beginning; there is no practical solution without automation. And here are a bunch of resources about the projects that I spoke about in this talk.