So let's get started with the talk. I will cover what I learned about data governance, why I found data governance to be hard, and examples of automation to accomplish data governance tasks. So, what is data governance, and why is it hard? Data governance means different things to different people; there are many definitions out there, and it's pretty hard to come to an agreement. The way I found it easiest to understand data governance is to define the outcomes that data governance tasks should deliver. In my case, and in the projects I worked on, the outcomes of data governance were compliance, privacy and security. Compliance: you want to understand the life cycle of data and whether its usage is in accordance with laws and regulations. Privacy: do data governance tasks help protect data as per regulations and user expectations? And security: do data governance tasks enhance security, or prove that the security of data and data infrastructure is adequate? So data governance is pretty much any task of capturing and managing the metadata of data, of users, of workflows, of any other information that you require to achieve the outcomes you want in the areas of compliance, privacy and security. With this definition, we could make reasonable progress on the projects we worked on. So why is data governance hard? Let me try to explain this with a story. As you know, there are paparazzi everywhere, including in New York City, and it's very common for them to go after the celebrities there. Since taxis are a common mode of transportation, you see photographs of celebrities with taxis.
And taxis have their medallion numbers and registration numbers visible in these photos. If you search for photos of celebrities in New York City, you'll find a complete and pretty huge data set going back many years. And many of these photos, taken with the latest camera phones and DSLRs, are also geo-tagged and have a timestamp. Completely independently, the Taxi Commission of New York City released a bunch of reports on taxi usage in the city. A researcher got curious about the underlying data and used the Freedom of Information Act in the US to get the raw data that the Taxi Commission of New York City had. So the Taxi Commission anonymized the data, or thought they anonymized it, and released a whole year's worth of records. And they continued to do this: if you search for NYC taxicab data sets, you will find one for every single year all the way up to, I think, 2019. Researchers used these data sets to study all kinds of transportation details: how people get from one point to another, the popularity of taxis, at which times they are popular in New York City, and so on. However, someone figured out that they could take these two data sets, the geo-tagged and timestamped celebrity photos available on media websites and the taxi usage data set, join them together, and drill down to the records of the taxi rides that celebrities took. And they could figure out the tipping habits of these celebrities: how much they tip relative to the fare. This is an example of an insight that no one, at least the original owners of the data sets, the people who generated them, expected.
So the paparazzi didn't expect that their photographs would end up exposing the tipping habits of celebrities, and neither did the Taxi Commission or the researcher who initially requested the data set. This is an example of a linkage attack, and there are multiple data governance failures here. First of all, there's too much data, and the ability to link data sets that weren't supposed to be linked together. There was information that people shouldn't have had, so access policies weren't good. And there was no way to monitor how these data sets were being used, for what kind of insights, and whether they were being used only for the original request. So let's get into a little more detail about these problems and see how we can get started on solving some of the issues in the story. The first real problem is that there's too much data. On this slide you see an example of the data sharing agreements that a company like PayPal has. PayPal exchanges data with a bunch of entities, and a couple of them are pretty interesting. If you look down the list, marketers, publicists and operational service providers are an interesting set of companies that have access to PayPal transactions. You wonder why someone like Amazon Web Services, not even Amazon the e-commerce company, but Amazon Web Services, has access to the transactions you made on PayPal. So if you have a lot of data and the ability to join different data sets, you don't know what kind of insights are in there. And this trend has continued to grow.
There's more and more data being generated and more and more data being shared, not just by us as private citizens with companies, but even among companies, and you don't know what kind of insights can be mined from it and whether they're harmful or not. The second problem is complexity. The trend in the industry is to have a product for every single niche. In the old days there were a couple of database companies, like Oracle or SQL Server, which satisfied most of your data infrastructure needs, and with just a couple of instances you could protect the infrastructure as well as the data. Now you have many, many different infrastructure pieces that you put together, each solving a niche. If you look at the data infrastructure industry, there are approximately 1,500 data technologies, both commercial and open source. I did a quick survey among the companies that I know and work with, and there are typically about eight to ten different data infrastructure components in use, right from production databases like MySQL and Postgres, to cloud storage like S3, maybe a data warehouse, and data lake or Hadoop components. You have to protect every single one of these components to the same level, and that's pretty hard, because different open source projects and commercial products have different capabilities when it comes to compliance, privacy and security. A similar trend exists when you look at marketing technology or sales technology companies; these are, in the end, data processing and data storage technologies underneath. So with increasing complexity, security is a huge problem, and if you don't have enough security, then you are compromising on privacy and compliance. The third problem is that there's no context for data usage.
Fundamentally, analytics, data science and AI have objectives that compete with compliance, privacy and security. For analytics, data science and AI, you want as much access to data as possible so that there's serendipity: people try out different things, find different insights, and that can drive the business. On the other hand, the best compliance, privacy and security is to not give access to anyone. However, neither extreme, complete access to data or no access at all, is practical. You need a much more nuanced approach to providing access, as well as monitoring and auditing usage of this data. However, there's not enough information to do that. The information you need is who is using the data, for what purpose, and when. You need this trail of information to make intelligent decisions about whether the data is being used appropriately or not. So these are the three problems that I saw when I was helping companies meet their goals in security, compliance and privacy. Next, I will go through a few examples of how we went about solving this. Just to set context, these were companies that didn't have any data governance, or any data security, compliance and privacy processes in place. For various reasons, either regulations or security breaches and so on, they were compelled to add these controls. So we pretty much started from zero, and we started by trying to answer basic questions: questions that we could answer, and for which we could automate getting the answers. So instead of doing manual tasks to control access, how could we automate them?
So the three questions that helped me and the teams I worked with get started on data governance are: where is my data? Who has access to my data? And how is the data being used? But stepping back a bit, you probably don't really care about all the data that you own; you care about the sensitive data, and the definition of sensitive differs between industries and companies. Financial data is sensitive. PII is sensitive. If you're a health-industry-related company, then you might be storing information about the health of your customers, and that is sensitive, and so on. So what you really care about is: where is my sensitive data? Who has access to sensitive data? And how is the sensitive data being used? To understand where sensitive data is, you need three capabilities. You need a data catalog where you can store metadata about your data sets: what are the tables, what columns do the tables have, what are the data types, what kind of data do they store, and so on. You need the ability to scan your databases and recognize sensitive data, so that you can tag the data sets in your data catalog and say, hey, this is a sensitive data set and this is not, and focus your attention on the sensitive ones. And as I said on the previous slide, the definition of sensitivity changes from company to company and industry to industry. There are obviously some common definitions, for example PII and financial data, but not for others. So you do find scanners for common patterns online, but you also have to build your own scanners for sensitive data that is specific to you.
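As a sketch of what such a custom scanner might look like, here is a minimal column scanner that checks column names first and falls back to sampling a few values. The rule names, regexes, sample data and the 80% threshold are illustrative assumptions, not a standard rule set:

```python
import re

# Hypothetical rules: column-name patterns and value patterns for PII.
# Real scanners ship much richer rule sets; these are assumptions for
# illustration only.
NAME_RULES = {
    "email": re.compile(r"e[-_]?mail", re.I),
    "phone": re.compile(r"phone|mobile", re.I),
    "address": re.compile(r"addr|zip|postal", re.I),
}
VALUE_RULES = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?[\d\-\s]{7,15}$"),
}

def scan_column(name, sample_values, threshold=0.8):
    """Return a PII type for a column, or None if nothing matches."""
    # Column names are the cheapest signal, so check them first.
    for pii_type, pattern in NAME_RULES.items():
        if pattern.search(name):
            return pii_type
    # Fall back to matching value patterns against a few sampled rows.
    for pii_type, pattern in VALUE_RULES.items():
        if sample_values:
            hits = sum(1 for v in sample_values if pattern.match(str(v)))
            if hits / len(sample_values) >= threshold:
                return pii_type
    return None

# Tag the columns of a (made-up) table; the result would be written
# back to the data catalog as PII tags.
columns = {
    "user_email": ["a@example.com", "b@example.com"],
    "contact_no": ["+1 555 0100", "+1 555 0101"],
    "notes": ["hello", "world"],
}
tags = {col: scan_column(col, vals) for col, vals in columns.items()}
print(tags)
```

The two-stage design mirrors what the talk describes: names catch most cases cheaply, and value sampling catches sensitive data hiding behind uninformative column names.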
Just scanning every data set is not practical, because typically you have huge data sets and you don't want to scan terabytes and terabytes of data. So you need the capability to capture data lineage in order to tag every data set in your catalog. The practical way for this to work is that you scan your base data sets, the data sets that you generated or got from third parties, and find out where the sensitive data is. Then you use data lineage to track how the data moves into derived data sets and through your data infrastructure, from your production database to your data warehouse to, say, your data lake in S3. Once you have the data lineage, you can automate the tagging of sensitive data in your derived data sets. To give an example, I built an open source project called PIICatcher. PIICatcher can scan data in databases like MySQL, and even data in AWS S3 accessible through Presto. It looks at column names and a few rows to figure out which of the columns have PII, and then stores that in the catalog. The example on the left is a data catalog with schemata, tables and columns, where one of the attributes records whether a column has PII or not. The example on the right is from a Glue catalog: AWS Glue is a data catalog as a service that AWS provides, and you can scan data in S3 and tag tables and columns in the Glue catalog to specify which of them have PII columns. For example, on the right, the zone column has PII of type address. So the first step is to create a catalog and scan your data sets to tag them. The next step is to use data lineage.
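The scan-then-propagate workflow just described can be sketched in a few lines: parse copy-style statements out of query history, build a table-to-table graph, and walk it from the scanned base tables. The statement format, table names and breadth-first propagation here are illustrative assumptions:

```python
import re
from collections import defaultdict, deque

# Toy query history; real lineage tools parse far more SQL shapes.
query_history = [
    "INSERT INTO staging.users SELECT * FROM prod.users",
    "INSERT INTO warehouse.user_facts SELECT id, email FROM staging.users",
    "INSERT INTO warehouse.daily_counts SELECT day, n FROM prod.events",
]

edge_pattern = re.compile(r"INSERT INTO (\S+) SELECT .* FROM (\S+)", re.I)

# Build the lineage graph: source table -> derived tables.
downstream = defaultdict(list)
for q in query_history:
    m = edge_pattern.search(q)
    if m:
        target, source = m.group(1), m.group(2)
        downstream[source].append(target)

def propagate(tagged_base_tables):
    """BFS from scanned base tables; every reachable table inherits the tag."""
    tagged = set(tagged_base_tables)
    queue = deque(tagged_base_tables)
    while queue:
        table = queue.popleft()
        for derived in downstream[table]:
            if derived not in tagged:
                tagged.add(derived)
                queue.append(derived)
    return tagged

# Suppose only prod.users was scanned and found to contain PII:
print(sorted(propagate({"prod.users"})))
```

Because only the base tables are scanned, the expensive row-level work happens once, and derived tables pick up their sensitivity tags from the graph.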
Again, to give an example, I built a simple data lineage library in Python which looks at query history, looks at statements that copy data from one table to another, and creates a graph of how the data flows from table to table. You can use this graph to visualize the flow and notice patterns, or, since the graph sits in memory and is built using a graphing library, you have access to graph algorithms and can automate searching for patterns as well as tagging columns in derived data sets as PII. So you have a whole bunch of options: once you've built this graph in memory, you can visualize it, or you can use some kind of automation to go and tag the rest of the columns in your data sets. These are a couple of examples of how you can answer: where is my sensitive data? With the combination of scanning and data lineage, you should have a pretty reasonable idea of where all the sensitive data is and which tables and columns you need to pay attention to. The next question is: who has access to my sensitive data? Once you have a data catalog with all the PII tagged, the next step is to find out who has access to it. Most data warehouses and databases have a table which lists the privileges, or the access controls, that all users of the database have. You can see that in the image tagged with number two: it is from the AWS Glue catalog, where there's a table called table_privileges which tells you which principal or user has access to which tables, while the data catalog in image one tells you which tables and columns have PII. If you join these two tables, you get a list of users or principals who have access to sensitive data. This is a very simple way to get a list of people who have access to sensitive data, and you can audit it regularly.
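A toy version of that join, using an in-memory SQLite database: the catalog and privileges schemas below are simplified assumptions, not the actual Glue or information_schema layouts, but the join itself is the same idea.

```python
import sqlite3

# Simplified stand-ins for the two tables described above: a catalog of
# columns with PII tags, and a table_privileges listing.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE catalog (table_name TEXT, column_name TEXT, is_pii INTEGER);
CREATE TABLE table_privileges (grantee TEXT, table_name TEXT, privilege TEXT);

INSERT INTO catalog VALUES
  ('users', 'email', 1),
  ('users', 'signup_day', 0),
  ('events', 'event_type', 0);

INSERT INTO table_privileges VALUES
  ('alice', 'users', 'SELECT'),
  ('bob', 'events', 'SELECT'),
  ('carol', 'users', 'SELECT');
""")

# Who can read at least one table that contains a PII column?
rows = conn.execute("""
    SELECT DISTINCT p.grantee, p.table_name
    FROM table_privileges AS p
    JOIN catalog AS c ON c.table_name = p.table_name
    WHERE c.is_pii = 1
    ORDER BY p.grantee
""").fetchall()
print(rows)  # this is the list you would audit regularly
```

Here only alice and carol show up, because bob's grant is on a table with no PII columns; in practice you would run this join against your real catalog and privileges tables on a schedule.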
So once you have this list, you have enough information to decide whether it is okay, whether it is too big, or whether there are important people missing from it, whichever way you go. The third question is: how is my sensitive data being used? You know who has access to it, but how are they using it? For this, you have to log usage across all databases: you need the query history, or the workloads that are running on your databases and data lakes. Most databases and data warehouses store query history in what they call an information schema. Even big data technologies like Presto and Spark have hooks where you can capture the queries, the workloads and the Spark code that runs on them. You can log and store this usage history somewhere, and once you have it, you can start looking for patterns to see if there is any misuse of data. Production databases are different. You typically don't want to capture query history on production databases like MySQL or Postgres, because there's so much CPU and IO going on that capturing query history as well would affect the performance of these databases. So what you typically want is to put a proxy in front of them and give your operations team access through that. I'll show a couple of examples. The first one is capturing query history in Snowflake using an open source project called snowflake_spend by GitLab, which you can run through dbt. There's a dbt package available on dbt Hub: you can download dbt and this package, follow the pretty straightforward instructions to get going, and then schedule the package through dbt to copy the query history from the information schema in Snowflake into a table of your choice.
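Once query history is landing in a table, a first pass over it can be as simple as a few pattern rules. This sketch flags full-table selects and bulk exports; the rule set, user names and sample queries are illustrative assumptions:

```python
import re

# Hypothetical first-pass rules: aggregates are fine, point lookups are
# probably fine, but full dumps of a table are suspicious.
SUSPICIOUS = [
    re.compile(r"\bSELECT\s+\*\s+FROM\b", re.I),  # full-table select
    re.compile(r"\bUNLOAD\b", re.I),              # bulk export
]

def flag_queries(history):
    """Return (user, query) pairs that match any suspicious pattern."""
    flagged = []
    for user, query in history:
        if any(p.search(query) for p in SUSPICIOUS):
            flagged.append((user, query))
    return flagged

# A made-up slice of captured query history.
history = [
    ("support1", "SELECT name FROM users WHERE id = 42"),
    ("analyst1", "SELECT COUNT(*) FROM users"),
    ("intern1", "SELECT * FROM users"),
]
print(flag_queries(history))
```

Note that the aggregate query is not flagged: `COUNT(*)` has no whitespace before the `*`, so it does not match the full-table-select rule, which is exactly the distinction between running aggregates and dumping a table.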
And then, once in a while, maybe monthly or as required, you can go and look at the query text for specific patterns to make sure there's no misuse of the data. Another example is using proxies to capture query history from something like MySQL. There are very popular open source proxies, like ProxySQL for MySQL or Pgpool for Postgres, which you can install as a bastion in front of your actual MySQL instance. All your application workload goes to your production database directly, but any access that your operations or customer support teams need goes through this proxy, and whatever activity happens on the proxy can be logged; I show examples of such logs in the bottom half of the slide. So again, once you have all this information, you can start looking for patterns: who accessed certain tables, what kind of queries did they run, did they try to dump a table or were they only running aggregates? Running aggregates is typically fine, and a specific lookup is probably okay if it comes from customer support, but a select star or an unload is not expected behavior. So to summarize: data compliance, privacy and security is a journey. At least if you don't have any systems in place, you have to start with simple questions, and the three simple questions I found useful are: where is the sensitive data, who has access to sensitive data, and how is the sensitive data being used? You have to think about automation right from the beginning; there is no practical solution without automation. And I'll share a bunch of resources about the projects that I spoke about.

Hi everyone, welcome to another exciting session of Privacy Mode from Hasgeek. This is Hasgeek's initiative for a deep dive into privacy and security as we go increasingly digital in daily life and have a digital footprint like never before.
I'm Rishu, part of the Hasgeek community, and I've been working in the industry for the past 15 to 16 years, dealing with a lot of data platforms and transformations. We've had some really amazing talks in the past, some of which you should definitely check out. Coming to the topic at hand: for a long time, organizations have been collecting data in order to fine-tune and personalize their customer offerings. However, there is increasing focus on the nature of sensitive data and on the threats that compromises pose to the owners of the data. And let's not forget that ownership of the data is still with the customers: they are the ones who own the data, while the companies and organizations that collect it are merely custodians. These organizations are increasingly looking, or are being made to look, into having the right set of processes and tools that safeguard customer data as well as deliver delightful value to customers. And this is exactly where data governance comes in. It primarily looks to address certain key aspects of handling data: usability, that is, how the data is being used; accessibility, that is, who has access to the data; and the third part, security, which is about safeguarding data against malevolent actors both within and outside the system. Data governance initiatives usually result in certain key outcomes. One is compliance, which is adhering to standardized, industry-wide guidelines based on the domain in which an organization operates; this helps organizations gauge how much risk they carry and where their governance maturity levels are. Second is privacy, which is about recognizing that since the data is owned by the customer, the customer gets to choose what data they want to expose to the organization and what data the organization is not going to get ready access to. And this is where a lot of PII conversations also come in.
The third part is cybersecurity and infosec, which we've covered before; it is all about ensuring that malevolent entities do not get easy access to the data and that there are no data breaches. Now, while implementing data governance, there are a bunch of challenges that each organization faces. One is that data is being collected by disparate systems in every possible fashion, and trying to apply a certain set of rules to these vastly distributed systems, which in a lot of scenarios have been collecting data for more than a decade, is definitely not an easy task. Second, there's a lot of complexity in the way data is managed, the way it is handled, and the way it flows across various ecosystems. And the third is the context of data usage: you cannot really apply rules of thumb, or only macro-level governance rules, to all the data in an organization; contextuality plays a very important role. To take us through this journey today, we have Roger from LinkedIn, and I'm very happy to welcome Roger to this forum, not just as a Hasgeek member but also as an ex-colleague. He'll be talking us through some of the aspects of data governance and the commonly available tools and technologies that he himself has worked with. He's a pretty avid practitioner, and some of these tools and technologies can readily be used to bootstrap data governance in your respective organizations. So with that, I hope you really enjoy this talk, and I'll hand it over to Roger.