Hello and welcome. My name is Shannon Kemp and I'm the Chief Digital Manager at Data Diversity. We'd like to thank you for joining today's Automated Data Governance 101, a guide to proactively addressing your privacy, security, and compliance needs, sponsored today by Immuta. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. If you'd like to chat with us or with each other, we certainly encourage you to do so; just click the chat icon in the bottom middle of your screen for that feature. For questions, we will be collecting them via the Q&A section in the bottom right-hand corner of your screen. Or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag Data Diversity. As always, we will send a follow-up email within two business days containing links to the slides, the recording of this session, and any additional information requested throughout the webinar. Now let me introduce our speakers for today, Andrew and Matt. Andrew is Chief Privacy Officer and Legal Engineer at Immuta. He is an internationally recognized expert on data privacy and the intersection of law and AI, and he leads Immuta's legal engineering team, comprised of lawyers with deep expertise in data privacy, security, and data science, focused on automating compliance and oversight activities within Immuta's software platform. Before joining Immuta, Andrew was a special advisor for policy to the head of the FBI Cyber Division, where he was the lead author on the FBI's after-action report about the 2014 Sony data breach. Andrew also served as Chief Compliance and Privacy Officer for the Cyber Division, overseeing privacy and compliance policies for sensitive data across the FBI's 56 field offices. Matt is the Director of Global Solution Architecture at Immuta. He has over 15 years of experience in architecture and engineering in large-scale enterprise data center infrastructure.
Matt came to Immuta from Hewlett Packard Enterprise, where he was Chief Technologist working with enterprises on their hybrid cloud initiatives. We're excited to have both of them here, and with that I will turn the floor over to Andrew and Matt to get today's webinar started. Hello and welcome. All right, this is Andrew Burt talking. Thank you so much, Shannon, and thanks to the folks who dialed in. We're excited to be here today. So I'm going to kick things off with a deep dive into automated data governance: what it is, why we think it's so important, really kind of a 101 introduction to the concept. Then I will kick things over to Matt in a little while, and he's going to provide a deep dive into some concrete use cases. So to start with is the question of what exactly is data governance? This is a term that we see bandied about very frequently, and it's generally just kind of overused. It means, in some cases, everything to everyone, and therefore in practice really nothing to most people. So to start off, I wanted to talk about what we view as the three central components of a data governance practice. First is the notion of privacy. We're going to dive a little bit deeper into privacy and the others, but I think a good working definition is basically this: privacy is about stopping people from using our own data to make insights that we don't want them to make about us. That's the consumer perspective. From an enterprise perspective, it's all about protecting the data that organizations are entrusted with. And so the harm really is about a loss of control: a loss of control over what data can be used to do what. The second key component of data governance is security. This is most frequently thought of as just protecting information from an adversary; we're going to dive into a little bit more detail about what information security means in practice.
And then compliance, which I think can be thought of as just kind of a minimum threshold: are you meeting the exact requirements of laws and policies? So this is data governance in a nutshell. We're going to, of course, dive a little more deeply into these three components, but I think it's worth starting with this question of why is data governance so difficult, beyond simply the fact that it means so many different things to so many different people. I think, at root, it's so difficult because each of those three key components, security, privacy, and compliance, has its own really difficult problems. So I want to walk through the core central problems that make protecting each of these components so difficult in practice. Starting with privacy: what exactly is the root cause of privacy being so hard to protect in practice? I think the answer there is the fact that there is just too much data. We'll get into this, but we generate as individuals, and we collect as organizations, a mind-bogglingly large volume of data, and every day that number increases. As a result, it can be very hard to protect data and the insights that data can be used to generate. So on this problem, I wanted to start with a story about Judd and Leslie. This is Judd Apatow and Leslie Mann, the famous director and actor, who also happen to be a married couple. This picture was taken June 21st of 2013; here Judd and Leslie are in a taxi cab in New York City, living very happy lives. I think this was snapped by a member of the paparazzi. Around this time, a few months later, a New York City researcher named Chris Whong is browsing Twitter, this is March of the following year, and he sees a post from the Taxi and Limousine Commission. It's basically just a chart of traffic patterns in the city.
And out of curiosity, he makes a freedom of information request, asking this public entity to release records for 12 months of data, all the data in 2013. As a result, he received 50 gigabytes of data. On the right here, you can see the email that he received from the actual Taxi and Limousine Commission, and this is a sample of what was actually released. They released 12 months of data containing taxi pickup information, drop-off locations, times, et cetera. And at the bottom here, they actually tried to protect this a little bit more: some of the actual columns were hashed. So this seems pretty harmless, right? It's definitely not malicious; it doesn't seem as if they're doing anything that could harm anyone. The truth is, this caused a huge headache for Judd and Leslie. Through this information, folks in the media were able to figure out how much they tipped. They were able to do this by geotagging this picture and querying the medallion at that time, so they were able to figure out what the actual fare was and what the tip amount was. One of the reasons why I'm starting off with this is that this is an example of a link attack. On its own, here on the right, that New York City taxi data doesn't seem to be particularly sensitive. And on the left here, that photo of Judd and Leslie happily driving in a New York City taxi cab, that also doesn't seem to be particularly sensitive. But when we combine the two, we can derive insights that are not entirely predictable from either data set alone. And so as we start to generate more and more data, link attacks become an increasing problem. Of course, it's worth saying that this wasn't just a problem for Judd and Leslie. It turns out that a number of other celebrities were exposed in this, and for some of them it actually did create a PR headache, as some were alleged not to have tipped at all.
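As an aside for the technically inclined: the taxi data's hashing was reportedly an unsalted hash over medallion numbers, and medallions follow a few short, known formats, so the entire space of possible values can simply be enumerated. Here is a minimal sketch of that style of attack; the single four-character format and the function names are illustrative, not the exact scheme the Taxi and Limousine Commission used.

```python
import hashlib
import string
from itertools import product

# Illustrative only: assume every medallion looks like
# <digit><letter><digit><digit>, e.g. "5X55". That is only 26,000
# possibilities, so hashing them all takes well under a second,
# which is why an unsalted hash is not anonymization.

def build_lookup():
    """Precompute hash -> medallion for every value in the small format space."""
    lookup = {}
    for d1, ltr, d2, d3 in product(string.digits, string.ascii_uppercase,
                                   string.digits, string.digits):
        medallion = f"{d1}{ltr}{d2}{d3}"
        lookup[hashlib.md5(medallion.encode()).hexdigest()] = medallion
    return lookup

LOOKUP = build_lookup()

def deanonymize(hashed_medallion):
    """Reverse a 'pseudonymized' medallion hash, if it matches the format."""
    return LOOKUP.get(hashed_medallion)
```

Pairing the recovered medallion with a timestamped, geotagged photo is the link step: neither source is sensitive on its own, but joined together they reveal the fare and the tip.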
Here's one example: Kourtney Kardashian and Scott Disick. This incident also caught Bradley Cooper, who as a result was alleged not to tip when he takes taxi cabs. And the same thing for Olivia Munn. So, while this does seem kind of fun and light, it did create some headaches and some problems for folks. And of course, this is again just symptomatic of the broader problem, which is that as we generate more and more data, it's very hard to understand what that data can be used to do and what about that data is particularly sensitive. This comes from a study at MIT. There was a credit card transaction database of over a million different users, and that database was thought to be anonymized. It turns out that researchers were able to prove that, with just a few data points from outside that data set, they were able to de-anonymize the data. So this is really all about the increasing prevalence and the increasing ease of link attacks to reveal potentially sensitive information. As we go forward, as link attacks become easier to conduct and there's more data that we can actually use for these types of attacks, we have to accept that privacy as we know it is incredibly difficult, if not, in some cases, impossible to uphold. Things need to change for us to adequately protect the privacy of the data that we generate and the privacy of the data that we collect. And so here at Immuta, we are very firm believers in purpose restrictions on data, a key component of an increasingly large number of regulations on data, which we'll talk about shortly. The key concept is that limiting how data can be collected is just not enough. We need to assume that more and more data about us is going to be released, and as a result, controlling how that data is used is really central to how we can actually protect it.
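A purpose restriction is easy to picture in code. Here is a deliberately minimal sketch; the dataset names, purposes, and function are all hypothetical, and a real platform would enforce this at the query layer rather than in application code.

```python
# Access is granted only when the analyst's declared purpose is on the
# allow-list for that dataset: the control sits on *use*, not collection.
# All names below are illustrative.

ALLOWED_PURPOSES = {
    "taxi_trips": {"fraud_detection", "traffic_planning"},
    "credit_card_txns": {"fraud_detection"},
}

def check_access(dataset: str, declared_purpose: str) -> bool:
    """Raise PermissionError unless the purpose is allowed for this dataset."""
    if declared_purpose not in ALLOWED_PURPOSES.get(dataset, set()):
        raise PermissionError(
            f"purpose {declared_purpose!r} is not permitted for {dataset!r}")
    return True
```

The point of the sketch is that the same data set can stay collected while each access is checked against a declared purpose, which is the shape of control these newer regulations push toward.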
It also leads to this question of how do we actually balance data privacy with utility? There's a wonderful quote from a legal scholar, Paul Ohm: data can be either useful or perfectly anonymous, but never both. The key concept here is that as we seek to protect data, we need to realize that there is a trade-off at the core of all data protection efforts. With privacy, and with security as we'll see shortly, what we're really doing is making a trade-off between privacy on the one hand and utility on the other. And we can't have a perfect amount of both. In this chart here on the right, you can think of accuracy and utility as being quite similar. So what we're really saying is that making sure individuals' privacy is protected is not a science, it's an art, and it's an art of making a trade-off between privacy on the one hand and utility on the other. So that's why privacy, I think, is so difficult in practice; the sheer volume of data is a key component of the privacy problem. And so on to security. Why is security so hard to implement? I would say that the key root of the difficulty there is just an overwhelming amount of complexity. To start with, let's get a baseline definition of information security. It's traditionally defined by what's called the CIA triad. We have confidentiality, which means only the right people can view the right data; integrity, only in the right form; and availability, at the right time. These three together comprise the traditional conception of information security. You can also think about attacks on each of these three as different attack vectors. (Excuse me, I'm getting over a cold here, so I'm going to have to pause a few times to drink water. Thanks for bearing with me.) So you can think of each of these three as ways to attack the security of data.
You might attack its availability; you might attack its integrity, say by altering or tampering with data that someone might want to access; or you might attack its confidentiality, making it so that the wrong people are viewing that data. And when we think about today's IT landscape, it is just deeply defined and characterized by increasing complexity. Here on the left, we have some statistics about all the data we generate: 2.5 quintillion bytes of data are generated each day; by some estimates, 90% of the data in the world was generated in the last two years; and there are an estimated 50 billion devices that are going to be connected to the Internet by next year, which is over six per person on the planet. Just thinking about how to protect all that data, and frankly even just how to understand it, is incredibly laborious, and it's an incredibly complex environment. So that's the data that we are generating as individuals and then collecting as organizations, on the one hand. Now, when we think about the software we use, it is similarly defined by a huge amount of complexity. We have some statistics here on the right; there's a wonderful chart from NIST documenting the known vulnerabilities in software systems over the last couple of years, and it is very clearly increasing over time. Similarly, the complexity of the software systems that organizations are using and tapping into also appears to be increasing. And as we begin to adopt more and more AI- and machine-learning-based tools and techniques, all of these trends are becoming exacerbated. So all of this adds up to complexity being, I think, at the root of the modern IT landscape. To paraphrase the cybersecurity expert Bruce Schneier: complexity is the enemy of security.
So it's very, very difficult to think about protecting an environment that is defined by lots and lots of different moving components that are incredibly hard to track. In fact, we talked about this trade-off in the world of privacy, and that trade-off really translates to the world of security. There's another wonderful quote that comes from cybersecurity pioneer Willis Ware, who said that basically the only computer that's completely secure is a computer that no one can use. So again, on the one hand you have full utility, and on the other hand you have security, and when you're trying to achieve both, what you're really trying to do, again an art, not a science, is figure out the right way to make a trade-off. So, OK, that was the privacy problem and the security problem; now on to compliance. What is at the root of the compliance problem? I would say, and this is a little bit unintuitive, that the key challenge here is simply the fact that there's not enough time. Now, I'll get to exactly why I think time is the problem, but we can start off with this basic fact: the number and the complexity of regulations on data is increasing drastically over time. I like to say that if you just close your eyes and throw a dart at a map of the world, chances are that dart is going to hit a state, a jurisdiction, a region where the number of policies and laws and regulations on data is increasing. Just as a sample, we looked at the U.S. alone, just at the state level in 2019, not counting federal privacy laws: over 150 privacy laws have been proposed in 25 states throughout this year, and over 250 laws on information security have been proposed in 45 states. And of course, all of these different laws have huge costs; estimates say that all of this could add up to some very high numbers for organizations.
A few examples of the major regulations that are really driving the data security and privacy landscape: GDPR, which most folks are familiar with, came into force in May of 2018, really, I think, as the first wave of this new era of robust, stringent privacy and security laws on data, with fines of up to 4% of global revenue. We have the CCPA, the California Consumer Privacy Act, which was passed just about a month after the GDPR came into effect; it'll come into effect next year. There are a lot of developments going on, and the CCPA is a little bit of a moving target, but I think it's fair to say that it is the most stringent privacy law in the United States. And I would say, just as a personal prediction, that I expect it to spur into action some national federal laws in the next year or two. Then outside of Europe and North America, there's China's cybersecurity law, which was enacted a few years ago. Again, this plays into the broader trend: it is increasingly difficult to use data, there are more and more penalties, and the compliance burden is growing when it comes to protecting privacy and data security. And of course, non-compliance has very serious consequences. Here are some headlines related to security breaches, many of them tied to some of the laws I referenced on the previous slide: Equifax, Marriott, British Airways, among others. Again, the key point is that as governments enact more and more regulations on data, the fines for mishandling data are increasing, and I would expect them to keep increasing over time. So back to time: why is it that time is really the main challenge here? Because it involves a huge amount of analytical work to try to understand and integrate all of these different complex laws across all of these different jurisdictions.
If you look at this map here on the right, it comes from DLA Piper, the global law firm, and it's what they call their heat map of laws on data, ranging from heavy and robust to moderate and limited. When you look at a map like that, if you are a global organization, or an organization that's just operating in multiple jurisdictions, it can be incredibly difficult and incredibly time consuming to try to figure out all the different rules that attach to the data you want to use and to integrate them. These processes, to try to understand what's going on and to implement all these different rules, are not simple and they are not fast. We simply do not have enough time to integrate and comply with all these rules using traditional methods, which we'll get into. So when we think about how organizations are actually addressing all of these problems, the privacy problem, the security problem, the compliance problem, what are they actually doing? What is the center of the current approach? The answer is that organizations are, in our view, overwhelmingly approaching this challenge passively. By a passive approach, what we basically mean is ad hoc: it's defined by one-off and reactive ways of watching new laws and provisions on data come in and then seeking to implement them. So when you think about what it is that makes data governance passive in the traditional approach, here are five signs. First, time-consuming meetings and long policy memos; we at Immuta call this the meetings-and-memos approach, where in order to make a decision or to implement any particular regulation on data, you have to gather lots and lots of different stakeholders over long, sometimes excruciatingly long, periods of time. And then the output is usually a very complicated policy memo, very difficult for most folks, especially technical folks, to understand.
Other traditional signs of passive data governance are custom permissions and varying policies per database; we call this the snowflake approach, where policies as applied to specific databases are custom created and very far from standardized. And then there's creating new copies of data to satisfy compliance and privacy concerns. That looks something like this: a group wants to use a particular set of data for a project, that data is then copied, and something like, you know, a customer ID number might be masked or hidden for that group. These are all signs that data governance is being implemented in a passive way. Another good test is just to ask yourself: how long does it take between when your organization collects data and when that data can be accessed and used? Is it days, weeks, or months? For virtually every organization we come into contact with, the answer is almost always "too long"; it takes too long between when data is collected and when it's able to be used. But there are many organizations we've dealt with where the answer is in the many, many months, sometimes approaching years. And of course, the longer it takes from when the data is collected to when it can be used, the less valuable and the less reflective of the present that data actually is. So how can we move away from a passive approach to data governance? What can we actually do to address those three problems I talked about that really define the data governance landscape? The answer is automated data governance. It is the subject of this webinar; it's something that we at Immuta take extremely seriously, and we spend huge amounts of time figuring out the best way to do this. I'm going to walk you through what this is and the best way to implement it. So to start with, what exactly is it?
We have here a definition: automated data governance is the process of proactively applying rules on data to ensure compliance and drive analytics. More broadly, a good way of understanding automated data governance is to understand data governance as really being driven by two competing objectives. On the one hand, there's the objective of controlling data: controlling the privacy, security, and compliance aspects of that data. On the other hand, there is the objective of the actual data analytics program, which is to use that data. These two objectives, governance and turning data into insights, which is where the return on investment from these data analytics programs comes from, are competing. It means that at the core of data governance is a tension created by the problems we just discussed, and automated data governance is, in our view, the most effective way of resolving some of this tension, and of realigning governance programs and data analytics programs so that they're all effectively positioned to achieve the same goal. Automated data governance has five pillars, and I'll walk through each of them. It starts with any tool and any data: any tool should be able to be used with an organization's data, and any data should be able to be used with those tools. No copies are made of that data. Any level of expertise can interact in this compliance and governance environment. And all policies are stored in one place. I'll walk through each of these in more detail and provide some more context on why they're so important. To start with, pillar number one is any tool. Think about creating a long-term, effective data governance program.
When we look at traditional and passive approaches to data governance, there's something we'll call vendor or tool lock-in, where a data governance program will come in and say you can only use these tools, and if you use something outside of these tools, you're not going to be compliant. It turns out that over the long term, that's actually a recipe for non-compliance. The data science and data analytics environment is very far from static. It's developing very quickly, and data scientists will discover and want to use new tools: some tools today that we might not necessarily even know about, and tools tomorrow that haven't yet been created. So locking any data science or data analytics program into specific tools for compliance reasons is a really big problem, and it's something we see quite frequently. We need to get out of that; we need to make sure that the data governance environment is tool agnostic, so that any tool an analytics program might want to plug in is supported. That's the first pillar. The second pillar is any data. For the same reasons that we don't want to lock data analytics programs into using only a handful of different tools, we don't want to lock them into using any specific types of data. Data comes in all sizes, types, and storage technologies, and we need to make sure that our data governance program is agnostic to all of that so that, again, we're not locking in data science and analytics programs, and we're ensuring that flexibility and adaptability are actually built into the way these programs access and use data. Pillar three is no copies. I mentioned earlier how copying data for one specific project might result in hashed customer ID numbers. Copying data for compliance reasons does seem like a good solution in the short term, but over the long term it is incredibly passive.
It's incredibly reactive, and it doesn't scale. One of the reasons it doesn't scale is because when you take that approach, and we've seen it many different times, over time what you have is a massive number of copies of data just kind of floating around the organization that are very difficult to control, or even simply to understand. This makes updating policies, or keeping track of how policies are being implemented, very difficult when you have all these different copies floating around. So when we're talking about moving away from this passive approach, it's incredibly important to make sure that everyone is looking at the same live data and that we're not creating new copies that just accrete over time. I should say, when we talked about realigning data governance objectives with data analytics objectives, this I think is a really good example, because no copies actually means better data. It means more current data; it means data that is going to tend to be more valuable for these programs. Pillar number four is any level of expertise. Just in the conversation we've been having today, we've touched on all sorts of different topics, and good data governance really requires a huge mix of different levels and different types of expertise. Many passive and traditional data governance environments are incredibly segmented, and they can be incredibly hard to understand: the technical folks don't really understand the policy or where it's coming from, and the policy, privacy, and compliance folks don't understand what's happening technically.
So when we think about actually realigning these competing objectives and scaling data governance in an automated fashion, what we need to make sure we're doing is enabling any level of expertise, technical and legal alike, to interact and really collaborate, so that this environment is able to ensure that data is protected, and to do so over the long term. The last pillar is one policy in one place. We talked about policies as snowflakes under the traditional, passive approach to data governance, and we simply can't have a sustainable, efficient data governance program if policies are stored all over the place, if there's no single sheet of music, no single source of truth for what policies are being implemented where. So as we move toward automating data governance, we need to make sure that all policies are represented in one place. That makes it easy to understand what's actually happening in the current moment, and it also makes it very easy to update and change policies over time. We talked about how complex and how time consuming it is to integrate the different policies coming from all the different laws on data, and if one thing is for certain, it's that this environment is extremely dynamic and it's going to change. So we need to be thinking about how we can set ourselves up to update these policies as they change over time. Now, if you've been listening to this presentation so far and thinking, automated data governance sounds good, it sounds like something that might be able to help my organization, the next question is: what can you actually do to get closer to being able to implement a version of automated data governance? We really have five questions for you, and I'll walk through each of them.
What they're geared to do is get you thinking about the things you need in place to be able to effectively implement automated data governance. If you can answer these five questions, you will be, I think, extremely close to being ready to actually automate the data governance processes in your environment. To start with, and this is really the threshold question, perhaps the most important: ask yourself what process governs how an analyst receives new data in your organization. If you can understand that process, from where and when new data is actually collected by your organization to when that data can actually be accessed by an analyst, and if you can understand all the different midway points, sometimes choke points, between them, you'll be in a very good place to understand what the current environment looks like and then to start thinking about efficiencies. Question number two is understanding where your policies actually come from and what rules you actually care most about adhering to. It looks different for different organizations, according to their geographic footprint and their priorities, but understanding where all these policies are coming from is really key, and it's also key so that you can make sure they're accurate and up to date as regulations change. The third question is understanding where your data is and who's responsible for it: what different groups or units are tasked with collecting, managing, and performing the ETL on data. There are all sorts of different ways that data is gathered and piped across organizations, and understanding that landscape, very similar to question one, is really key. The fourth question is understanding how data is used, cataloged, and tagged. We at Immuta actually think about use, cataloging, and tagging each as basically different metadata fields.
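To make the metadata idea concrete, here's a rough illustration, with entirely hypothetical column names and tags, of how tags let one simple policy ("hash anything tagged pii") scale across data sets instead of being rewritten per database:

```python
import hashlib

# Tags live alongside the schema as metadata. The policy below never
# mentions a specific table or database; it keys off the tags alone.
COLUMN_TAGS = {
    "customer_id":  {"pii"},
    "email":        {"pii", "sensitive"},
    "purchase_amt": set(),
}

def apply_policy(row):
    """Return a copy of the row with every pii-tagged column hashed."""
    masked = {}
    for col, value in row.items():
        if "pii" in COLUMN_TAGS.get(col, set()):
            masked[col] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        else:
            masked[col] = value
    return masked
```

Because the policy keys off tags rather than specific tables, onboarding a new data source only requires tagging its columns, not writing another snowflake policy.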
If you can understand that, if you can understand the purposes your data is used for and the tags, like sensitive, or what you consider to be personally identifiable, then you're in a very good position to write some very simple policies that can take advantage of all that metadata and, again, scale over time. And then lastly, look at what types of technology you currently use to share data faster and control data more effectively: understanding what technologies you're currently implementing that might help the data governance landscape in your organization. If you have even just a basic conception of the answers to each of these five questions, I think you will be a long way towards implementing the approach we discussed today. And on that note, we actually have a white paper focused on automated data governance that bears the same title as this webinar; you can download it at the link here. I will leave this up for another second or two, and then I'm going to turn it over to Matt Vogt. Of course, both of us will be here during the Q&A session to answer any questions. Matt, over to you. Thank you. Yeah, I appreciate it. I've been participating in some of the chats that have been going on, and I really appreciate everyone's participation in those conversations. As I walk through some of these examples of what customers are doing to automate their data governance, I'll probably tie in some of the questions, comments, ideas, and thoughts that came up there too. So let's move forward. Essentially, what I wanted to do was walk through a few examples of customers and interactions that I've had, of what people are doing to address a lot of the challenges Andrew spoke about. Let's move forward one. So, Cognoa is an interesting company. They're a medical startup company out on the West Coast.
And what they do is build and provide analytics for early diagnostics and personalized therapies, specifically for children. Obviously their data set is highly governed and highly regulated: it is not just healthcare data, but healthcare data on minors. So there are some very well understood regulations that govern their data. But as Andrew spoke about, things like CCPA continue to come up, and they even collect data on constituents outside California. Regulations continue to evolve. So even though they have some well understood regulations, there's a lot of variability in the policies that need to be applied to their data. Of the pillars Andrew spoke about, the two they really keyed in on were this concept of any tool and this concept of no copies of data. They were using tools of choice, predominantly Jupyter notebooks and just SQL access, to run analytics against their data, but they actually moved to using Databricks and Tableau. So they needed the flexibility to support their existing tool set as well as move to new tool sets to consume data, build models, and so on. And with no copies of data, they needed a way to give different constituents secure access to the same data without making an extra copy of it, which is what they had been doing: in order to develop these models, they had to make a de-identified copy of the data set. They had to go through HIPAA controls to satisfy what's called safe harbor, and they were ending up with essentially a month-long process to de-identify data. And when there was a new policy that had to be implemented, they again had to rerun this static process to make a de-identified copy of data that they could then hand to their developers to develop these models. So what they put in place was a centralized platform that allowed their developers live access to the data as it came in.
And then customized, dynamically de-identified access to that data from whatever tool they were coming from, so they could switch between tools and get the same access to data as they would have had to that static copy of the data set. But again, it was dynamically removing PHI and HIPAA-sensitive data for them to comply with safe harbor. Their data was actually stored in Aurora; they were using RDS instances and switched data platforms halfway through their project. But this concept of any data satisfied their requirement of both moving data systems and moving tooling without losing any visibility into the data. And so we'll move on to the next example, which is an automotive company based out of Europe, a large multinational auto corporation. As you could probably imagine, GDPR is a paramount concern to them. Their predominant use case was looking at telemetry coming off of their automobiles. If we tie back to what Andrew talked about, the tension of ROI versus governance, being able to get value out of my data versus giving it privacy, compliance, and security: they were limited in the number of telemetry points they could collect out of the vehicles, because there was no way to govern the usage of that data once it came in. And so they used a combination of things: again, this automation of the application of policies to data, as well as some advanced techniques, something called differential privacy, which is a mathematical guarantee of privacy for every record in a data set. And because of these advanced privacy techniques available to them, as well as this concept of purpose-based controls, which I'll get back to in a second to tie into a conversation we had in the chat, those two things together allowed them to essentially ten-fold increase the telemetry points they could collect off of a car, which allows them to get more data off the car while complying with GDPR.
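To make the differential privacy idea a bit more concrete: the classic approach adds noise calibrated to how much one record can change a query's answer. What follows is only a minimal sketch of the textbook Laplace mechanism on a toy telemetry count, not the platform's actual implementation; all names and numbers are invented for illustration.

```python
import random

def laplace_noise(scale: float) -> float:
    # The difference of two exponential draws is Laplace-distributed.
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def dp_count(records, predicate, epsilon: float) -> float:
    # A counting query changes by at most 1 when one record is added or
    # removed (sensitivity 1), so Laplace noise with scale 1/epsilon
    # yields an epsilon-differentially-private answer.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Toy telemetry: how many vehicles reported a speed over 90?
telemetry = [{"speed": s} for s in (42, 88, 130, 95, 60)]
noisy_answer = dp_count(telemetry, lambda r: r["speed"] > 90, epsilon=0.5)
```

A smaller epsilon means more noise and stronger privacy for each vehicle's record; the aggregate statistics stay useful while no single car's data is exposed.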
And now they can do some advanced analytics, specifically something like re-identification workflows while doing analytics on vehicles: completely anonymized data goes into their analytics, and when they identify a car that's an anomaly, something's going to happen, you know, a sensor is going wonky, which I think was the technical term, they need to be able to identify the owner of that vehicle. But obviously the analyst doesn't have access to the identified data sets, so they needed a workflow to be able to re-identify data sets post-analytics. And again, this central platform, with it all being one copy of data with customized views based on people's roles, permissions, and attributes, allowed them to automate that process. The next organization we'll talk about is LMI. They are a government contractor, and essentially they are in the business of solving problems with analytics on government data. They work with most defense organizations within the US military, and if you take all five pillars of what we think an automated data governance organization should entail, they were really going hard after all five. They have data from everywhere, not just within their purview: even data just being sent to them, or data they can connect to external to the four walls of their data center. But then they are able to provide that dynamic view of de-identified data where it sits, again not copying the data, and get access to it from any tool; being able to say to an organization that doesn't even own the data that they can access it outside those walls, but still maintain governance, privacy, and security controls at the policy level to enforce those.
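As a rough sketch of what such dynamic, no-copy de-identification can look like: masking is applied at read time, so every tool sees a governed view of the single underlying copy. This is a hypothetical illustration, not LMI's or Amuda's actual system; the column names and masking rules here are invented.

```python
import hashlib

# Hypothetical policy: columns treated as direct identifiers.
MASK_COLUMNS = {"name", "ssn"}
HASH_COLUMNS = {"email"}

def deidentify_row(row: dict) -> dict:
    # Build a masked view of one record; the source row is never modified.
    view = {}
    for col, val in row.items():
        if col in MASK_COLUMNS:
            view[col] = "***"
        elif col in HASH_COLUMNS:
            # A one-way hash keeps the value joinable without exposing it.
            view[col] = hashlib.sha256(str(val).encode()).hexdigest()[:12]
        else:
            view[col] = val
    return view

def query(table, columns):
    # De-identification happens dynamically at query time, so there is
    # never a second, static de-identified copy to maintain.
    for row in table:
        safe = deidentify_row(row)
        yield {c: safe[c] for c in columns}
```

Switching tools then just means pointing a different client at the same query layer: the policy travels with the view rather than with a copy of the data.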
Now, for the any expertise pillar: they have folks authoring policies who don't know SQL, don't know code, don't know Python, and they need to be able to author policies that are enforceable across all connected data. Those policies also need to be understood and interpretable by the people on the consumption side, the BI and data analysts, who need to understand the policies being applied to the data they're consuming. That ties back into a conversation we had in the chat about where the BI and analyst community falls in responsibility in this governance world. And then obviously they need a central policy location: one place to author policies to govern and control all their data across these six authoritative systems, and to consistently apply policies to data no matter where it's coming from and no matter the tool it's going to, whether that data is on premises or in the cloud. One policy, one place. From an enforcement perspective, one of the key concepts not to lose there is this consistency of application: when we tell a system from a policy or governance perspective, look, we need to mask these kinds of data, then, going back to pillars one and two, that policy needs to be applied regardless of the tool and regardless of the data. And lastly, I want to talk about a bank, an international bank. Their use case was essentially that they built a big data lake. They were really proud of it; the idea was to bring everybody together, all the analysts, all the data scientists, all the data builders. But the problem was that nobody could use the data lake, because they could not put data in, because they couldn't effectively govern data coming out. So frankly, they did not have a data lake. And so, what they were able to do is provide self-service access to data, to begin with, with no custom tooling.
All the data in a single place, with customized views and no extra copies of data. But some of the biggest aspects for them of automating this governance were, again, this any level of expertise. They wanted the central governance and legal team to be able to author policies in the platform, have them be enforced on the data in real time, and have both the data owners and the people consuming data understand the policies being applied to their data as those policies are actually enforced in real time. So it's not just about policy authoring and policy management; it's about policy enforcement as well. And they're taking this across 8,000 different data sources, abstracting these policies out of the data and out of the tool to, for the very first time, grant access to data to people and organizations that previously did not have it. In many cases that was HR and marketing, who wanted access to data, but there was no good way to govern that usage. And the big thing they put in place, and this is a concept we talked about in the chat, is that just because I can use data doesn't mean I should use data. So there are a couple of things: how do we enforce, how do we control, and how do we audit the should of data access? GDPR tries to address this with this concept of appropriate data usage, meaning you can only use data for the reason it was given to you. Andrew spoke about it at the beginning; CCPA has language about this, HIPAA has language about this. It's easy enough to record in a database that I collected this data for this reason, and therefore people who are accessing this data should just know, because it's part of our policy, that they're only accessing data for a specific reason, say marketing. That's all well and good, but how do you actually enforce that?
And so you really want to be looking at platforms that can enforce purpose, or sometimes it's called context, as part of analysis, as part of someone's job or day-to-day activities. An analyst might be working on a project for the marketing group in the morning and in the fraud group in the afternoon; it's the same person, but obviously their context has changed. So how do we capture context as part of the analytical process? I think again that's the marriage of where BI developers and analysts fall in the process of governance: what is my context, what is the context of the analyst or the analytic, and being able to capture that. And so what this bank did is essentially put a rule on all data saying: these are the five to ten different purposes or reasons you can use data for. You have to declare you're working on at least one of these purposes before you have access to the data. So it's available for these ten purposes, but you have to at least tell us when you're accessing data for a given reason. And then, at the very end of the spectrum of compliance, is auditing: we need to be able to capture, audit, and then report on the purpose of data usage. So those are four use cases I wanted to walk through of organizations that have successfully implemented an automated data governance approach. And some of the key pillars for them: again, it's not that everybody did all of one through five; they certainly focused on the different pieces that were important for their processes. But one or more of these five were certainly a component of it. So from there, I think we will kick it back to either Andrew or Shannon. Matthew, thank you so much. Andrew, thank you so much. These have been great presentations from both of you. We've got a lot of questions coming in, and just to answer the most commonly asked question.
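The declare-a-purpose-then-audit pattern described above can be sketched in a few lines. This is a hypothetical illustration of the control flow, with invented purpose names, not the bank's actual policy engine:

```python
from datetime import datetime, timezone

# Hypothetical list of approved purposes attached to a data set.
ALLOWED_PURPOSES = {"marketing", "fraud_detection", "customer_support"}

audit_log = []  # every access is recorded for later compliance reporting

def access_data(user: str, purpose: str, dataset: str) -> str:
    # Enforce: the analyst must declare an approved purpose up front.
    if purpose not in ALLOWED_PURPOSES:
        raise PermissionError(f"{purpose!r} is not an approved purpose for {dataset}")
    # Audit: capture who touched what, for which reason, and when.
    audit_log.append({
        "user": user,
        "purpose": purpose,
        "dataset": dataset,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return f"governed rows from {dataset} under purpose {purpose}"
```

Because the purpose is declared per access rather than per person, the same analyst can legitimately work under "marketing" in the morning and "fraud_detection" in the afternoon, and the audit log captures each context separately.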
Just a reminder, I will send the follow-up email by end of day Thursday, with links to the slides, links to the recording, and anything else requested throughout. So diving in here. And if you have questions, feel free to submit them in the bottom right-hand corner in the Q&A section. So, who defines the data policy: the data governance team or the privacy team? And how do we measure compliance with the data policies? Yes, I'm happy to take that. This is Andrew. Matt, you can jump in and correct me if I misspeak. So I think, honestly, that's a wonderful question. Part of the reason why it's such a good question is because the answer seems to change, from my perspective, on an organization-by-organization basis. If you look at exactly what it is that privacy groups are responsible for, that tends to vary pretty wildly across organizations, and the same is true of governance. So I assume that for whoever is asking that question, within their organization the boundaries between what constitutes governance and what constitutes privacy are relatively clear. But that's not something we're really seeing in the market; there's just a lot of overlap and a lot of variation. The one point I would make in response, and one of the things we see different organizations really struggle with, is again involving all the different types of expertise. So for this example question, there might be an organization where privacy ends up being the most legally focused folks. They're the ones most familiar with the memos-and-meetings approach, or the complex legalese that gives rise to all these different policies on data. And the governance team might be on the more technical end. And so it's very common for us to see this kind of split.
On the one hand, the folks thinking about the laws; on the other hand, the folks thinking about the technology. And so I think the key there is to make sure they're all talking to each other and effectively collaborating, because if they can't collaborate, if there's that division, you're going to end up with lots of different bottlenecks and barriers to progress. And, you know, I'll throw this out to both of you: what do you think about metadata management? I think it's crucial, especially if you're looking at automating this process. Because if you look at all the disparate data systems coming in, even across schemas and data systems, nothing's going to be labeled the same. Think of the column names. And so it's really important to have a robust metadata system in place where you can effectively categorize or tag data: think of a social security number, a phone number, a name. You want to be able to tag data in multiple ways: you could say this is a name even if the column name doesn't match up, or this is a credit card number where the column name might be CC. You couldn't tell by the column name, but you can tell by the data. And so you need a place to have that kind of commonality of metadata, and how that translates into automating governance is that you need to be able to write rules against that commonality. You need to be able to define globally, across all connected data: this is how I treat names, this is how I treat PII, this is how I treat specific entities of PII like credit card numbers, addresses, and so on, dealing at that metadata layer rather than at the physical layer. Yeah, I think that's exactly right, and I would just add that a really key component of that is thinking about metadata as being hierarchical. If you structure that correctly, you can have kind of a top level.
So you have parent tags and then child tags, and what you can do is have very, very granular metadata attached to all the raw data, and with the parent tags you can create just a few very simple, very easily understandable rules. And that's something we've found is really key in terms of effectively having the policies mapped to the data, but also making sure that the policies themselves don't spiral out of control and become too complex. Yeah, because what you get with that is the ability to author and enforce a policy that says: this is globally how I deal with PII; everybody has access to PII except for these people, or except for this purpose. Right. And so you get that by having robust metadata management. So this question came in as you were giving your examples: your example on linkages was very interesting. How does one consider all possible associations to control for disclosure of PII? That might be for me; I believe that was a question focused on link attacks more broadly. Yeah, I may have mixed that up, but not a problem at all. So I mean, that's a great question. I think the first step is understanding, excuse me, sorry for the coughing today, I think the first step is understanding that it's impossible. Again, we talk about privacy and security protection being an art, not a science. If you're going to use the data, you are going to create some sort of risk. So the first starting point is just to understand that it's impossible, so we don't have unreasonable expectations. And then, once we have grounded ourselves in the realities of risk management, I think it becomes a little bit easier to say: OK, what is it that someone wants to use the data for? Does it justify any potential risks that we think might be incurred from it?
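Going back to the hierarchical metadata point for a moment: parent and child tags let one globally authored rule cover many columns at once. Here is a minimal sketch of that roll-up idea, with an invented hierarchy and invented actions, not any particular platform's tag model:

```python
# Hypothetical tag hierarchy: each child tag rolls up to a parent,
# so one rule written against "PII" covers every child tag.
TAG_PARENT = {
    "credit_card": "PII",
    "ssn": "PII",
    "name": "PII",
    "diagnosis": "PHI",
}

# Policies are authored against the small set of parent tags.
POLICIES = {"PII": "mask", "PHI": "deny"}

def effective_action(tag: str) -> str:
    # Walk up the hierarchy until some level has a policy attached.
    while tag is not None:
        if tag in POLICIES:
            return POLICIES[tag]
        tag = TAG_PARENT.get(tag)
    return "allow"  # untagged or unregulated data passes through
```

The granular child tags stay attached to the raw columns, while the handful of rules live at the parent level, which is what keeps the policy set from spiraling out of control as data sources multiply.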
And then also, what can we do as a prudent matter to reduce those risks? A lot of different regulations encourage what's known as minimization, which basically means you should be limiting the amount of data that people get as a default. You should never be giving someone more data than they actually need. And there are a variety of ways to put minimization into practice. You can use only statistical samples of larger data sets; you can mask or hash particularly sensitive values. There is a whole host of techniques you can actually implement. But more broadly, it's really just about thinking, for one specific purpose, for one specific task, what data is really necessary, and not going beyond that. So thank you for the question; it's a wonderful question. Yeah, a lot of great questions coming in. And if you do have questions, feel free to submit them in the bottom right-hand corner. We've got time for a couple more here. So: one of the five pillars is to have no copies. How do you feel about using replicated databases to limit users' access to the raw or sensitive data, as well as removing traffic from the original database? Yeah, from the infrastructure and architecture side of my brain, I wouldn't necessarily consider a replicated database a copy of data. Often I think of replication for availability, disaster recovery, and so on, and the security rules still come into play in that arena. But I also often think of transactional systems, which I believe is what this question is getting after: from a load perspective, getting some relief by creating a replicated data set. And in my experience, the predominant load on data is for analytics. And so, a replicated data set that serves analytics, we don't consider that a copy of data for governance purposes.
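The minimization techniques listed above, statistical sampling plus hashing of sensitive values, can be combined into one small helper. A sketch under invented field names, assuming a list-of-dicts data set, not a production tool:

```python
import hashlib
import random

def minimize(records, sensitive_fields, sample_rate=0.1, seed=42):
    # Minimization sketch: hand out only a statistical sample, with the
    # sensitive values inside it hashed, never the full raw data set.
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = [r for r in records if rng.random() < sample_rate]
    out = []
    for r in sampled:
        row = dict(r)  # leave the original record untouched
        for field in sensitive_fields:
            if field in row:
                row[field] = hashlib.sha256(str(row[field]).encode()).hexdigest()[:10]
        out.append(row)
    return out
```

The default here gives an analyst roughly a tenth of the records, which is usually enough for exploratory statistics while keeping exposure to the default minimum the regulations encourage.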
I think what we're really trying to hone in on is reducing the need to make a copy of data, basically a snowflake copy, for a specific governance role. So HR and the credit fraud group shouldn't get two different copies of data just because they need a different view of it. Take this replicated data set as your quote-unquote analytic data set to point people to, to reduce load on the production database, and point your governance tool, your privacy tool, at that database to provide those customized views live against that data set. So I think that fits squarely in the use case of no copies of data for the purposes of governance. How does IT security come into the picture? We deal with security a lot, and I think they play a central role. What we're seeing across organizations, and I'd love to hear this from folks on the webinar too, maybe they could chime in in the chat, is that more and more, the quote-unquote problem of both privacy and compliance is starting to fall into the bucket of security, in the CISO organization. Governance hasn't completely landed there; it often still stays in the CDO, the chief data office. But who owns the risk, a thought we saw above, is now often starting to land in the CISO office. So the security folks need to be a central part of any, we think, wholehearted and holistic governance approach. It's got to be a good marriage between data offices, analytic offices, security offices, and then obviously the people who need the data. So we think they have a central role; that's a long way of saying they're very crucial. Indeed. So, you know, I'm wondering if we can slip in just one more in less than a minute here: what is the necessary first step I should take care of out of the three domains, privacy, security, or compliance? That is a very, very good question. And I'm not sure.
I mean, you can tell me if this is a cop-out answer, but I think at the end of the day they are all the same. They're all about understanding and better controlling your data. So I think they're really all about control. So if the question is, you know, I'm looking at this landscape where I have so much I need to comply with, where do I start? I think the good way to start is just to go back to those five questions we listed, and to look at and understand the process for how my organization collects and gives access to data, and to make sure that data is thoroughly controlled at each step. So really, I think that's the end goal for all of them. Yeah, I think my answer is going to be similar. I think they all actually inform the others. The requirements of privacy can come out of security and out of the regulations for compliance you fall underneath, and vice versa. I think that runs all the way through: you can't completely address security if you're not also addressing privacy and compliance. Well, that does bring us to the top of the hour. Andrew and Matt, thank you so much for this fantastic presentation, and thanks to all of our attendees for being so engaged and for all the great questions. And again, a reminder: I will send a follow-up email to all registrants by end of day Thursday with links to the slides and to the recording of the session. Thank you, everybody. Thanks to Amuda for sponsoring, and I hope everybody has a great day. Again, Andrew and Matt, thank you so much. Thanks for having us.