And here we go. Hello and welcome. My name is Shannon Kempe. I'm the Chief Digital Officer of DATAVERSITY. We'd like to thank you for joining this month's webinar, Straight Talk to Demystify Data Lineage. It's part of the monthly webinar series sponsored by Idera. Just a couple of points to get us started, due to the large number of people that attend these sessions. You will be muted during the webinar. For questions, we will be collecting them via the Q&A panel in the bottom right-hand corner of your screen, or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using the hashtag #DATAVERSITY. And as always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and additional information requested throughout the webinar. Now let me introduce to you our speaker for today, David Loshin. David is the President of Knowledge Integrity, Incorporated, a recognized thought leader and expert consultant in the areas of data management and business intelligence. David is a prolific author regarding business intelligence best practices, as the author of numerous books and papers on data management, including Big Data Analytics: From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL, and Graph, and The Practitioner's Guide to Data Quality Improvement. His best-selling book, Master Data Management, has been endorsed by many data management industry thought leaders. David is a frequent invited speaker at conferences, webinars, and sponsored websites and channels. He is also a program director for the Master of Information Management degree at the University of Maryland College of Information Studies. And with that, I will turn it over to David to get today's webinar started.

Hello and welcome. Thank you, Shannon. Can you hear me okay, Shannon? Yeah, you sound great. Okay, great, just wanted to make sure. So, thank you very much for that introduction. I want to thank DATAVERSITY for giving me the chance to chat today, and to thank Idera for helping me with the topic and giving me the opportunity to talk about data lineage. I'm actually a big fan of data lineage. It's something I was talking about 20 years ago, even though I'm not sure there was a name for it then. So it's nice to see that the concept has taken hold and that there is actually some naming and some agreement about the main concepts of data lineage. Today's talk is straight talk about demystifying data lineage. So let's get started. The first question is: where does data lineage fit into the organization? I think part of the answer is that it serves as a structural edifice that supports data governance from the enterprise perspective. But we should really talk about what is motivating data governance at the enterprise level, and there are some key drivers. The first is being able to use data better: improving information utilization, better data quality, and improving interoperability, especially as organizations are transitioning from their conventional on-premises monolithic systems to what is an evolving hybrid on-premises/multi-cloud information enterprise. The more we diversify the structural layout of our information enterprise, the more difficult it is to manage interoperability. So enterprise data governance is coming in to help with that.
Number two is operating more efficiently and effectively: improving technical operationalization and reducing the operational costs associated with the business processes that rely on data and information, especially the costs of addressing issues that crop up downstream, multiple times. It's about getting a better understanding of information use that spans the different processes, technology implementations, and systems within the organization. Being able to streamline design and development with the guidance of an enterprise governance structure helps eliminate the potential for issues to crop up along the way, and essentially allows for an improved mechanism for interoperating across different systems. Number three is improving accountability. In the past, business accountability had largely been divorced from technical accountability, but the more we look at information as the fuel that drives business processes and improves the way organizations run, the more we see the technical aspects of compliance: compliance with data use agreements between different organizations; compliance with regulatory demands, such as protection of private data and protection of sensitive information like intellectual property; and compliance with reporting expectations that are coming from new regulations and laws. And finally, number four, getting better results. Improving business results is largely driven by analytics, but analytics, of course, is not done in a vacuum; it depends on high-quality, usable information. Therefore, instituting governance is a way of establishing the credibility and trustworthiness of the data being fed into our analytics, business intelligence, and reporting processes. So there are a lot of key drivers for information governance, for data governance within the enterprise. And what are the objectives of a data governance program? Essentially, there are four key objectives. There are more objectives than this, but we can focus on these. Number one is understanding and interpreting business data dependencies. In the past, in the ancient times of the computing world, we built transaction processing systems, and the data collected as a byproduct of processing those transactions was there specifically to support the execution of those transactions. It's only within the past decade or so that we've collectively understood the value of the information itself, as opposed to looking at the data as a byproduct of an operational transaction process. So being able to understand the dependencies between the business drivers and the use of information, that's number one. Number two is defining and approving data policies. Of course, we could limit this to, or focus on, the concept of compliance as a source of data policies. But in essence, every business process in the organization is driven by a set of business policies, whether it has to do with general accounting procedures, with observing certain restrictions on sales, or with the development of certain types of marketing campaigns. These are all cast in terms of business policies that have relationships to the way that information is being used. And so we want to be able to use or leverage our data governance framework for defining data policies and approving those data policies.
And ultimately, number three, developing procedures for operationalizing those data policies. It's one thing to be able to define and approve a data policy through the use of a data governance council. However, the policy is meaningless if there's no infrastructure within the organization to enforce compliance with it. So the number three goal or objective of the data governance program is to develop those procedures and ultimately adopt the technologies that are necessary for operationalizing compliance with those data policies. And finally, continuously monitoring compliance. That's not just the operationalization, but being able to institute a means by which you can measure observance of defined data rules and report on compliance with those rules, to identify where the opportunities exist for continued improvement. And that, again, cycles back to understanding and interpreting business data dependencies, et cetera; we do this cyclically and iteratively as we look at different business processes and different externally defined dependencies and directives. That being said, one of the key technologies that seems to have evolved to support data governance, or rather, what I call here powering data governance, is data lineage. Data lineage methods help to develop a map of the enterprise data landscape in a number of different ways: they provide a holistic description of each data asset's sources, the pipelines through which the data asset is produced, the transformations that are applied along the way in those information pipelines, the ways that those data assets are accessed, the controls that have been instituted to monitor and measure compliance, and essentially any other fundamental aspects of information utility, such as data quality characteristics, the data policies associated with an asset, the data elements that are used, et cetera. In fact, data lineage is a kind of agglomerative technology suite associated with understanding how information flows across the enterprise. We're going to look at data lineage in a little more detail, largely in terms of three different aspects. The first is business lineage, the second is technical lineage, and the third is procedural lineage. So we combine three different aspects of corporate metadata. Business lineage is associated with the semantic aspects: tracing the meaning and usage semantics of data. An example of this might be a data element definition or data concept definition. Technical lineage covers the structural aspects of data element concepts and how they're used across the enterprise. And the third aspect is procedural lineage, which traces the journey of data through the different systems and persistence areas and provides an audit trail of all the modifications, changes, and transformations that are applied along the way. So we've got these three different aspects that all combine under the heading of data lineage, and I'm going to talk about each one of these in a little more detail in a second. First, I want to give a little bit of a demonstration of some of the ways that technology supports these different aspects. One example might be a business glossary or data glossary, which provides information about the names of data elements, their specifications, maybe some definitions, and maybe the calculations associated with them.
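To make the three facets concrete before we look at the tools, here is a minimal sketch in Python of how the three kinds of lineage might be modeled for a single data element concept. All of the class and field names here are illustrative assumptions of mine, not the schema of any particular product.

```python
# A minimal sketch of the three lineage facets for one data element concept.
from dataclasses import dataclass, field

@dataclass
class BusinessLineage:              # semantic facet: meaning and usage
    term: str
    definition: str
    source_of_definition: str

@dataclass
class TechnicalLineage:             # structural facet: concrete manifestations
    manifestations: list = field(default_factory=list)   # (system, element name)

@dataclass
class ProceduralLineage:            # flow facet: journey and transformations
    steps: list = field(default_factory=list)            # (stage, transformation)

@dataclass
class DataElementConceptLineage:
    concept: str
    business: BusinessLineage
    technical: TechnicalLineage
    procedural: ProceduralLineage

lineage = DataElementConceptLineage(
    concept="unique customer identifier",
    business=BusinessLineage(
        term="Customer ID",
        definition="The unique identifier assigned to each customer account.",
        source_of_definition="enterprise business glossary",
    ),
    technical=TechnicalLineage(manifestations=[
        ("CRM", "customer_id"), ("Billing", "customer_number"),
    ]),
    procedural=ProceduralLineage(steps=[
        ("ingest", "none"), ("standardize", "cast to string, zero-pad"),
    ]),
)
print(lineage.technical.manifestations)
```

The point of bundling the three facets together is that a question about any one of them (what does this mean, where does it live, how did it get here) can be answered from the same record.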
So I don't know if you can see the glossary example in the corner of the slide; that has to do with the business lineage aspects. On the bottom we have a business process model, which looks at the flow data elements go through to get from their origination points to their final locations. And then there's policy management, which gives some different aspects of where the sources of the necessary specifications are, why those policies are in place, and what needs to be done in order to observe those policies. So those are just some examples of how some of the tools support these different things. Let's start with business lineage. The way that I'm casting this, it's the inventory and the description of the business characteristics of data assets, captured within a data catalog that accumulates information about each of those data assets. Maybe I'll take a step backwards here for a second and talk about what a data asset is. In the past, most of the time, we talked about structured data that lives within an existing data system. An example might be a relational database: there are tables within that relational database, and each one of those would be structured data with some clearly defined characteristics. There are columns associated with the different data elements in that structured data set. However, as time has gone on, we're starting to look at the use of a number of different types of data, not just structured data that sits within a relational database. There are different types of structured data assets, such as CSV or TSV files, character-delimited extracts, or extracts with fixed-length columns. We might have semi-structured data assets, for example XML documents that have tagged sections in a hierarchical structure. Similarly, we can talk about JSON, another semi-structured object notation, and there are other object notations with these hierarchical tagged sections that are not necessarily structured as a two-dimensional matrix like Excel spreadsheets or database tables. And we have unstructured data assets, like documents and images and those types of things. In our environment, one of the main goals that governance seeks is the ability to make these data assets usable for more than one purpose: being able to provide access to a structured table in a comma-separated-value format for analytical purposes as well as for processing purposes. Well, the individuals who want to be able to use that data need to know what that data asset is, what's contained within it, where it is, how to get access to it, and who the owner is. There's all sorts of object metadata associated with that data asset, and a lot of these characteristics are captured within a data catalog. So: the data asset description, what's inside, in fact the name of the data asset itself; a business glossary, what business terms are used within that data asset and where their definitions are; where that data asset is managed, whether inside an on-premises relational database or sitting in an object store in a cloud environment; and what the characterization is of the level of sensitivity of the data within the data asset. Does it contain individual names and social security numbers? Does it contain protected information that has corporate value?
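As a hedged illustration of the kind of object metadata just listed, here is what one catalog entry might look like. The field names, the asset, and the helper function are hypothetical, assumed for this sketch rather than drawn from any catalog product.

```python
# Hypothetical object metadata that a data catalog might keep for one asset.
catalog_entry = {
    "asset_name": "customer_master",
    "description": "Extract of active customer accounts",
    "format": "CSV",                      # structured; could be XML, JSON, etc.
    "location": "s3://corp-data/landing/customer_master.csv",
    "owner": "customer-data-team",
    "glossary_terms": ["customer", "unique customer identifier"],
    "sensitivity": "contains-PII",        # e.g. names, social security numbers
    "access_roles": ["analyst", "steward"],
}

def may_access(entry: dict, role: str) -> bool:
    """Answer the catalog questions: what is it, where is it, who may touch it."""
    return role in entry["access_roles"]

print(may_access(catalog_entry, "analyst"))   # True
print(may_access(catalog_entry, "intern"))    # False
```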
And then there's who has the right to access it, and what the different methods are of ensuring against unauthorized access. So the business lineage contains a lot of the information that would potentially be characterized or collected within a data catalog. The second component is the technical, or what we might call structural, lineage. This lineage catalogs which data element concepts are used and how those concepts are manifested as data elements within specific data assets. An example of this: we have one database that contains a field called customer ID, and another database that contains a field called customer number. It turns out that those two named columns are actually two differently named data elements associated with one data element concept, the unique customer identifier, which has a definition: the unique identifier used to represent each customer's account. But it's manifested in the two different tables using two different names, as two different data elements. So the structural lineage covers what the data elements are, which data element concepts they map to, and what the structural aspects of those data elements are. And it's not limited to static data sets. For example, we have data in motion. An example might be a REST API request, where there's a collection of parameters being sent in a GET request from a browser that's asking for the contents of a selection of data elements from a data resource. Each of those data elements has a name and a value that is presented through that interface. Well, that's data in motion, as is the response to that request, where even though the data elements might be named one certain thing within each data asset, they might be standardized when presented back to the requesting actor with a particular set of named fields in a JSON representation. So it's not limited to static data sets like a database table. It might be the manifestation of a data element concept in a dynamic context, such as the delivery of a report. Take a BI report, for example, where one of the columns is the result of a calculation applied to a number of data elements in a source dataset. That column in the report has a name and a definition, and it refers back to a data element concept. Even though there's no static dataset that contains a manifestation of that data element concept, and it might be a rolled-up value, that rolled-up value still may be defined and have metadata associated with it. So another example of structural lineage is data element concepts that are produced for reports or for feature sets for analysis. The third facet is what people may commonly think of as data lineage, which I'm referring to here as procedural lineage: a tracing of the flow of data from the original source to its different touch points, use points, and persistence points. Here we want to identify the original introduction of data elements, document the process flow for the data elements that are associated with certain types of compliance with data policies, map the data elements used to the business application touch points, and then determine where data instances are created, transformed, updated, or maybe just read or touched, and document the transformations that are applied.
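Going back to the customer ID / customer number example for a moment, here is a minimal sketch, with made-up table and column names, of how a technical lineage mapping from data elements to a shared concept lets two data sets be combined that would otherwise look unrelated.

```python
# Technical lineage: (system, element name) -> data element concept.
concept_map = {
    ("sales_db", "customer_id"):       "unique customer identifier",
    ("support_db", "customer_number"): "unique customer identifier",
}

sales   = [{"customer_id": "C100", "region": "east"}]
support = [{"customer_number": "C100", "tickets": 3}]

def normalize(rows, system):
    """Rename columns to their concept names so the sets can be merged."""
    out = []
    for row in rows:
        out.append({concept_map.get((system, k), k): v for k, v in row.items()})
    return out

# Merge records that share the same concept value across both systems.
merged = {}
for row in normalize(sales, "sales_db") + normalize(support, "support_db"):
    merged.setdefault(row["unique customer identifier"], {}).update(row)
print(merged)
```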
So in this image we've got a mapping of a process flow, with circles around certain areas where perhaps some type of transformation is applied to the data. This is the flow, the dynamic movement structure of how information moves across the organization. And so when we put it all together, we've got these three different aspects; let me go back to my other slide here for a second. We've got our business lineage, which gives us the semantic aspects: where the meaning came from, what the origination point of the specification was. We've got the technical lineage, which tells us how each concept is mapped to a manifestation. And then we've got the procedural lineage, which looks at how information flows across the organization. So there are a number of different benefits of doing this management of data lineage. I'm going to give a quick run-through here and then talk about each one of these: analyzing data dependencies, validating semantic consistency, impact analysis, data quality root cause analysis, integrating data controls, enforcing regulatory compliance, and protecting sensitive data. And this is just a subset of the different benefits that you can get from data lineage. The result here is that ultimately you get better data quality, and better data quality will then lead to better business decisions. And that is where we cycle back to the motivations for data governance. Okay, so let's start with the first of these benefits, analyzing data dependencies. I was glad that, when Shannon introduced me, she made reference to my being a program director for the master's degree program here at the University of Maryland, because one of the things that we're looking at is information risk, and understanding how certain scenarios within an organization will lead to the introduction of risk that people may not be aware of. Unexposed data dependencies introduce risks to ensuring high-quality, usable data, and not just to quality: they matter in other areas too, such as protection against non-compliance, observing auditing demands, and those types of things. For example, reports and dashboards and analyses may appear to be derived from data sets from a number of different isolated systems, but in a number of cases, there may be a chain of processing that originates with data taken from a shared data source. You might find this familiar: you may be pulling data from two different operational systems, assuming that their execution is segregated and that neither depends on the other, but then you find that there's a third system feeding data that is loaded into each of those systems, which creates a dependency that you might not be aware of. Not knowing that that dependency was there may lead you to make certain assumptions about the definitions or the use of the data in one of the data sets that may be mistaken. And if there is this type of dependency, you may not realize that it's impacting or biasing the type of analysis that's being done downstream. Another example is where multiple data sets are populated using data from distinct systems, yet structurally and semantically those sources are identical.
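Here is a minimal sketch of the hidden-shared-source scenario just described: a small, made-up lineage graph in which two reports that appear independent turn out to share an upstream feed, found by intersecting their ancestor sets.

```python
# Exposing a hidden dependency: two "independent" reports share an upstream feed.
upstream = {                        # dataset -> data sets it is loaded from
    "ops_system_a": ["shared_reference_feed"],
    "ops_system_b": ["shared_reference_feed"],
    "report_x": ["ops_system_a"],
    "report_y": ["ops_system_b"],
}

def ancestors(node, graph):
    """Collect every upstream source of a node by walking the lineage graph."""
    seen, stack = set(), [node]
    while stack:
        for parent in graph.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

shared = ancestors("report_x", upstream) & ancestors("report_y", upstream)
print(shared)   # {'shared_reference_feed'}
```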
So I'll go back to the example I used before: in one data set you're using a column called customer number, in another one a data element called customer ID, not realizing that they refer to the same core data element concept, the unique customer identifier. Because you may not know that, you might end up treating the two data sets with the presumption that those two data elements are distinct, and therefore not merge some of those records into a single view. But knowing the lineage will identify that those are actually structurally and semantically the same, and that frees you to apply certain transformations that you might not have thought to do without that knowledge. So that's one of the benefits: identifying and analyzing what these data dependencies are and what their relationship is to the fidelity of the business process that you're executing. The second benefit is this concept of validating semantic consistency, and this is an example that I've used a number of times. It's where two different individuals in an organization start out using a data element concept for two different types of uses. In this case, it's the social security number. In the past, before the concept of identity theft became widespread, organizations would often use a social security number as their unique identifier. I remember using my social security number as a member ID for health insurance. Then there was a mandate, a law, that said you couldn't do that, and so there was this requirement to change identifiers. But in recent times, we've seen the use of a social security number for multiple purposes. In one case, it's an identifier, and the definition is the unique number assigned by the Social Security Administration that's being used as a unique identifier. But another way that a social security number is used, and I see this all the time, is for authentication. In fact, I logged into a website for a bank account of ours two days ago, and it asked me for the last four digits of my social security number, basically for me to authenticate that I am the person who owns the account. So that concept, the social security number, actually has two different meanings: it's being used as an identifier, but it's also being used as an authentication code. In this case, we want to be able to see where, within different business processes, the social security number is being used as an identifier, because of the mandate that says you need to stop doing that. You would then have to determine where in your different systems you would have to replace the use of a social security number with a newly defined customer identifier, a unique number assigned by the company to uniquely identify a customer. And here's the benefit of using lineage: without delving into each of my different systems, I don't know where the social security number concept is being used as an identifier, but lineage will be able to map that. I could go to my lineage environment and say, show me all the places where the social security number is being used as a data element concept, and it might pull out a number of different uses in different applications.
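A minimal sketch of that lineage query and the triage it enables follows; the applications, element names, and roles below are invented for illustration.

```python
# Usages of the "social security number" concept pulled from the lineage,
# tagged by semantic role so only identifier uses get flagged for replacement.
usages = [  # (application, element, role)
    ("membership_db", "ssn", "identifier"),
    ("web_login", "ssn_last4", "authentication"),
    ("claims_system", "member_ssn", "identifier"),
]

to_replace = [(app, elem) for app, elem, role in usages if role == "identifier"]
to_keep    = [(app, elem) for app, elem, role in usages if role == "authentication"]

print("replace with new customer identifier:", to_replace)
print("leave alone (authentication use):", to_keep)
```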
Then I can evaluate each one of them and see: is that data element concept being used as an identifier, in which case I'll tag it as something that needs to be replaced, or is it being used for authentication, in which case I'll just leave it alone? So this benefit of using data lineage is a way of verifying the proper usage of different data element concepts and identifying where there needs to be some kind of replacement or mechanism for change under certain circumstances. We'll talk about that in the next slide, which is impact analysis. That was an example of an external driver or external directive that demanded a change to the information systems within the organization; in that case, it was the mandate that a social security number is not allowed to be used as a unique customer identifier. Now, there are all sorts of such situations, and in fact a lot of them are related to regulatory compliance. A lot of them are even just related to the way business processes are defined and business policies are specified within an organization. An example might be an airline changing the mechanism by which an individual customer can redeem their mileage for airline tickets, or the way that a customer account level is defined in relation to their annual sales volume. Those are business directives that have impact on the specification of how data is being used. They're external drivers. For example, I just got an email from United saying that they're changing their policies so that miles never expire. There used to be a policy that said if you don't have any transactions with the company within a certain time period, then the miles that you earned five years ago expire, and you no longer have access to them. Well, if you change the rule so that miles never expire, you've probably got a bunch of systems in place to expire miles which now need to be changed. Data lineage allows you to do what I call forward dependency tracing: identifying the downstream systems that are impacted by changes, whether to a business term definition, the specification of a data element, an augmentation or modification of data element semantics, or a change in the way a business process flows. In the example I just gave, miles never expire, that's a changed policy in the business process that changed the semantics of what is meant by the lifetime of a data element's value, namely the expiration date. In fact, maybe it means that you have to eliminate references to the expiration date concept altogether. Forward dependency tracing means that you can start at the beginning of any information management or IT process and identify every location within that process where a particular data element concept is being used or where there's a touch point. That narrows the scope of where you need to evaluate whether there's a particular impact associated with the change in the overall set of policies.
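As a sketch of what forward dependency tracing might look like over a lineage graph, here is a small breadth-first walk downstream from a changed element; the systems and edges are invented for the miles example.

```python
# Forward dependency tracing: list every system downstream of a changed element.
from collections import deque

downstream = {
    "expiration_date": ["expiry_batch_job", "member_statement"],
    "expiry_batch_job": ["mileage_ledger"],
    "member_statement": [],
    "mileage_ledger": ["finance_report"],
    "finance_report": [],
}

def impacted(start, graph):
    """Breadth-first walk of the lineage graph from the changed element."""
    seen, queue = set(), deque([start])
    while queue:
        for nxt in graph.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Systems to evaluate for the "miles never expire" policy change:
print(impacted("expiration_date", downstream))
```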
In the past, this was an almost intractable problem. If you don't have data lineage, you basically have to look at all of your systems comprehensively, and you still may not be able to find all the places where there's some kind of specification, because it might be embedded as a relationship between the use of one data element that another data element then depends on. So exposing those dependencies, as we were discussing earlier with analyzing and exposing data dependencies, gives us the ability to do this impact analysis. Now, if you've got the ability to do forward tracing, then you also have the ability to do backwards, or reverse, tracing. Data lineage helps with root cause analysis of data issues, especially when you've got a root cause that may impact multiple downstream business processes, in which case you have what might be perceived as a data error that manifests multiple times, and yet all of those manifestations are related to one origination point somewhere along the line. In this image, we've got a picture of what might be a sequence of processing stages and the way data elements flow from one processing stage to another, and the blue ones are the ones where there's a particular use or definition of a data element's value that appears to be a problem at the end of the process. Data lineage allows you to map the information production flow across your organization. If you find a data issue that has been identified at one particular processing point, you can use the lineage to reverse-trace backwards through the data production flow to find the different points where that data element is touched, where it's updated, and where transformations are applied. Then, by tracing through that and looking at the values of the data element at each one of those processing stages, you can identify the point where an error was introduced into that sequence of processing. So root cause analysis helps you because, instead of having multiple individuals trying to figure out the cause of an error that manifests itself maybe three or four times in different consuming applications, you can trace it backwards and find that there actually is one root cause that trickled down to all those three or four different usage points. Instead of having four people trying to figure this out, you can narrow that down to just one person doing the analysis and the determination of what the original root cause was, and providing some guidance as to what could be done to address, mitigate, or eliminate that root cause, as long as we've got a mapping of our data flow. So now we're looking at both forward tracing for impact analysis and reverse tracing for root cause analysis. We can also start to look at it from a holistic perspective: if we have potential, quote unquote, problem spots or key phases in the information flow, we can find the locations that are opportune places for instituting data controls. A data control is an application of a validation of some kind of policy or rule that will not only determine or report whether that rule is being observed, but can also identify which rule is being violated and where in the process flow that rule is being violated.
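Here is a minimal sketch of the reverse trace just described, assuming the lineage has recorded the element's value at each stage; walking the recorded production flow and checking each stage's value pinpoints where the error entered. Stage names and values are invented.

```python
# Reverse-trace root cause analysis over recorded stage values.
stages = [  # ordered production flow with the element's value at each stage
    ("ingest",      "1024.50"),
    ("standardize", "1024.50"),
    ("convert_fx",  "-873212.00"),   # (a transformation bug lives here)
    ("aggregate",   "-873212.00"),
    ("report",      "-873212.00"),
]

def first_bad_stage(stages, is_valid):
    """Find the stage where the value first fails validation, i.e. the
    point a reverse trace from the report would converge on."""
    previous = None
    for name, value in stages:
        if not is_valid(value):
            return name, previous
        previous = name
    return None, previous

bad, last_good = first_bad_stage(stages, lambda v: float(v) >= 0)
print(f"error introduced at {bad!r}; last good stage was {last_good!r}")
```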
So instead of what we were doing before in root cause analysis, where we found a problem and reacted to it by trying to find where it happened, with the institution of data controls what we're saying is: we want to prevent bad data from propagating through our information streams. If we can institute a collection of validation steps as controls within our key processing phases, then we can identify where there's an issue, when it occurs, and proactively either address it by eliminating the root cause or alert a data steward when an invalid data value is identified. We can perhaps even modulate the sequence of phases to prevent there being negative business impact associated with what is now known to be invalid data propagating through the organization. So data controls allow you to validate data that flows through selected processing phases, and will generate alerts, capture statistics, or forward the measurements to a data quality monitoring data mart, which can provide real-time reporting on the status and quality of the data flowing through the environment. This allows you to automate some of the remediation steps. Let's say you have a process where you're able to identify that there's an error in the data being used by that process. You might be able to have a notification that pushes data back to the prior stages, tracking it to the point where the introduction of the bad data can be automatically fixed. An example there might be processing a credit card. It's a very simple example: processing a credit card payment and finding, at one downstream point, that the name presented on the credit card doesn't match what the credit card processing company believes the cardholder's name is. Instead of failing at that point, you reroute the request back to the origination point of that transaction and say: we've noticed that there's a discrepancy in the data that you've provided to us; can you please provide the correct name? That's a relatively simple example, because a lot of this is actually already handled. But consider a more complex environment where you're collecting information from multiple sources to ensure proper completion of the business process. For example, ordering supplies, where your organization is only acting as a front to a bunch of drop shippers who are providing you with catalog information and inventory information, so you're actually absorbing information from maybe five or six different vendors to be able to fulfill orders. Well, there are a lot of dependencies there, and being able to ensure compliance with certain data controls will streamline the ability to execute those transactions without a break in the process. So I've eased into this concept of data policies and enforcing data policies. As a relatively broad statement, data policies are formulated to reflect externally imposed or internally imposed data compliance requirements, and we can use the different types of lineage to support the enforcement of a data policy. The business lineage can be used to capture external policy definitions and standardize semantics across different applications' uses of shared data element concepts.
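To make the idea of a control concrete, here is a hedged sketch of a validation step like the name check above, inserted at a processing phase so that only valid records propagate and a steward is alerted otherwise. The rule, records, and alert channel are invented.

```python
# A data control: a validation step placed at a key processing phase.
def control(rule, rule_name, alert):
    def check(records):
        passed, failed = [], []
        for r in records:
            (passed if rule(r) else failed).append(r)
        if failed:
            alert(f"control '{rule_name}' violated by {len(failed)} record(s)")
        return passed                      # only valid data propagates downstream
    return check

name_matches_card = control(
    rule=lambda r: r["name"] == r["card_holder"],
    rule_name="payment name matches card holder",
    alert=lambda msg: print("STEWARD ALERT:", msg),
)

clean = name_matches_card([
    {"name": "A. Smith", "card_holder": "A. Smith"},
    {"name": "A. Smith", "card_holder": "B. Jones"},   # rerouted, not propagated
])
print(clean)
```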
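And as a sketch of that last point, standardizing semantics across applications, here is what a single governed definition of a shared calculation might look like. The concept, the formula, and the registry shape are all assumptions for illustration, not a standard from any regulation.

```python
# One approved definition of a derived concept, shared by every system,
# instead of five local reimplementations that may quietly diverge.
APPROVED_CALCULATIONS = {
    "future value": {
        "definition": "present value grown at a periodic rate over n periods",
        "compute": lambda pv, rate, periods: pv * (1 + rate) ** periods,
    },
}

def calculate(concept, **kwargs):
    """Every system calls the shared, governed calculation."""
    return APPROVED_CALCULATIONS[concept]["compute"](**kwargs)

# Two different systems now agree by construction:
print(calculate("future value", pv=1000.0, rate=0.05, periods=10))
```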
I'll point to some of the industry standards or regulations that exist, say in the insurance industry or the financial services industry, where you've got a specification of a business concept such as the net value or future value of some asset, but there's a calculation associated with that concept. You might find that you've got five different systems, each of which is trying to calculate that concept. Having a business lineage that standardizes the definition and the specification of how that calculation is done will ensure standard and consistent results when you take the values that come out of two different systems, knowing that those policies are applied to both of those systems. Technical lineage provides standardized specifications for validation and the institution of audit controls for demonstrating compliance. So if we look back again at our ability to trace the flow, the ability to enforce data policies combines the specifications that are part of the business lineage with the technical lineage that looks at how information is flowing. Examples include things like GDPR, the General Data Protection Regulation, or the California Consumer Privacy Act, where there are detailed specifications of what constitutes personal information. And there may be different specifications of what 'personal' means between GDPR and CCPA; in fact, there are. Being able to document what the definitions are lets you differentiate between a validation of compliance with the European GDPR versus the enforcement of compliance with the California CCPA, and those may mean two different things in two different contexts, in two different locations, et cetera. There may be a whole bunch of different scenarios and situational characteristics that can be embedded within the information flow and the technical lineage. Same thing with 21 CFR Part 11 (I finger-flubbed that one), or the HIPAA Privacy Rule, or the specifications of what data elements are subject to protection, as well as the implications of a violation: what's the business process associated with breach notification, for example? So enforcing data policies blends the use of the different types of lineage. I started talking about some of those regulations that are associated with privacy, and I'm one of these people who believes that privacy is just a subset of the whole concept of sensitive information, which covers the different classifications of what makes information sensitive within different scenarios, different contexts, and different types of rules. So I've broadened the requirement here to protection of any type of sensitive data, and business lineage allows you to trace the origin and levels of different types of data sensitivity. The example that I just gave is one instance: a data element value that is a protected value according to the GDPR might not be subject to protection under CCPA, for a reason that has nothing to do with the data element itself but has to do with the flow through which that data element travels.
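Here is a minimal sketch of that flow-dependent protection, assuming the business lineage supplies a sensitivity classification and the procedural lineage supplies the jurisdiction of the flow. The classification labels, jurisdiction rules, and values are invented, not the actual legal criteria.

```python
# Jurisdiction-aware masking driven by lineage-supplied classifications.
def present(value, classification, jurisdiction, consumer_clearance):
    """Mask the value when the classification + flow jurisdiction demand it."""
    protected = (
        (classification == "gdpr-personal" and jurisdiction == "EU") or
        (classification == "ccpa-personal" and jurisdiction == "CA")
    )
    if protected and consumer_clearance != "authorized":
        return "****" + value[-2:]          # masked presentation
    return value

print(present("555-12-9876", "ccpa-personal", "CA", "analyst"))   # masked
print(present("555-12-9876", "ccpa-personal", "NY", "analyst"))   # in the clear
```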
The specific example there is that CCPA applies to protection of information associated with California residents for transactions that took place within the state of California, which means that not only do you have to know who the individual is, but you also have to know where the individual is. That can be captured within a data lineage mapping, a technical mapping. So if you couple business lineage with procedural lineage for the insertion of data protection techniques, you can institute certain types of data protection, such as encryption at rest or encryption in motion, depending on what the criteria are; or data masking, if you are or are not allowed to present certain values, depending on the access rights of the consumer; or access controls altogether, which allow determining who is able to touch what, at what times, and under what circumstances. This blending of the business lineage and procedural lineage, as well as some of the technical lineage, gives you the ability to institute protection of sensitive data. So here are some general questions that data lineage can answer. First, understanding the landscape of organizational data: what data sets are important, where are those data sets, and are there replicas of a data set in different places? Where did the data set come from? How is it being used, and in what different ways? What's the chain of custody? What business rules are applied? Then, from the governance perspective: how do you identify private information? What's your retention policy for keeping information? How do you classify different data entities within a master data environment, especially if you are running more of a registry, where you're keeping track of an index of your unique entities but not actually merging those entities into a single representation? On the quality of the data: is it fit for purpose, and what are the purposes? Under what circumstances are different data quality rules applied? What things have changed, why have they changed, and what are the implications? Some more general questions. And in conclusion, here are some things to think about. Generally, data lineage augments a corporate toolkit for deploying data governance, and if you want to institute data lineage as part of your data governance framework, you have to look for products that simplify the data stewards' consumption of those data lineage mappings. Such a product is going to have certain characteristics: number one, the ability to enable users to see how data flows through the production cycle; number two, a mechanism for enumerating the data sources for different data pipelines; number three, the ability to identify data elements and link them to data models and, excuse me, to metadata for the different data element concepts within a business glossary; number four, a method for documenting the transformations that are applied to data element values, or to collections of data elements, at different processing stages, which allows professionals to look at those transformations across the different data pipelines; and number five, the ability to interoperate with existing ETL tools or data integration tools, to give you a broader perspective of the different data pipelines and data production flows and their collective transformations, as well as ways to collaborate around data pipelines, especially if you want to be able to share them, or share parts of them, and how the metadata is related to the use of those data pipelines.
And finally, the ability to display a visual presentation that lets data stewards review the procedural data lineage and drill down into each one of those steps, to understand what is actually happening at each step along the way and how you can use that to implement the different types of procedures that I talked about, the ones that provide you with those benefits. So, that being said, I think we've got some time for questions, and I saw that there were a bunch of questions that came through.

Yeah, we've got a bunch of questions coming in, and just to answer the most commonly asked one: a reminder that I will send a follow-up email by end of day Thursday for this webinar, with links to the slides and links to the recording. So diving in here, David: how can the data lineage concept be applied in survey and census data collection and management?

That's actually a pretty interesting question, and it's funny, because I was just talking the other day with somebody who spent a number of years working at the US Census Bureau, doing survey and census data collection, management, and usability. Clearly, there are a number of different ways that data are being collected, and there are many different processes that need to be in place to do the data preparation for making that data usable. The example that he gave me had to do essentially with identity resolution: making sure that you can connect information about a single individual when you're getting the data from different sources, because there are different representations and different ways that those data sets are collected. Sometimes it's manual, sometimes it's collected on forms, sometimes it could be done electronically. Instituting data lineage and being able to trace how the different data elements in different data sets are collected actually provides you with a lot of information, because it will give you the qualitative characteristics associated with the types of rules that you would apply to do your record linkage across systems. Beyond that, the answer is the same as the way you would apply data lineage in any kind of environment: it basically revolves around those three different aspects, being able to capture the specifications and the semantics, the structural information, and then documenting the process flow and the pipelines and the transformations that are applied to data at each one of those pipeline stages. It then gives you the opportunity to define policies associated with data preparation, so that downstream, when you're actually doing, say, your consolidation for reporting, you've got a level of trust in the data, especially by integrating the qualitative characteristics associated with the source, or with the processes that move the data from its originating source to its use. So I don't know if that answered the question, but I hope so.

Great answer. So, where do you start when your data sources are legacy systems with complex business rules that nobody remembers anymore?

That's an interesting question.
I think part of that has to do with better practices around what might be called data awareness. Obviously, those data sets are assets, or what I would call data assets, but there's absent knowledge about those assets, because they weren't considered to be assets when the systems were first developed. So there are a lot of tools that do things like data intelligence: profiling the data to get characteristic statistical information, doing some inferencing about the contents of the different data elements, and working to understand the structure. A lot of the older systems had hierarchical structures that went out of favor when relational, SQL-based databases came into play. So it's about being able to infer what the structures are, and a lot of tools are using machine learning to gain experience from looking at different types of data sets for which there is some knowledge, and to apply that knowledge to data discovery and data intelligence.

David, do you keep track of the calculation or transformation used as part of the lineage documentation and structural lineage, especially in BI instances?

Let's see; I'm quickly looking to see if I can read that again. Actually, it depends on the tool or the product that you're using, but conceptually, the best thing to do would be to capture as much information as you have about those transformations, especially in this emerging world where, downstream, citizen data analysts have access to end-user tools that can apply transformations to data sets in their raw form. Often these data analysts are not experts in data management, but they are well versed in understanding what they want to apply to the data. In a lot of these cases, they might apply a sequence of transformations to do their data preparation and want to be able to share that sequence with other analysts, so that those analysts can either expand on the way that analysis is done or take it in a little bit of a different direction, while maintaining some level of consistency. You'll hear the terms data pipelines or data flows, and the more we are able to capture about the transformations and the preparation steps being applied, the better consistency we're going to get in terms of the fidelity of the results and the expectation of the quality of the results. And having better quality results is going to lead to better business decisions.

Sure. And I think we've got time for a couple more questions here, David. One question is: please define, in a non-technical way, maybe using an analogy, the difference between a data catalog and a data dictionary. Is a catalog a searchable list of data sets, versus a dictionary, which would define the elements within each set? Is this an accurate non-technical interpretation?

So let's see. In a simple view, a data dictionary is a list of data elements and the specifications associated with each data element.
So I may have a table with 10 different columns, and a data dictionary will list the names of those columns, or data attributes, then maybe a definition of each column attribute, and then maybe a structural specification: its data type and its length and those types of things that would go into a data dictionary. A data catalog, by contrast, characterizes the data asset inventory. I have one data table that contains customer information and another data table that contains product information. The data catalog will link off to, or provide access to, a data dictionary associated with each of those data assets, but it will also contain what I would call object metadata, or data asset metadata, that describes or characterizes the data object itself. So it would say: here's the table, and the table is a customer table; it was produced by an extract of the data from this transaction system; the owner of this table is Shannon; it was last modified on September 16, 2019; it contains 45 million records; the last time it was updated was a certain date; the time the extract was performed was such-and-such; et cetera. So the data catalog is an inventory of your assets that provides you visibility into your landscape, while a data dictionary lists the data elements within a data asset. The other thing to say about the catalog is that it will provide other qualitative information, such as: this data asset contains the following types of sensitive information; this data asset must comply with GDPR; to be able to view the columns in the data asset, you need this type of authentication and this type of access control; et cetera. So, excuse me, that's a kind of high-level difference between those two things.

And I like owning tables. It makes me very happy. So that does bring us to the top of the hour here. David, thank you so much for joining us today. We're so excited to have had you with us this month for the Idera webinar, and thanks, Idera, for sponsoring, but I'm afraid that is all the time we have for today. Just a reminder, I will send a follow-up email by end of day Thursday with links to the slides and links to the recording to all registrants. Again, David, thanks so much. Thanks to our attendees for being so engaged and asking such very good questions. I hope you all have a great day.

Thanks, everybody. Yeah, thanks so much.