Hello and welcome. My name is Shannon Kemp, and I'm the Chief Digital Manager at DATAVERSITY. We would like to thank you for joining today's webinar, Emerging Data Quality Trends for Governing and Analyzing Big Data, sponsored today by Syncsort. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. We will be collecting questions via the Q&A in the bottom right-hand corner of your screen. Or, if you'd like to tweet, we encourage you to share highlights or questions on Twitter using hashtag #DATAVERSITY. And if you'd like to chat with us or with each other, we certainly encourage you to do so; just click the chat icon in the bottom middle of your screen for that feature. As always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and additional information requested throughout.

Now let me introduce our speaker for today, Harald Smith. Harald is the Director of Product Management at Syncsort, responsible for the Trillium Software product line, and co-author of Patterns of Information Management, published by IBM Press. Harald has spent the past 20 years specializing in information quality, integration, and governance products, with a focus on accelerating customer value and delivering innovative solutions. He has written extensively on the integration, management, and use of information, and has been issued four patents in the field of data management and integration. And with that, I will give the floor to Harald to get today's webinar started.

Hello and welcome. Thanks, Shannon. And welcome, everyone. Good afternoon, good morning, good evening, wherever you are located. Thanks for joining me today. What I want to talk about is some emerging trends in data quality. Data quality is obviously not new; we've been in the field for many years working on these particular challenges, and a lot of things remain constant over that time. But what I want to do today is talk a little bit about some of the ongoing data challenges, look at four distinct emerging trends that we've seen developing over the last couple of years, and cover some of the approaches to addressing ongoing data quality needs. Obviously, a session like this is not going to be completely comprehensive; there's a whole field out there examining these challenges around data quality. But we do want to highlight some of the things you really need to be cognizant of as we deal with the ongoing changes in the data landscape.

I think a lot of us have talked over the years about why data quality is so important, particularly today, when we see so many different applications and analytical tools being applied. Business is really trying to drive new insight based on what it knows about customers, products, and operations. The data we've collected is driving some fundamental changes in the industry. And this is not something that just emerged this week or this year; it has been developing over the last several years, even over the last decade. But we see more and more of this, really an acceleration of the application of data to ongoing problems. And it's something that encompasses the whole range of business.
We see it in certain types of applications focused on things like marketing, market segmentation, and the single customer view. But it really extends to everything: into operations, how we're handling finance, how we're looking at overall management and business strategy, even managing aspects of legal compliance. So there's an ongoing demand for data in all of these areas, and it affects every one of our organizations. These aspects of governance around the data and data quality really become top of mind as we look at these core challenges. Part of that is the classic three Vs of big data, volume, variety, and velocity, continuing to grow and accelerate. We see ever more analysis. We're constantly looking at the ways data science is exploring information, trying to drive new insights, with tools that allow more and more dissection and segmentation of data. But with all this data and these new tools, there's a dichotomy: expectations of what we can do continue to grow, yet at the same time we see ongoing challenges in trust and confidence in that information. We see more indications of growing challenges in dealing with the quality of that data, and at the same time, more regulations coming into play around the use and application of that information.

Consider just a couple of use cases as we look at how we're trying to apply data. The 360-degree view of the customer: what does that mean? Get to know me, get to know me as a customer of your organization. I want to be able to interact with you effectively. I want an individualized experience. We see this every day when we go into Amazon or other web stores online, engaging with organizations that are selling and delivering services to us. We want that individual experience, and organizations are trying to deliver it, because that's a way to really gain market share and mind share. But doing that requires a lot of data. We have a lot of that data internally; we have our customer master systems pointing to sales data. But increasingly, we're trying to get more third-party data: the things that tell us who you are, your age, your occupation, whether there's information that should be suppressed. Have you changed address? Have you passed away? Is there do-not-call information? And we're also trying to understand how you interact with social media and what we can learn about you through that. So there's a lot of information being collected just about individuals, and this is obviously a driver of a number of core use cases.

If we take a different, more compliance-related use case, we see this in anti-fraud and anti-money laundering, usually around protection of financial assets and compliance regulations: being able to flag alerts in real time, or identify and report on transactions. Again, there's internal data that helps drive some of this core content, but we see more and more mobile data. I've signed in with a new device; I just did that on my phone yesterday with one of my applications. And that is something that is uniquely you. We want to be able to connect these pieces together so we can make sure we're delivering the right experience and maintaining the trust of our customers. There are a lot of data challenges here.
There's a huge volume of information. We have to deal with real-time data challenges: this has to be known now, not a day from now. We have to capture a lot of content we didn't before. What's the geolocation information? Is this time-based? Do I know what the device is, or what browser I'm coming in from? So there's a lot of information to learn about. And as we deal with this big data content, it becomes critical to have high-quality information so we're making the right decisions and having the right interactions with our customers, so that we get the right responses. It's decision-making. It's about: am I talking to the correct person? Do I know who you are? Do I know what you've been doing and how you've been interacting with us, so I can respond most effectively? And can we start to make effective predictions based on machine learning and AI models and algorithms?

But look at the levels of trust: 35% of senior executives have a high level of trust, which means 65% don't. We have a lot of executives in our organizations concerned about the negative impact of data and analytics on corporate reputation. One of the recent statistics I found most interesting was that 80% of AI and machine learning projects are now stalling due to poor data quality. You think about all the things we're trying to do with this data, and data quality is impacting our results. That says we need to look very closely at what's happening with data quality in the industry. How do we respond effectively to these changing dynamics? What's new that we need to know about, so that we're not just focused on the same old pieces and ignoring things that are coming into play?

So what I want to do is look at four key trends. And it's important to state that this doesn't mean our traditional data quality goes away. Those issues remain; we still have to deal with the data quality practices we've established over the years. But there are some additional pieces we now need to consider. First is simply new types of data, which have differing qualities that we need to think about; we'll look at this in more detail. Second, we have some new application considerations, particularly around machine learning and AI: what's going on there that we need to think about that may be distinct? Third, processing at scale and meeting service level agreements. There's always been a need to address time windows, but the volumes of data we're now seeing are something we need to be cognizant of as we look at data quality components. And the fourth trend is the whole aspect of data democratization and data literacy: the resource and knowledge constraints, and maybe even just how we're thinking about data, are something we need to consider from a data quality perspective.

So let's look at new data, new measures. Now, as I said, all of our common data problems still exist. Even as we begin to look at new sources and new types of data, these problems remain.
We still have to think about data formats. Is the data consistent? Do we have standardization of our information? Are we dealing with free-form fields that have mixed domains of data? How do we parse those out and get that resolved? Are we dealing with the usual issues of spelling, the things that have created a lot of our core data issues? All of these are ongoing challenges. They haven't gone away, and they are the things we've established a lot of core data quality dimensions around. Is it complete? Do we have relevant information populated, and do we have integrity of information? Is the data unique or not? Does it have the right validity: do I have the correct values and ranges, and does it match up to my reference data sources? Is it consistent: do I see consistency over time in the overall content? And did it arrive on time? These are all components we want to keep taking advantage of and stay focused on, and they are still important.

If you think about a piece of information like a call center record, these are things where I can run through that checklist. Is it unique? Yes. Does it have the right integrity? Yes. Is it consistent? Yes. Is it complete? Well, there are some aspects where a file may be complete, or appear complete, but does that really represent the whole set of information I need to look at? So we need to think about how we focus on completeness and validity of information. Is a particular piece of information important? Being able to understand the business requirements is an ongoing focus, and so is understanding what that means in the context of some of these sources of information.

But if we look at something like a social media field, say a JSON file coming off of Twitter: is that complete? Does it have the right integrity? Is it unique? Think about retweets, sharing, reshares. What does that do to the uniqueness? What does it even mean to be valid or consistent in some of these contexts? As we get into these feeds from social media and other third-party sources, we have to begin to examine other aspects, and new data quality problems come into play. How was this even gathered? What's the provenance of this information? Is it something I can rely on? Do I know how it was collected, and for what purpose? That is something we have to think about, because one of the pieces that comes into play here is bias in the data. Has this data been structured in a way that has ramifications for what I'm going to learn downstream in machine learning algorithms, skewing my results in a way I'm not expecting? So we have to look at the fundamental issues of bias. We have to understand the data. If there's no standardized structure or formatting, what do we need to do with that? Or if we're looking at continuously streaming data, what are the ramifications of that? What happens if I have a gap in that stream? How do I understand that, and what is it doing to my data?
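To make those classic checks concrete, here is a minimal sketch in Python of scoring completeness, uniqueness, and validity on a small extract. It assumes pandas is available; the columns, values, and reference domain are hypothetical stand-ins, not anything from the webinar.

```python
import pandas as pd

# Hypothetical call-center extract; columns and values are illustrative only.
df = pd.DataFrame({
    "record_id": [101, 102, 102, 104],
    "email": ["a@example.com", None, "b@example.com", "not-an-email"],
    "state": ["MA", "NY", "NY", "ZZ"],
})

VALID_STATES = {"MA", "NY", "CT"}  # stand-in for a reference data source

# Completeness: share of populated values per column
completeness = df.notna().mean()

# Uniqueness: does the business key repeat?
duplicate_keys = int(df["record_id"].duplicated().sum())

# Validity: values outside the reference domain / failing a format rule
invalid_states = int((~df["state"].isin(VALID_STATES)).sum())
invalid_emails = int((~df["email"].str.contains("@", na=False)).sum())

print(completeness, duplicate_keys, invalid_states, invalid_emails, sep="\n")
```

The same checkmark exercise works on the call center record; it's when you point these rules at a Twitter JSON feed that the answers stop being obvious.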
Do I need to try to establish some level of consistency with other pieces of information, or understand the changes that have happened to that data? Think about our information supply chain: data comes through my web applications, goes into my CRM or ERP systems, is funneled into a data warehouse, and then some of that content is passed on into my data lake, or particular zones in my data lake. What has happened along the way, and what does that mean? Have I done levels of aggregation along the way that I may not be anticipating? We need to understand how that data has been processed along the way.

This raises questions about whether the set of dimensions we have, or at least how we typically define them, is adequate to address these problems. Now, I'm not going to say these are all brand-new measures, but they are important aspects of the dimensions we look at, or extensions of those dimensions, in ways that are going to matter to us.

One of those fundamental pieces is provenance. Where did this data come from? Who gathered it? What criteria were used to create it? That's important, because if I don't have confidence in how and when this information was gathered, I can't rely on it. We get offered third-party mailing lists: do you want this list with all of these contacts? Well, when did you create it? How do I know it's even valid for what I need to do? We need to be asking those questions, understanding the answers, and applying them.

Does it have the right coverage? Is it the relevant geography for us to be thinking about? Does it have particular bias in how it's been gathered? Do we have all the points of data? Think about sensor data, mobile phone data, weather records. Do we get a continuous feed, or are there gaps? If I have a gap, what does that mean? If I don't have this sensor data today, is that good, bad, or indifferent? Do I just work around it? What does that do if I'm applying analysis based on this content? And is it generating the right data? Think about things that may be injected into your content. A simple example: if I have temperature data from Boston and temperature data from New York, I'd expect the temperature in Hartford to be somewhere fairly close to those readings. But what if it's not? Does that mean the sensor is out? Did I pull from the wrong piece of information? Is there something wrong with the sensor? There are a lot of different questions, but it raises questions in our minds about the value of that piece of data.

And then, as we look at transformations and different aggregations, what do they do to our data? Can we determine what has happened to that content, and whether we're potentially getting repetition or duplication of that information? So there are a lot of different pieces to consider. But as we look at these things, you can begin to ask different questions about something like the social media feed. What's the provenance? Okay, Jane Doe pulled all the items for a particular time period based on, say, hashtag BlackBerry.
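As a worked illustration of those two checks, gaps in a feed and implausible injected values, here is a small sketch. The sensors, timestamps, readings, and the 10-degree threshold are all invented for the example.

```python
import pandas as pd

# Hypothetical hourly temperature feeds; timestamps and values are made up.
ts = pd.date_range("2019-06-01", periods=6, freq="h").delete(3)  # one hour lost
readings = pd.DataFrame({
    "ts": ts,
    "boston": [61.0, 62.5, 63.0, 64.0, 64.5],
    "hartford": [60.5, 61.5, 90.0, 63.5, 64.0],  # 90.0 looks implausible
})

# Gap check: which expected intervals never arrived?
expected = pd.date_range(readings["ts"].min(), readings["ts"].max(), freq="h")
print("missing:", list(expected.difference(readings["ts"])))

# Plausibility check: flag readings that diverge sharply from a nearby sensor.
delta = (readings["hartford"] - readings["boston"]).abs()
print(readings[delta > 10])  # the 10-degree threshold is an assumption
```

Neither check tells you *why* the 90-degree reading is there, but both surface the question so someone can ask it.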
It could be hashtag #DATAVERSITY today that we're pulling in. Do we have a good sense of how it fits with the other information we're trying to use it with? Any changes? No, it's good; it's just the raw content, and that's often a useful thing to know as we're working with this component. But does it include all the relevant pieces? Are there acronyms, or other ways of saying a particular name? Does DATAVERSITY have another acronym we may need to account for? These are all things we need to be able to apply, assess, and measure as we look at new content.

Okay, so let's talk a little bit about new applications, and machine learning has its own particular aspects here. There's a lot to machine learning that's well beyond what we're focusing on today. But the important point, as we look at machine learning and artificial intelligence and what we can do with these algorithms and models, is that they're still based on data. We take some data in, build models on it, try them out, validate them. What's the quality of the data we're using? Are there aspects of that data we need to be aware of, so that what we're feeding in is quality data? If the model's good and the data's good, we should get a good result. But if we're feeding junk in, what are we going to get in that model, and what will that do to our interpretations downstream?

Obviously, there are a lot of different applications coming out here. Some familiar examples: in marketing, targeted marketing, recommendation engines, next best actions; in risk management, fraud detection and anti-money laundering are certainly good examples, along with know your customer. These applications work with a lot of data we're well familiar with, but we're now applying it to make predictions about what a particular future may look like.

There are a number of data challenges inherent in machine learning itself; these are not all data quality per se. Just being able to find, access, and obtain data that may be useful is one of those challenges, and it's not necessarily data quality per se. But we do begin to look at things like data cleansing, and whether we can do it at the scale we need for machine learning. Entity resolution, customer identification, data matching: these are very standard data quality practices. We have needs for real-time, current data. Just being able to address the streams and volumes of information, and trace some of it, begins to touch on aspects of data quality, even though there are broader data challenges as well. When we look specifically at the data quality challenges, the things we see are incorrect, incomplete, misformatted, or even sparse information. These are things that will impact the data sets you're trying to use for machine learning. We can apply certain things in terms of correcting and standardizing that information, but there's a challenge there as well.
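One common piece of that correcting and standardizing step is mapping variant spellings and acronyms onto a canonical form, which is also how you'd catch the "does this organization have another acronym" problem mentioned above. A minimal sketch follows; the synonym table and the names in it are purely illustrative.

```python
import re

# Illustrative synonym table; these variants and canonical forms are made up.
CANONICAL = {
    "intl business machines": "IBM",
    "ibm": "IBM",
    "dataversity": "DATAVERSITY",
}

def standardize(raw: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace, then map known
    variants of a name or acronym onto a single canonical form."""
    key = re.sub(r"[^a-z0-9 ]", "", raw.lower())
    key = re.sub(r"\s+", " ", key).strip()
    return CANONICAL.get(key, raw.strip())

print(standardize("  Int'l Business Machines "))  # -> IBM
print(standardize("Unknown Corp"))                # passes through unchanged
```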
We can help to build some of that signal, if you will, by resolving some of these data quality challenges. At the same time, we have to be very aware: are we introducing our own assumptions, assertions, and biases when we do so? And does that then mask certain patterns we might otherwise see?

Then there are aspects of missing context. We look at new data sources and pull new sets of information in, but do we get a complete population? Do we have other pieces of information that may be important to inform us about populations we're not interacting with, but could potentially interact with? One thing we don't want to do is be so focused on a population of data that is limited to our existing customer base that we miss important things going on around it.

There are aspects of multiple copies as well. We have a lot of information coming in from a lot of different sources, including multiple third parties. Who pulled what? When do we have multiple copies that may be impacting it? Do we have duplication, or partial duplication, of information that may now be affecting what we feed into our models without our being aware of it? Being able to remove some of that duplication is going to be important, but maybe even more critical is whether we can resolve this into single entities that we can make real, valid decisions around.

The other thing we have to be cognizant of is correlations that may simply exist within our data. This is one of the reasons it's so important to apply data profiling capabilities and dependency analysis as we look at these data sources. It's very easy to pull in a file of contact information that has my name, address, and other components. But remember that a file which was perfectly high-quality information in its original context may have state codes, zip codes, and other pieces of information that all basically correlate to the same point: that we're from a particular geographic area. We may need to pull that out, so the models don't base correlations on things we already know, instead of giving us the analytical insights we don't know and really want to get.

Two examples of missing segments of population. Hurricane Sandy: lots of tweets, a lot of information to correlate together. But the tweets weren't coming from the hardest-hit areas, because there were power outages and diminishing cell phone batteries. And there weren't very many Spanish-language tweets, even though a lot of Spanish speakers were affected. So there were missing segments of the population. Similarly, consider the Boston potholes project, drawing on accelerometer and GPS data to help passively identify potholes. Groups that didn't have smartphones weren't being recorded and incorporated. The information collected will tell us a lot about where people who had smartphones were going and driving and where their issues were, but not necessarily about the broader issues. So missing segments are going to be important, and we need to acknowledge where we have population or data gaps. Similarly, we have to be very aware of noise coming into the process, or inserted content.
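Before turning to noise, here is a small sketch of that geographic-correlation point: a quick functional-dependency check of the kind a profiling tool would run, showing that zip code already determines state in a hypothetical extract, so one of the two columns is redundant signal for a model. The data and column names are invented.

```python
import pandas as pd

# Hypothetical contact extract: zip and state encode the same geography.
df = pd.DataFrame({
    "zip":   ["02101", "02102", "10001", "10002"],
    "state": ["MA", "MA", "NY", "NY"],
    "spend": [120.0, 80.0, 200.0, 150.0],
})

# Quick dependency check: does zip fully determine state? If every zip maps
# to exactly one state, the two fields carry redundant geographic signal and
# one can be dropped before modeling. (Profiling tools do this more robustly.)
fd_holds = df.groupby("zip")["state"].nunique().max() == 1
if fd_holds:
    features = df.drop(columns=["state"])
print("zip -> state holds:", fd_holds)
```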
Bot tweets are probably the best example of this. There are a lot of things producing spurious information, whether it's about elections, about other topics, or even things being fed in about particular products and people having issues with a product. There's a lot of potential noise here, and how we filter it out is a data quality problem. We have to be able to determine what's valid and what's invalid if we want to make effective use of the data.

And then we have simple bias aspects, which we see particularly in certain types of unstructured data content analysis. The "black sheep" problem is the one that caught my attention, because from an English-language perspective, I have lots of connotations about what a black sheep is and what black sheep means, and we bring that into a lot of different sources of information. The problem is that this creates correlations and connotations that don't necessarily reflect anything about actual sheep populations or the actual proportions in the content. So bias can work in a lot of different ways; this is just a very simple example, and there's a lot of good information out there about how we need to think about bias in our data.

Data quality at scale. We've been dealing with this for a while, but it's a continuing, ongoing problem. In our recent data quality survey, two of the top three items cited as real barriers to ensuring high data quality were many sources of data and the volume of data. I don't think this is surprising. At the Enterprise Data World conference earlier this year, Michael Stonebraker was talking about the 800-pound gorilla in the room, and by that he meant the variety of data and the number of systems that have the same, related, or correlated data. We have to deal with a lot of information, and address it at a level that's meaningful.

So when we think about all of these data quality dimensions and the challenges with these data sets, we have to think about how we handle data volumes and distributed data. We're putting more and more data out, not just into Hadoop, but now onto the various distributed cloud platforms, whether it's AWS or Azure or Google Cloud or others. This has impacted how we look at and work with data quality. If we think about profiling data, we have to deal with very high volumes. We potentially have to think about how we might approach streaming data, which is an area we don't typically profile. We're not in a position to get meaningful content if we look at individual records one at a time from a data stream, so we may have to gather samples or time segments out of a stream and begin to make extrapolations from those. We need to be thinking about either those high volumes or the streaming content as we look at how we understand the data and its quality issues.
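One standard way to get a representative slice of a stream you can't profile record by record is reservoir sampling, which keeps a uniform random sample of fixed size from a stream of unknown length. A minimal sketch, with an invented stream:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k records from a stream of unknown
    length, so downstream profiling sees a representative slice."""
    sample = []
    for i, record in enumerate(stream):
        if i < k:
            sample.append(record)
        else:
            # Each record survives with probability k / (i + 1)
            j = random.randint(0, i)
            if j < k:
                sample[j] = record
    return sample

# Usage sketch: profile a bounded sample instead of one record at a time.
sample = reservoir_sample(range(1_000_000), k=100)
print(len(sample), min(sample), max(sample))
```

Time-windowed segments work the same way in spirit: you bound what you profile, then extrapolate, while staying honest about what the sample can and cannot tell you.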
As we standardize and enrich data content, or match entities, we're not just dealing with master data now, but master data correlated with a lot of transactional data, mobile data, and pieces coming in with partial bits of information, whether it's a device number, an IP address, or the browser associated with it. How do we match that? How do we do that at scale, often in real time, and do it in a way that allows us to meet service level agreements at the same time? We have to think about the time windows in which we're approaching data quality requirements. And this is going to require us to deal with continually changing platforms. Think about Cloudera's acquisition of Hortonworks: that potentially changes platforms, or people who moved onto distributed platforms are now moving onto cloud platforms. Can we apply the same data quality processes? Can we get our tools out into those environments and platforms in a way that allows us to be effective and valuable?

So there are important pieces here as we look at handling these distributed volumes. We have to be able to address data quality functions consistently, at scale, no matter where the actual processing is taking place, how the data is segmented, or what the volume is. The demands are constant: how do we parse that information? Do we need to standardize it? Does it have to be validated as we go through these processes? Those are all key parts of our data quality process. There may be more elements we have to look at now, in terms of segments of population too, but these are still functions that have to be applied, and applied at scale.

We have to look at this in terms of data enrichment as well. Where are we getting our sources? How are we connecting that across our distributed platforms? And when are we applying enrichment or lookups? Is it while we're going through our operational systems, or is it downstream, where we've pulled together some number of systems of record from our different siloed business applications and now need to aggregate them? Who has the right reference information? Have we even standardized to that point? Those are going to be considerations as we look at the overall application architecture.

There are also additional considerations for things like profiling, as well as joining, sorting, and matching information. If you think about data profiling, we have to be able to find outliers; we need to look, in effect, for the needles in the haystack, to say where the particular issues are. Looking at samples of information is useful for certain things, like modeling what the incoming data looks like. But as we really look to understand the data quality problems more deeply, we have to look at the full volume. We have to find these outliers, understand how they got there, and address those challenges. That means we have to be able to apply our profiling across clusters of information, aggregate across them, aggregate the frequency distributions, and provide access to the pieces we want to drill down into in a time-effective manner.
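The reason profiling can run where the data lives is that frequency distributions merge cleanly: each partition counts its own values, and only the aggregates move across the network. A minimal sketch of that aggregation step, with invented per-partition counts:

```python
from collections import Counter

# Hypothetical per-partition value counts, e.g. computed on each cluster node.
partition_counts = [
    Counter({"MA": 10_000, "NY": 8_000, "XX": 3}),
    Counter({"NY": 12_000, "CT": 5_000, "XX": 1}),
]

# Frequency distributions merge by simple addition, so profiling can run
# in place and only the counts are shipped back for a global view.
total = Counter()
for c in partition_counts:
    total += c

print(total.most_common())                        # global distribution
print([v for v, n in total.items() if n < 10])    # low-frequency outliers
```

The rare "XX" values are exactly the needles in the haystack: invisible in any single sample, but obvious once the full-volume counts are rolled up.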
Similarly, from the standpoint of entity resolution, we have to distinguish matches at scale. There's a lot of information we want to look at, which means a lot of cross-comparisons, and we have more components to compare than before. It's not enough to pull together things based on this segment of name content, this segment of address content, maybe some phone information, or identifiers like social security numbers, tax IDs, and national IDs. Now I have device IDs. Now I have IP addresses. Where have I collected those? How have I linked them into my master data or my single customer view? The number of types of puzzle pieces is growing, and we need to compare not just once across a couple of dimensions, but along this dimension, and now this other dimension, and another. So there's a growing number of comparisons to do, which is going to necessitate handling very high volumes of comparison work.

As an example, going back to one of the core use cases: we see a lot of global banks dealing with anti-money laundering, and doing it on a distributed platform. This is critical; almost every one of you is going to be dealing with fraud, credit cards, the same type of thing as anti-money laundering, in the transactions going through a particular bank. We have to be able to leverage machine learning at scale to understand new and emerging patterns, and that requires large volumes of current, clean information. That information has to be accessible and able to be fed into the algorithms, which means we need to be constantly thinking about how we cleanse, standardize, and match it, and do so in a way that supports the algorithms downstream. So all these pieces tie together as we look at new types of data, machine learning, and handling data at scale. They come together in how we look at data quality.

And then there's that last trend: data literacy and data democratization. What does that mean for us as we look at data quality? From a data literacy standpoint, the data democratization ideal is that every one of us, and all of our colleagues in our organizations, will understand data and some of the basics of data. We'll understand the business context and business language that goes with the data. We'll understand the basics: what is a data structure? What's a data type? What does it mean to be looking at numeric versus alpha versus alphanumeric information, or a range of date formats? And what does that mean for how I look at and think about the data content? I have to think about how I find and access data, and how I use it.

But as we get into aspects of data quality, it's going to involve things like basic statistics. If I'm looking at a frequency distribution of dates and I see that January 1st, every year, has twice or three times the volume of every other date in the year, what does that mean? What can I interpret from it? Odds are, some system is using that as a default date, which is not in line with what we'd expect the typical distribution of a population to be. Some of those basic statistics are going to come into play. We need to understand the data quality dimensions.
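That January 1st example can be turned into a simple automated check: compare each date's frequency against the column's typical level and flag the spikes. A small sketch, with made-up dates and an assumed multiplier:

```python
import pandas as pd

# Hypothetical birth-date column where Jan 1 is overloaded as a default.
dates = pd.to_datetime([
    "1980-01-01", "1980-01-01", "1980-01-01", "1975-03-14",
    "1982-07-09", "1990-01-01", "1990-01-01", "1991-11-30",
])
by_day = pd.Series(dates).dt.strftime("%m-%d").value_counts()

# Flag month-days whose frequency is far above the column's typical level.
threshold = 3 * by_day.median()   # the multiplier is an assumption
suspects = by_day[by_day > threshold]
print(suspects)  # expect 01-01 to surface as a likely default value
```

The statistics here are deliberately basic; the data literacy point is that anyone looking at the frequency distribution should be able to ask, "why is this date so popular?"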
We need to be thinking about what things we need to look at, what questions we need to ask about this information, and what techniques and tools we can apply. This is all part of our data literacy, part of what we need to be able to do. One of the things I really appreciate about DATAVERSITY is that it makes this type of content and learning available to a large group of individuals. But this does run up against resource constraints. There's only so much time in the day to think about this and to understand the data quality challenges, and it keeps changing. That's why it's important to look at these emerging trends, so we get them into our mindset as we consider how to approach this.

And I think that's a good segue into the last pieces I want to look at: some of the approaches we need in place to begin to address these emerging data quality trends. Data literacy really sits at the heart of this; it's an emerging trend, but it's also one of the core approaches we need to have in play. How do we make best practices for data quality available to everyone in our organizations? Anybody who's working with data should be thinking about these challenges, which means best practices need to exist somewhere. There are some basic things we need to think about, what I would term our universal best practices: scoping the right questions, understanding data requirements, thinking about bias, and getting the business context right. We know that "customer" may be one thing to this part of the organization and something else to that part; we have to understand those different contexts. How do we address and resolve, or decide whether to resolve, those data quality issues and apply the right data governance processes? And then we have to think about how we solve these challenges at scale.

Breaking that down a little further: communication is really central to the culture of data literacy. If we're not talking about these things, discussing them, helping people understand how to ask questions about the data, and getting them trained to understand and use that data, then we're not going to be effective. We'll still be plagued by issues coming into the data that we're not anticipating. So we have to be trained to understand and use the data, and also to understand how to approach and evaluate data quality, whether that's traditional data, what's in our spreadsheets, or what we need for our machine learning algorithms. And again, we need to understand the business context that comes with the data sets we're presented with.

It's critical to have programs of data governance as part of this. These are the foundational processes and practices necessary for success, the ones that allow us to measure, monitor, and improve data quality where it's needed. It's a continuous iteration; none of this is one and done. This is a cultural matter. And that's one of the things I think we've seen repeatedly at a lot of our core data quality and data governance conferences over the years: the importance of programs of data governance, the importance of communication, and the culture of data literacy.
And having a place, basically a center of excellence or a knowledge base, where you can go to find answers: these are central to establishing an effective data quality program. We know some of the core challenges: common terminology, organizational barriers, isolated or unknown work. These all impact our ability to be effective and really drive a communication program. So we need that common place and a common language, and a way to gain broader buy-in. How do we get everybody in the organization to really understand the impact the data has? This is where it's so important to provide up-front examples: things I can get out of a data profiling process, or out of applying business rules, where I can get actual measurements and say, is this an issue? How do I raise it up and understand what the value or cost of that issue is?

Scope is important. As we look at data quality, going beyond communication to programs of data quality: how do I understand the business objective and the problem? We have to ask the right questions about the data, and we have to empower users to ask those questions. If I'm looking at a machine learning application, I'm going to have different questions than I would for a customer-facing web application or an operational application. I need to think about all the pieces that come into play, and where the segments of data are that I need to bring into that picture. So I need to ask: do we have the data required? Do I understand its characteristics? This is a constant process with any application, any business initiative that leverages data. We have to figure out what's fit for our purpose. Have we evaluated the data? What answers can we bring out of it? Ultimately, we're driving to understand the critical data elements we need for these particular pieces. We shouldn't simply ask for "the best data set"; we can't know what the best data set is if we haven't asked these questions, understood the context, and understood the data elements. Once we have that context in place, we can begin to say: what do we expect? What measurements do we need to apply? How do we put that into our overall governance process so we can deploy these and apply measurements on an ongoing basis?

And that leads into the step of quantifying everything we do. There are a lot of hidden activities, a lot of things we spend our time on that aren't necessarily effective, and a lot of things we don't have a transparent view into. The more things we can bring into view, the more transparency we gain: through baseline measurements, through understanding what the measurements mean, why they're important, and what the business value or business process is, and through monitoring and reporting on that information in a way that's collaborative and keeps the communication going. These are going to be important, and we're going to have to do this repeatedly. None of these approaches is new; these are things we've been trained to do from a data quality perspective for years. But there are more people we need to communicate these practices to if we really want a true data-literate culture. And this ties into leveraging tools. There's a communication aspect.
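As a sketch of what a baseline measurement with monitoring might look like in practice: compute a metric, compare it to an agreed baseline, and emit a report that can feed a dashboard or an alert. The metric, tolerance, and record layout are all assumptions made for illustration.

```python
from datetime import datetime, timezone
import json

def completeness(records, field):
    """Share of records where the field is populated."""
    populated = sum(1 for r in records if r.get(field) not in (None, ""))
    return populated / len(records) if records else 0.0

def check_against_baseline(records, field, baseline, tolerance=0.05):
    """Compare today's score to the agreed baseline and flag drift."""
    score = completeness(records, field)
    status = "OK" if score >= baseline - tolerance else "ALERT"
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "field": field,
        "score": round(score, 3),
        "baseline": baseline,
        "status": status,
    }

batch = [{"email": "a@example.com"}, {"email": ""}, {"email": "b@example.com"}]
print(json.dumps(check_against_baseline(batch, "email", baseline=0.95)))
```

The measurement itself is trivial; the value comes from recording it every run, so the trend is visible and the conversation about cost and value has numbers behind it.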
How do we get out to everybody? How do we put processes in place? How do we get people educated? And how do we put tools in their hands? It's going to be important to look at the challenges of scale. It's going to be a challenge to apply consistent processes, and this is, I think, one of those things we often overlook from a data quality perspective. It's one thing to say, okay, I understand the data issues here, here, and here. But if I'm not standardizing things in a consistent way, or applying consistent approaches for resolving and matching entities, and I'm doing one thing here and another there, I end up with data sources or data sets that have different levels of information and different aggregations. I'm still not going to be able to build trust. I need consistency in what we apply to the data, so I have confidence that the right measures and the right corrective actions are in place throughout the overall process, and that I get a solid, trustable result at the end.

So these are things we're going to have to look at: deploying effectively over time, and thinking about how to do that in a way that allows people who are not technical experts to get routines, processes, and rules in place without having to think about the technical environment they're deploying into. Being able to simplify the process: design once, deploy anywhere. We think about this from a profiling standpoint all the time. I have a profiling process, I deploy it, I connect the data, I run it, I get results. I'm not thinking about what's happening in between; it's a consistent process, because we've built and established it as a standard routine. Other things we're often building by hand, and we need to establish the ability to deploy those pieces just as consistently, whether into test or production, from on-premise to cloud, or from one cloud to another, so that we're valuing the data quality skills, the data knowledge, and the data literacy, and not trying to resolve technical issues all the time.

Fundamentally, data quality is still data quality. Whether at scale, whether with new data sources or old ones, whether we're trying to establish a practice of data literacy in our organization, data quality is still data quality, and it's something we have to embed as an ongoing process in our practice. So thanks for the time; let me turn it over for any Q&A we may have at this point.

Harald, thank you so much for this fantastic presentation. Just to answer the most commonly asked questions here: there's high demand for the slides and the recording. As a reminder, I will send out links to both in the follow-up email, which will go out to all registrants by end of day Monday. So diving into the questions, the hot topic, Harald, and always the question that comes up: who should be involved in and responsible for data quality, making sure it's accurate and not stale?

Well, if we really want to say that we have democratized data, then we have to say everybody. I think this is the goal of a data-literate culture. We have to be saying: everybody is responsible for data.
Yes, we obviously establish data owners, and we have subject matter experts who can provide insight. But anybody who's working with data, with business initiatives and business processes, has a responsibility to look at and understand what they're working with. Some of that may just be at the level of knowing the business context and the process, but also being able to spot: this does not look right. How do I ask the question? How do I raise this as an issue, and who do I raise it to? This is part of bringing a data-literate culture into place. Now, this is not an easy task. This is what we see chief data officers or data governance councils charged with, but it can't be done alone or in isolation. It comes from a top-down acceptance that data is really a valuable asset. I think that's one of the things Doug Laney is often preaching: data is an asset in your organization, and you need to think about it from that standpoint right from executive management. We are going to value data, we're going to work that through, and we're going to put things in place that help individuals become literate about the data they're working with, and feel an ownership and a responsibility for helping to get it right at the points where they touch that data.

How does one socialize data quality for everyone? Well, it's been an ongoing challenge. I've seen this over the years at our various data governance and data quality conferences, and we've had a lot of good sessions at these events talking about it. I don't think there's a one-size-fits-all answer. This is going to be very much an organizational process, but it's going to require a number of capabilities that really are fundamental data governance processes and practices. How do I establish a center of excellence or a knowledge base? One approach people have taken is through business glossaries and data catalogs, so I can say: here's a common language. Everybody can quickly reference a term and ask, what is "customer"? Okay, this is what it means, but be aware that in this part of the organization it has this other understanding. So business glossaries, policy management systems, some of the things we see in data governance tools give us a way to collect and centralize that focus so everybody can collaborate and contribute. I think that's going to be a core starting point in a lot of these journeys.

Or sometimes it's going to be starting small, in one particular unit: I know I have a problem here, and I know it's impacting our business in this way. How do I measure that? Does it resonate with execs and data owners in the lines of business and higher management, so we can ask how to expand on it and put things in place that will help us drive more revenue, reduce costs, or reduce risk? It's always going to come back to those business equations, which are the key drivers in moving this forward.

Key words there, certainly.
And so: will data validation apply to data as it is ingested and processed in the data lake, or is the data profiled after it is ingested? And to follow up on that, how would data be remediated, whether data issues are found up front with validation rules, or afterward through profiling?

It's a very good question. One of the things we have to keep in mind is that data is not static at particular points in time. We push information through what you might term a supply chain; these things come in through all sorts of different applications. I think it's very important to have validation pieces right up front, in our web applications and the like, where we collect data. At the same time, we have to be aware of the user experience. If I go onto somebody's website to order something, I'll need to enter my shipping information so that whatever I'm ordering gets to me. But if I'm ordering pizza, I don't want to go through a long process, and I don't want to be hassled; if you hassle me, I'm going to go somewhere else. So there are things we have to take into account up front in our applications, and things we need validation rules around right there. Those will ripple through the rest of our processing.

I worked with an organization quite some time ago dealing with logistics and operational supplies. It was critical for the goods they provide to get delivered on time, when needed, ASAP. That's mission critical; it has to happen. There were validation rules in place for those pieces, and as long as things got out the door, that was good enough. Moving downstream into analytics, though, there were things happening on the operational side that didn't necessarily work well from an analytical standpoint. What do I do at that point? That's where you have to put in additional checks downstream. That may be in your data warehouse; it may be in your data lake. Maybe it's at ingest, so that certain types of things coming in are weeded out. Maybe that applies to certain types of third-party data: if it doesn't meet the level of quality in the service level agreement I expect for this data, it gets stopped at the point of ingest. Other things coming from the data warehouse into the data lake have already gone through a level of checks; I don't need additional validation when moving them over, but I may need certain other things in terms of profiling, just to make sure I'm up to date with the particular rules I need to connect data together.

So it's going to vary: where you are in the supply chain; what's prioritized in terms of operations, analytical needs, and machine learning applications downstream; and the type of content, whether it's coming from core operational systems or from third-party systems and other sources that may have a lower level of trust and confidence. Some of that may indicate doing checks on ingest; other times, bring it in and evaluate it, maybe with more of a sandbox approach. Different models for different places. But that's also part of what we have to apply in that data literacy context: understanding that I may have different rules and requirements at different points in time.

Well, Harald, thank you so much for this fantastic presentation today.
Always a pleasure doing the webinar with you, and thanks for sponsoring today. But that is all the time we have. Again, just a reminder: I will send a follow-up email to all registrants by end of day Monday with links to the slides and the recording. I appreciate all the great questions and the engagement from the community; I love the networking going on throughout the presentation. There's some really cool stuff happening. Thanks, everybody, and I hope you all have a great day. Harald, thanks so much.

Right, thanks, everybody. Thanks, all.