Hello and welcome. My name is Mark Horseman, data evangelist with DataVersity. Thank you for joining the latest in the monthly webinar series, Data Architecture Strategies with Donna Burbank. Today Donna will discuss data quality best practices, sponsored by Monte Carlo. A couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A panel. And if you would like to chat with us or with each other, we certainly encourage you to do so. Just to note, chat defaults to sending to just the panelists, but you may absolutely change that to network with everybody. To open chat and the Q&A panel, you will find the icons in the bottom middle of your screen. As always, we will send a follow-up email within two business days containing links to the slides and recording of this session and any additional information requested throughout the webinar. Now let me turn it over to our friends at Monte Carlo and Ryan Kearns for a brief word from our sponsor. Ryan, hello and welcome.

Hi, Mark. How's it going? Thanks so much for the introduction. Hello, everyone. As Mark said, my name is Ryan. I'm excited to get us kicked off with a little bit of a primer before I hand it off to our main participants today. We'll be talking about data observability, enabling proactive data quality, and I'll give a high-level overview of how we think of these concepts before we dive into some detail. Again, my name is Ryan Kearns. I'm one of the founding data scientists here at Monte Carlo. I've worked on the anomaly detection algorithms from the beginning of the company, so I've been quite involved in watching this process being formulated in practice in an actual industrial setting. I'm going to talk for 10 minutes or so to give a breakdown of the main high-level points from our point of view, and then I'll be present for the host and panelist conversation and able to respond to Q&A. So if you have any follow-up questions, I'm very happy to take those and continue the discussion async after that.

So what is data quality, and particularly, what is data downtime? I'm a data person, and I find myself using this Slack emote all the time. If you're involved in developing the data platform at a company scaling on the order of what we're experiencing here at Monte Carlo, you may be feeling like this guy quite a lot. We're going to put a name to this phenomenon, which we call data downtime. Our vision as a company is to eradicate data downtime in the modern data stack. By data downtime, I'm referring to periods of time when data in your system is partial, erroneous, missing, or otherwise inaccurate. This is happening all the time across modern data organizations right now. It's something we've grown accustomed to constantly fixing. We spend a lot of our time dealing with bad data, and it really starts to feel like your workspace resembles this meme. To put some stats to this problem: the business impact of poor data quality and data downtime is actually quite severe. From market analysis and customer research we've done, we estimate around 70 high-severity events per year for every 1,000 tables in a data system.
What that can mean for a data team, for the practitioners here, is up to half of your time spent fire drilling or doing triage work, basically being reactive to open, extant issues in the data system. And that has a real cost, a real business cost. So if you want to justify an investment in data quality and in the reduction of data downtime, you're looking at, by our estimates, up to a quarter of annual revenue lost for a lot of data-driven organizations as a result of poor data quality. It may feel like you're just spending a lot of time firefighting data pipelines, and that might be the nature of your job, but there's a real, significant cost to this from the data leadership, high-level perspective.

So why is that happening? What we've realized is that a lot of data quality incidents, at least in the modern data stack, are detected and resolved reactively, retroactively. You might be familiar with application-based observability; there are a lot of companies doing quite established, professional work in this area, whether that's New Relic or Datadog or what have you: writing unit tests and doing proactive, code-based detection measures for application downtime, or site reliability engineering. In the data space, you can get a little bit of that with ad hoc manual testing, but the vast majority of the data downtime you're experiencing will often surface at the downstream consumer level, sometimes days or weeks after the incident actually occurred. So you might be looking at a dashboard that you're about to present to an executive, looking at the stats in a breakdown, and saying: this is supposed to be a probability, but these numbers aren't adding up to 100. What's going on there? Why do I have an extra column that says null? Why does it look like I've got twice the volume of data I'm expecting? Has something been duplicated? Am I copying this over twice? Experiencing that reactively, downstream, is where that cost comes from. It's often quite difficult to anticipate and prevent data downtime from the outset.

Fortunately, having spent a lot of time talking to customers and understanding this problem in the industry, we've found that data downtime tends to follow a pretty similar pattern across organizations. This is to say: you're not alone. If you're experiencing these types of issues with your data pipeline, generally we've found that you can bucket data downtime problems into canonical categories, and you might find yourself asking the same questions again and again when you're firefighting data pipelines. Is this data up to date? Do I have the most recent refresh of this model? Does the size look off? Am I looking at something like an exploding join? Should I have 100x the volume of this table that I'm seeing right now? Why has my streaming pipeline turned to a trickle all of a sudden? Why am I receiving no new records? You might be looking at the distributional level: is this particular value way too high? Can I think of some canonical validation? As I mentioned earlier, if a number is supposed to represent a probability, it's probably supposed to be bounded between zero and one. So maybe we want to be thinking in structured ways about data downtime of that form.
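To make those canonical questions concrete, here is a minimal sketch of what such checks might look like in code, assuming pandas. The table shape, column names (updated_at, conversion_prob, customer_id), and thresholds are all hypothetical; a real observability tool would learn baselines from metadata history rather than hard-coding them like this.

```python
# A minimal sketch of the "canonical questions" as code. All names and
# thresholds here are illustrative assumptions, not a real system.
from datetime import timedelta
import pandas as pd

def check_table(df: pd.DataFrame) -> list[str]:
    issues = []

    # Freshness: is this data up to date?
    latest = pd.to_datetime(df["updated_at"], utc=True).max()
    if latest < pd.Timestamp.now(tz="UTC") - timedelta(hours=24):
        issues.append(f"stale: latest record at {latest}")

    # Volume: does the size look off (exploding join, pipeline gone to a trickle)?
    if not (90_000 <= len(df) <= 110_000):  # made-up expected daily range
        issues.append(f"volume anomaly: {len(df):,} rows")

    # Distribution: a probability should be bounded between zero and one.
    bad = df[(df["conversion_prob"] < 0) | (df["conversion_prob"] > 1)]
    if len(bad):
        issues.append(f"{len(bad)} rows with out-of-range probabilities")

    # Distribution: null rate on a column that should always be populated.
    null_rate = df["customer_id"].isna().mean()
    if null_rate > 0.01:
        issues.append(f"null rate {null_rate:.1%} on customer_id")

    return issues
```

The fixed thresholds above are exactly what anomaly detection would replace: you track these same metrics over time and alert on deviations from a learned baseline instead.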
Then there are the common problems with data systems: null values cropping up everywhere, duplicate values in a column that's supposed to be unique and formatted like an ID. And you might also be thinking more holistically about the larger data platform. If I change the schema up here, if I change the type of this column, is it going to break a bunch of downstream applications? How do I even know who's using this model downstream of me? Can I delete this deprecated column, or is there some load-bearing application somewhere downstream in the pipeline relying on it?

We've done some work to systematize these types of questions, and we've arrived at a design philosophy for what we call data observability, which is an end-to-end, holistic approach to data quality. We've broken our main approaches down into five pillars: freshness, volume, distribution, schema, and lineage. Freshness, again, looks at the extent to which data is up to date in a pipeline. Volume is the size of your data, another metadata metric. Distribution is a data quality measure that drills into the row level, the specific information in a table, looking at things like the rate of null values or the rate of distinct values and so on. Schema is the high-level metadata overview: what columns do I have, how are they typed, and are they typed the same upstream and downstream, etc. And lineage, finally, is the holistic picture of your upstream and downstream data sources. That can even extend beyond just your cloud data warehouse: to the business users at the very end of your pipeline, people on Tableau or Looker, or to ingestion, where maybe you have Glue set up so you're reading unstructured data from Parquet files or from S3 and turning it into structured transforms. You want lineage to give you the provenance of your data as it comes in unstructured and flows through your whole analytical pipeline.

So that's the high-level, I guess, philosophical overview; that was a super blitz round, and I'm going to hand it over in just a second here. But one last word from us at Monte Carlo: we've got this resource if you're interested in reading more, and I think we're going to put the link in chat if you want a further breakdown. Again, I'm just here for a tiny bit before I pass it on, but I'll be in the Q&A and in the chat, so feel free to reach out as we're going through, and I'll be back at the very end to tackle any questions as we dive into the details. Yeah, thank you, Mark, I see that's been put out in the chat for everybody. I'm going to stop sharing my screen and hand it back to you. Thanks for lending me your attention for 10 minutes.

Thank you so much for the wonderful start to the day, Ryan. That was a wonderful slide deck. Now let me introduce the speakers of this monthly series, Donna Burbank and Nigel Turner. Donna Burbank is a recognized industry expert in information management with over 20 years of experience helping organizations enrich their business opportunities through data and information. She is currently the managing director of Global Data Strategy, where she assists organizations around the globe in driving value from their data.
She has worked with dozens of Fortune 500 companies worldwide in the Americas, Europe, Asia, and Africa, and speaks regularly at industry conferences. She has co-authored several books on data management and is a regular contributor to industry publications. Nigel Turner has worked in information management and related areas for over 25 years. This experience has embraced data governance, information technology, data quality, master data management, and business intelligence. He is currently principal consultant for the EMEA region at Global Data Strategy Limited. He is a well-known thought leader in information management and has chaired, run tutorials and seminars, and presented at many international conferences. He has also designed and run data governance training events for government, universities, and other organizations. Nigel is very active in professional data management organizations and is an elected committee member of DAMA in the UK. He was the joint winner of DAMA International's 2015 Community Award for the work he initiated and led in setting up a mentoring scheme in the UK, where experienced DAMA professionals coach and support newer data management professionals. He is based in Cardiff, Wales. And with that, let me give the floor to Donna and Nigel and their presentation.

Hello and welcome. Thank you so much. Always a pleasure to join these, and if this is your first time joining as an attendee, welcome. This is a monthly series that we've been doing for the past few years now, and one of the most common questions is: will this be recorded, and can we get the slides? Yes to both. DataVersity is very good at keeping recordings of all of these on their website, and we post links on our website as well, so if anything in the past presentations looks of interest to you, you can always catch it. And I hope you'll join us for some of the upcoming webinars later in the year; next month is metadata management, which is always near and dear to my heart. But today's main topic is data quality, and this has become a yearly event where we invite my partner in crime, Nigel Turner, who has rich experience in data quality, to join us for this session. I hope you enjoy the mix of our experience.

In what we're going to talk about today, we very much agree with Ryan from Monte Carlo that data quality should be proactive, that it has a definite impact on the business itself, and that it's more than just tactical firefighting. And data quality issues by their nature span an organization and beyond; you don't always have direct control over the quality of some of the sources you're using. What we like to do in these webinars, and hopefully we succeed, is keep it really practical and actionable. Nigel and I do this in the real world; full disclosure, we run a consultancy that does this day to day. So what we're here to do is share some of our scars and learnings with you, to hopefully make it a little easier as you go through. These are based on real-world experiences. That said, we do want to tie this into a larger vision and a larger strategy, and also into some of the industry best practices that Nigel will talk about. So this is our enterprise data strategy framework, and you'll see there in the box that data quality is an important component of that.
It's also related to a lot of these other components, which is why we make this a framework: you can't do data quality in a vacuum. Data quality without data governance is not going to be successful; it's a people, process, and culture thing. Data quality without a proper data architecture isn't valuable. As Ryan was talking about, this idea of data lineage and metadata management is a part of data quality, right? We could spend all day on all of these touchpoints, and we won't, but we will talk about how you do need to look holistically at data quality. It isn't enough to say, hey, there are a bunch of nulls, let's fill them in. What are you going to do about it? So we as a firm always look holistically, and we'll hopefully apply that to our best practices today. As we mentioned in our bios, both Nigel and I are, full disclosure, big fans of DAMA International. If you're not familiar with the Data Management Body of Knowledge, the DAMA DMBOK, it's a great resource, and we use a lot of it in our practice as well. So I'll pass it over to Nigel to talk more about how data quality fits into that. Nigel?

Thank you, Donna, and good morning, good afternoon, good evening, everybody, depending on when you're listening. As Donna said, data quality is a key discipline in its own right. But the thing to understand about it as well is that if you want data quality to be improved, you have to consider all the other data disciplines that impact upon it. And equally, if you want the other disciplines to deliver to their optimum capabilities, it's very difficult to do that without the good, solid foundation of data quality. So we regard data quality very much as what we call a foundational discipline; in other words, it's something that underpins and builds the architecture for other data management disciplines. Just to mention a few here. Data security: it's a pretty well-known fact that if your data is of poor quality, your data becomes less secure and more open to fraud; it's much easier for people to come in and create false accounts, for example, if you can't link accounts back to customers. Reference and master data depend heavily, as we know, on good data quality as their base. Data warehousing and BI: the first thing I ever learned in data management is garbage in, garbage out, and it's as true today as it ever was, probably more true than it ever was. So data quality is, yes, a discipline in its own right, but in order to deliver data quality, other disciplines need to be considered as well, particularly architecture and governance, which are huge enablers of data quality today.

And yet, despite its importance, problems with data quality continue. I came across some headlines in the UK recently about issues with military emails; I don't know whether many of you have seen this story. There's The Scream on the right-hand side there, which many of us feel like when we see these stories in the news. Back in July, it transpired that the US military had been sending about 10 million emails to the government of Mali, which seems a bit of a surprise given the government of Mali is closely allied with Russia, and given the current state of the West's relationship with Russia, that seems like a strange thing to do. And then, lo and behold, a month later, stories emerged in the UK that our own MOD, the Ministry of Defence, had been doing the same.
When you start to investigate how so many emails came to be misdirected to what could potentially be regarded as a hostile government, it all comes down to the domain name. The domain name of the US military is .MIL, and unfortunately lots of emails were instead sent to .ML, which is actually the domain of the government of Mali. What was even worse, these emails contained things like sensitive documents, passwords, even the travel plans of senior military personnel such as five-star generals and colonels. That would have been gold for a potential enemy. So I think it goes to prove, again, that what seems to me a fairly easy problem to resolve has not been resolved, and these simple data quality errors can cause major problems, major embarrassments, and could actually endanger people's lives. So these problems continue.

And when you move on to the world of analytics, and Ryan has already alluded to this, still today you'll see that data scientists spend an enormous amount of their time just sorting out basic data quality problems. I sometimes say that data scientists on six-figure salaries could be regarded as the world's most expensive data cleansers. It's also the part of the job they dislike the most, as you can well imagine, because they want to get on and do their clever stuff with data science and analytics; they don't want to mess around cleaning up basic data problems. So even in this area, the costs of failure are very, very high. What data scientists and analysts need are trusted data sets. And building trusted data sets, as Donna and I have already covered, involves the full range of data management disciplines, where, for example, data architecture helps define the data standards that are required to enforce good data quality within an organization. It impacts privacy too: you want your trusted data to be accessible only to the people who should have access to it, and you want it well governed. And of course you want effective metadata, as Donna said, to ensure that people can actually derive meaning, and the correct meaning, from that data. But the question is, how do you go about deriving and developing trusted data sets? So I'll hand back to Donna, who'll talk about the way we do it at Global Data Strategy. Donna?

Great, thanks, Nigel. So you're getting the theme that we're going to keep harping on, because it's true: data quality is multifaceted. What makes it a challenge, and why you have to look at it continually, is that it's a combination of people, process, and technology. Although technology is important, and we'll talk a lot in this session about how you need to manage and monitor and remediate data quality, really it's people putting data into those systems, and it's business rules that define whether this data is right or wrong. Some things a machine can check directly: are there nulls where there shouldn't be? That's more of a technical signal. But are these the right domain values? Are ML and MIL the same thing? Some of that a machine can't pick up on its own; only after people have defined these business-centric business rules can a computer validate against them. So people in the business define what's right and what's wrong.
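The Mali story maps neatly onto that point: once the business states the rule (which recipient domains are legitimate), enforcement is trivial to automate. A hedged sketch, with an allow-list that is purely illustrative, not a real policy:

```python
# The business defines the rule (which domains are legitimate recipients);
# the machine merely enforces it. This allow-list is illustrative only.
ALLOWED_SUFFIXES = (".mil", ".gov")  # hypothetical, business-defined policy

def flag_suspect_recipients(addresses: list[str]) -> list[str]:
    """Return addresses whose domain falls outside the allowed suffixes."""
    suspects = []
    for addr in addresses:
        domain = addr.rsplit("@", 1)[-1].lower()
        if not domain.endswith(ALLOWED_SUFFIXES):
            suspects.append(addr)
    return suspects

# A .ml (Mali) typo fails the rule that .mil passes:
print(flag_suspect_recipients(["general@army.mil", "general@army.ml"]))
# -> ['general@army.ml']
```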
Also business process, and we're not going to go deep into that today, but in our experience, so many of the problems are business process problems. We often start our data quality or master data management engagements, and we won't keep harping on how these things are all related, but they are, with a business process workshop: how is the data input? How is it used? Who's touching it along the way? Your good old-fashioned CRUD matrix. One of our biggest data quality success stories: we ran a workshop, and for vendors and suppliers, the payment terms were off. Are payment terms 30 days or 90 days? If you're trying to forecast your revenue, that's really hard with the wrong payment terms. When we did the process map, what was agreed at contract time wasn't being passed on to sales, so sales was essentially making up payment terms for each contract they were writing. In the workshop, people just said, I didn't know you were already doing that; let's pick one person, through stewardship, and have it entered once. That was an extreme example where just looking at the people and process solved the data quality problem: people were not communicating. And then the technology came in to monitor it over time. We've created that rule; we can now instantiate it through something technical and monitor it. Are we continuing to keep to it, or was it down to the fact that Mary, who made that decision, left, we forgot all about it when Joe came in, and we started doing something different? That's what makes this effective but also challenging: whenever you get more than one thing in the mix, it adds complexity, which is why things like governance tie in as well.

So how does one tackle that? I think we've all converged on this being a problem, or we probably wouldn't be on this webinar. This is our methodology, which we're happy to share with you. We call it, and Nigel actually came up with the term, the A-to-E approach. Why? Because it's a phased approach that, like anything, isn't rocket science, but it's nice to have a framework to hold yourself to these steps, because it's an ongoing process of continual improvement. It's not once and done, right? Nigel and I will both continue to harp on this; we've run into many organizations that say, well, we did a data quality cleanup, and now we've moved on. It's like having a pond with streams feeding water into it: you can clean up the pond, but if the streams coming into it are still polluted, you're going to have a polluted pond again next week. So how do you not only clean it, but create those rules, learn from them, and make sure that's always part of your day-to-day business operations?

The first phase is important and obvious, but often forgotten: assess the business usage. You have a lot of data in your organization. What is the most valuable? What do we need to focus on? Is it customer data, because we need to do a big marketing campaign? Is it employee data, because we don't know the skills we have or who we should be hiring? You really need to prioritize; you can't do everything. And we've seen it all; I've been in the business for a long time. We came into one project where they had outsourced some of their data quality analysis, and the team came back and said: you really have a problem.
And this was just in the last year or so: you have a problem with your fax numbers, half the fax numbers are empty, we need to fix that because the field is half null. And we just looked at that with the business: who uses a fax number anymore? It's null because no one needs it; we should take it out of the analysis altogether. But what about email addresses, or IP addresses? So really understand what you're focusing on. That seems obvious, but it's probably the most important thing. And then, what's going to have the biggest business benefit and ROI? We'll talk a lot about that. Then baseline: where are we today? What are the rules? What's the benefit? Where do we want to head? What's that famous quote — you can't manage what you can't measure. Data quality is no different, and Ryan touched on that as well: you really need to quantify and continually monitor this. What does good look like, and how do we achieve it and continue to optimize?

We'll talk a bit more later about having a data quality dashboard. If you're running your business on certain KPIs and data drives that business, you should also have similar KPIs for your data, right? And those might change, just like your business KPIs. Maybe this year it's all about new customer acquisition, and next year it's all about customer retention. It can be the same for your data quality: right now it's about basic customer addresses, but once we have that locked down, maybe it's about customer credit scores or something else. It may change over time, and often, as part of the governance process, your committees decide which key KPIs you're really focusing on at the moment, so you can improve over time.

And then converging on business-critical areas. Why that's a key part of it is that you don't know what you don't know, and it's iterative. We know we need to look in this area; we do some basic data profiling; you learn; you go back and say, which of these data profiling areas do we need to look at? We're defining business rules both top down and bottom up; sometimes you look at the data and it doesn't match what we thought the business rules were. So all of this is iterative. And then, and I know Nigel will harp on this because we've encountered it a lot, develop the improvements and a plan, continually evaluate them, and look at the ROI over time. I won't steal Nigel's thunder, but he'll talk about this, I'm sure. You don't want a data governance committee that just monitors, and we've seen this: okay, now we have 5,000 customers with null addresses — okay, on to the next topic. What are you going to do about that? Is that important? What's our plan for improvement? Do you develop quantifiable data improvement plans with ownership and stewardship? And then, did that work, and what was the value from it?

We worked with one non-profit, and this was one of the easiest success stories we ever saw. They had done some data quality cleanup on their donor list; as a non-profit, they got a lot of their revenue from donors. They did some address cleanup, both email and physical addresses. And the next year, they found $100,000 in donations that came as a direct result of the addresses they had cleaned up and rectified.
Had they not cleaned up those addresses, they wouldn't have reached those people, so there was a direct correlation to data quality for that $100,000, which to them was a very large sum. You don't always get such a direct one-to-one correlation, but be thinking of that even at the beginning: why are you doing this? Everything should have a why. There are a lot of things you could be doing each day; what's going to have the highest business benefit?

So we'll go through — and Nigel and I will tag-team across each one of these — what we mean by assess business usage, baseline data sources, et cetera. As we move ahead, this first phase, assess, and I can't touch on this enough: what are the business priorities? What are your goals? Which data is going to most affect your biggest critical data issues? Where is that data, and how do we move ahead? One way to do that is to just write it down. We suggest a lot of different ways to do this; this is a super simplistic, almost spreadsheet-type approach. What are the biggest issues you're seeing? Customer data duplication, product data inconsistencies, poor training of the data entry people. It's not always just a technical issue; it could be our people. Do the people know the rules for what to put in, where it does need to be entered by a human? What is a brief description of the problem, and importantly, what is the impact of the problem? Again, if your problem is empty fax numbers, who cares? Sorry for those few people on the planet who still love fax machines; I never liked them, for the record, I find them annoying. But if we don't have emails and we're trying to do a big email marketing campaign, that's going to be a problem, right? And then: who raised it? Who cares about it? Who is going to be affected by it? Who may be directly hands-on in some of the data remediation? These may be your data stewards, your data owners. So document it, track it, and — Nigel will touch on this at the end — go back to it and say: did we solve it? Are these still the right people? What was the benefit of fixing the data duplication? It becomes that continuous cycle as we move on.

When you're looking at the business case, the benefits generally fall into one of these four categories, or some flavor of them. The easiest one to quantify, I think, is decreasing costs. Data quality, as Nigel touched on, carries a cost of failure; it's an inefficiency in the organization. Some of it is just the time, which we've all touched on: your expensive data scientist is cleaning up data, or people are taking three weeks out of the month to do a monthly report because the data isn't of high quality or joined up in the right way — all that manual effort of people trying to clean up the data. That's often the easiest one to quantify: you could automate this process. But I think more important are the inefficient business processes. We don't have customer metrics on product usage — how can we improve our products? If we don't know the addresses of our customers, how do we make sure we have the right global coverage for sales staff? Data is your business.
I think that's what we often lose sight of. I've heard technical people say, oh, the business doesn't want to get involved with data quality. But this is their business, right? If you say to an insurance company, we have all the wrong credit scores for our customers, darn right someone's going to care, because they're making business decisions on the wrong data. And it's so easy — I'm not above falling into this myself — when we start looking at data profiling and the data, to lose sight of the fact that these customers are real-world human beings. This data is wrong about an actual human being, or about an actual organization if you're B2B. Once you put it into that perspective — or your patients, or your students, or your products, or the wrong product that someone bought — all of these affect, in a very visceral way, the day job of your business people. And if you haven't gotten buy-in for data quality, I'm going to be bold here: if you're on the technical side and the business doesn't care about data quality, you're messaging it in the wrong way, because as soon as you have a real example, it lands. You might not care about "data quality" in the abstract, but do you care that the list of products this customer bought is completely wrong, and that you're going to look like a fool when you go talk to that customer about the wrong products? Anything can be related back to their day job, and that's how you get buy-in for this.

Increasing revenue might be a little harder to quantify, but anecdotally, that's absolutely why we're doing this. Can I optimize pricing with the right analytics? Not if I don't have the data feeding them. I want quality data for marketing campaigns, as we've touched on before. And there's a lot of talk in the industry about AI and machine learning; well, if you're doing AI and machine learning on bad data, you're going to get bad results. Garbage in, garbage out, like Nigel said.

A lot of the reason people come to data quality, however, is reducing risk. We don't want to end up in the news like some of the stories Nigel told — or you will be fined. Think of GDPR: if you don't know where your customer data is, or its quality, there are actual implications. Or product traceability: we worked with a food company, and they really needed to know the lineage of things farm to table — where their fish came from, whether there are allergens in the food, et cetera. And you certainly don't want to be the folks in the news for data breaches and things like that. Then there's the category that's sometimes funny: one industry leader has a book on crimes against data. We've all seen it: a letter from a loyalty program that says, dear loyal customer, insert name here, we're glad to have your business. These are the embarrassing things you do by having bad data. I just took a business trip to Latin America, and somehow one of the brands decided I was an 18-year-old Latin American male, and the emails they were sending me were in Spanish and had nothing to do with me, because they had the wrong customer profile, right?
Did that land them in the news? Maybe they just lost a customer. But overall it's about your brand trust, and especially in the age of social media, these quote-unquote funny data mistakes become public when people post them, and that's not a great thing to have. So again, when you're making the case for data quality, or for the benefits of improving it, think of these four categories; you can probably apply most, if not all, of them to your business or organization.

Something we always want to stress, though: in that list of benefits and risks, include the risk of doing nothing. When we talk to customers and ask what's the biggest risk they're facing, they often just say apathy and inertia. I've seen this, and I can understand it, even running my own business. Someone says, we need better customer metrics, and a salesperson says: you're telling me I don't know my customer? I've been working in this industry for 35 years, and you, data person, are going to tell me about my customer? But there's a risk in not looking ahead, in relying on gut feel when things may have changed. Are you looking at the most up-to-date picture? So have some concrete metrics of things that have gone wrong in the past: okay, Mr. or Mrs. Salesperson, you think you know your customer, but the information you're making decisions on is wrong. Really get into that risk of doing nothing, because that's often the bigger risk. Maybe it's worked for you for the past 35 years, but that's generally what a failing company says before some other company comes and eats their lunch: you're not looking ahead, you're just resting on your laurels. So use that kind of language and really look at the case for change: why do we need to do something differently?

A lot of that business benefit and ROI work is a combination of storytelling, anecdotes, and understanding the business, but also some hard metrics, around both ROI and the numbers describing your data quality. Again, you can't manage what you can't measure, and you can't prove what you can't measure. How do you know what good looks like and what better means? So we're big fans of documenting and, if you haven't heard the term, data profiling, which is really getting those metrics we've all been talking about: how many nulls, how many values violate the business rules we've created. Say we're talking about student data: all students should be between six and 19, and if you have folks in the data set who are 45 years old, is that a mistake, or do we really have a 45-year-old in kindergarten, right? So we need to think through a lot of these, both raw technical rules and business-driven rules, and then update that opportunities log: why do we need to change this? I always say, embrace your inner teenager, always asking: why are we doing this? This is so stupid. Why? That's a valid question to ask. There are a lot of things you could be doing — why are we looking at this data? And most likely, if it's the right data, you'll have a laundry list of whys.
Because if we don't understand our customers — and I know I keep harping on customers because that's the classic example, but it could be students, patients, products, constituents, anything your company relies on — then having good information around that is definitely an opportunity; otherwise it's going to be an embarrassing issue, one of those things in the news that you don't want to be.

To document that, and Ryan touched on this as well, there are different dimensions of data quality. These can feel kind of academic, and on that note, I know we've got a list here, the DAMA DMBOK has a list, and a lot of academics like to have their own lists. Generally these are fairly common ones, and what I would say is: don't argue about the list in front of the business. These are just good examples to think through; it's about more than "is it bad or is it good." Firstly — and we've all touched on this — is it there? Do we have empty values where the data is important? Is it consistent? Maybe in one system my name is John Smith, and that's fine, but in another system it's Jonathan Smith, and then there's a J. Smith and a Jerry Smith. Is that the same person, or someone at the same address — John Smith's brother? All that consistency is core to master data, core to a lot of things. Conformity, and Ryan touched on that as well: it's supposed to be an ID, but it has letters in it; is it in the right range; et cetera. Is it unique? Duplicates are a huge one: not only are they a waste of effort, but how do we ensure a value is right when it's duplicated everywhere? Is it up to date? That's a big one: I'm making a decision on this data, and maybe the customer data is right, but it's from 20 years ago — maybe that's not the best data to make decisions on. Is it accurate? Accuracy and reasonableness, I think, are the dimensions that really need the business view. I brought up age: I could have a machine look at a value and say yes, that is a correct date, 1/1/2000. Well, if our student base is all kids in kindergarten, that age is going to be wrong, right? Or even the classic 1/1/1111: it's a valid date format, but is it reasonable? And is it accurate according to our business rules? It could be within the reasonable range — say, January 1st, 2010 — but the customer's actual date is June 5th. How do you know that? Does that need a person? Is it something you can validate against a third party? Should we be tracking it? All of these are a business discussion, and it's nuanced, which is why things like data governance come in.

Take something as simple as an address. Firstly — and I could do a whole webinar on what we even mean by address — is it a mailing address? A physical address? An IP address? A residential address? Here we're talking about a physical address, so again, it's not an email address. That could be as simple as having the right governance in place to tell people which address to put into this field, and hopefully it's more than one field if it's a physical address.
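As an illustration of the kinds of rule checks about to be walked through, here is a rough sketch. The field names and the rules themselves (state required, no PO boxes, two-character ISO country codes, placeholder street numbers) are hypothetical stand-ins for whatever governance actually decides:

```python
# Hedged sketch of business-defined address rules expressed as checks.
# Rules and field names are illustrative; governance sets the real ones.
import re

ISO_COUNTRIES = {"US", "GB", "DE", "MX"}  # tiny stand-in for the full ISO 3166 list

def address_issues(addr: dict) -> list[str]:
    issues = []
    if not addr.get("state"):                        # completeness
        issues.append("missing state: is Billings in Montana?")
    if re.search(r"\bP\.?\s*O\.?\s*Box\b", addr.get("street", ""), re.I):
        issues.append("PO box: a valid address, but against our business rule")
    if addr.get("country") not in ISO_COUNTRIES:     # conformity: 'USA' vs 'US'
        issues.append(f"country '{addr.get('country')}' is not a 2-character ISO code")
    if addr.get("street", "").startswith("9999"):    # reasonableness heuristic
        issues.append("suspicious placeholder street number")
    return issues

print(address_issues({"street": "9999 No Name Street", "city": "Billings",
                      "state": "", "country": "USA"}))
```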
So putting your email address in there is probably not the right thing. Completeness: we have John Smith at 101 Main Something in Billings — is that Montana? We don't know; it's not a complete address and needs to be filled in. Conformity to business rules: it's a PO box, and again, that's correct, it's a valid address, but it's not valid according to our business rules. Is it reasonable? These are sometimes the harder ones to catch — AI can help with this, business data stewards can look at it — and there's accountability across the business process: how did we end up with something like 9999 No Name Street? That looks like a valid address, but should we validate it against the postal code, or with the post office? How do we ensure it's right? Without over-validating, though, and this is an important bit. I actually called 911 once because there was a fire downtown; I was in a city I didn't know, and I said, I'm seeing a fire across the way, and they kept stopping and saying, what's your exact address, ma'am? I'm like, I don't know, I'm at a gas station, but I see a fire. I was just thinking of all the time they were wasting trying to get a valid address while the house was burning down across the street; they could have already sent a car.

So again, going through all of these will prompt a discussion. Maybe the lookup field says USA, and that's not wrong as such, but we're using two-character ISO country codes, so it should be US — are we enforcing the right integrity? Is it accurate? Maybe I did the validation and it's a perfectly valid address, but it's not John Smith's address. Those are sometimes a little trickier: how do I know it's his? Or maybe those are both valid addresses, but they're not consistent across systems. Does John live at two different addresses? Does he have two houses? Was that his old address versus his new one? This can go on and on. Or maybe it was a valid address when he was in school 20 years ago, but it isn't his current address. And is it unique? We've all seen it: two slightly different versions of 101 Main Street — which of those is the typo? What's the valid value? That really gets into things like master data management: how do I get that single version of the truth? So those are just some different examples, but hopefully they prompt you to think about the differences in how we solve each of them. Some can be solved by a computer, some need governance, some need a discussion of what the right rules even are. So it is a bit of an art and a science: yes, validate and document, but you also really need data governance and data stewards. Business people really need to be in those workshops, or meetings, or discussions, or Slack messages — whatever works — for: are these the right rules? How are we monitoring them? Then, importantly, what do we do about it? And how do we track the business benefit? So I'm going to pass it over to Nigel to go into a little more nuance around that. Nigel, over to you.

Okay, thanks, Donna. By this point you've gone through the first two phases of the A-to-E approach: you've assessed the business usage of the data, and you've done baselining of some of the key data sources.
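That baselining step is essentially a profiling pass over the prioritized sources. As a minimal, hypothetical sketch of what it might compute, assuming tabular data in pandas — dedicated profiling tools go much further than this:

```python
# Minimal baselining/profiling sketch: per-column null rates, distinct
# counts, and ranges -- the raw numbers a data quality baseline is built on.
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for col in df.columns:
        s = df[col]
        rows.append({
            "column": col,
            "null_rate": s.isna().mean(),
            "distinct": s.nunique(),
            # min/max only make sense for numeric and datetime columns
            "min": s.min() if s.dtype.kind in "ifM" else None,
            "max": s.max() if s.dtype.kind in "ifM" else None,
        })
    return pd.DataFrame(rows)
```

Rules like "all students should be between six and 19" then become assertions you can re-check against this profile on every refresh, rather than a one-off cleanup.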
The chances are — and in our experience this is usually true — you will uncover many, many data quality problems. Data quality problems in many organizations tend to be the norm rather than the exception. So in order to make sure you're delivering the best business value for the effort you intend to invest in improving data quality, it's very important to be ruthless in terms of convergence. One of the reasons for that is that I believe data quality is very much subject to the Pareto principle, which many of you are probably aware of: the 80-20 rule, which Pareto came up with originally, I think in the 19th century, with regard to pea production, but which applies to lots of other things as well. A good example: in my wardrobe, I wear 20% of my clothes 80% of the time, and 80% of my clothes 20% of the time. Data quality is the same, I think: if you can focus in on that 20% of critical data, sometimes called critical data elements, that is the most important for your organization, then the majority of the benefits will come from improving the quality of that data. It's very important, in other words, at this point to converge on the data that seems most important to you.

And when you're looking for these initial projects to maximize the return on investment, we'd always say look at two areas. One is proofs of concept or pilots, which may be relatively easy to kick off and deliver, and which could then lead to bigger things. You might say, for example, with our address information, let's just look at address completeness as a pilot and see how successful that is; if it works out well, we can then start to look at things like address accuracy and many of the other things Donna talked about. And then, obviously, you're also looking for the bigger projects, which have the largest net benefits for the organization. It's important to stress that those two things — pilots and the biggest benefits — aren't necessarily the same thing, but when you're kicking off data quality, it's very important to demonstrate early how valuable it can be.

So how do you do that prioritization? Well, being simple folks at Global Data Strategy, this is probably our favorite method: we take each of the data quality issues we've listed in our log and assign each of them to one of these four boxes, or at least across the four boxes: high benefit and low difficulty, high benefit and high difficulty, and so on. Your POCs will normally come from the top-left box, high benefit, low difficulty. They could come from low benefit, low difficulty, but the point is to make sure you actually go ahead and do some real data quality improvement quickly. Your biggest-benefit projects are very likely to be in the top-right corner, but they may be more difficult to implement. If you've got duplicated data in different data sources, then quite clearly that's a high-priority job to do; but if it means trying to resolve many different data sources, it could also be quite complex and difficult. And of course, the ones in the bottom right you'll probably never get round to, although you should never get rid of them, because in the future they could become more important than they are now.
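That two-by-two triage is simple enough to mechanize once each item in the issues log carries a benefit and a difficulty score. A toy sketch with invented scores — in practice, the scores come out of the governance discussion, not a formula:

```python
# Toy sketch of the benefit/difficulty two-by-two over the issues log.
# Scores are invented; in practice the governance group assigns them.
issues = [
    {"issue": "customer data duplication",    "benefit": 9, "difficulty": 8},
    {"issue": "product data inconsistencies", "benefit": 8, "difficulty": 7},
    {"issue": "data entry training",          "benefit": 7, "difficulty": 2},
]

def quadrant(item: dict) -> str:
    benefit = "high benefit" if item["benefit"] >= 5 else "low benefit"
    difficulty = "high difficulty" if item["difficulty"] >= 5 else "low difficulty"
    return f"{benefit} / {difficulty}"

for item in issues:
    print(f"{item['issue']}: {quadrant(item)}")
# 'data entry training' lands in the quick-win box: high benefit, low difficulty.
```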
Then, once you've gone through that process, you can maintain your data quality issues and opportunities log, which, I would stress, you can add detail to as you go through baselining; and as you go through the convergence phase, you can start to give each item a priority score. If you look at the three examples Donna mentioned earlier: improving the training of data entry people, because it doesn't involve new technology and probably won't involve a change of process, could be quite quick and easy to do, so it could be both a quick win and a win with lots of benefits; whereas dealing with customer data duplication or product data inconsistencies both carry high benefits but could be of higher difficulty. So when you kick off your program, you might decide to tackle number three first, and then tackle one or two, or both, depending on your resources; and to progress things, you could of course speed them up by doing a POC on one or both of those bigger-benefit projects.

So, having decided which projects to focus on first, the next step is simply to develop the improvements. The first thing you've got to do is think really hard about who should be involved in this improvement work. As Donna said, and as I totally endorse, data quality should be owned by the business, not by IT, and therefore you want to ensure your business stakeholders are well represented on that team — whether that's the people who produce the data, the people who use the data, or, for example, a process owner, where you think process failings are actually causing the data quality issues. Then you want your IT experts on there as well, obviously, because they will understand the more technological implications of some of the things you're trying to deal with. If you're looking at something like customer data, you may also want to ensure somebody like your data protection officer is involved, because you've got to make sure any data quality improvements you make remain consistent with your regulatory and legal requirements. And if you can, align this with your existing data governance structure; if you don't have a data governance structure, this could be a really good way of starting one.

Then you basically go ahead, reanalyze some of the problems in more detail, and develop what we call a data improvement plan for each of the data projects you decide to focus on. Again, this is a very simple concept, but we're sometimes surprised how many organizations don't do it: once you've identified a data quality improvement project, that project should be subject to the disciplines of project management and should be fiercely managed as a project. If you do it that way, and you do it in a consistent way, you can roll these data improvement projects up into what in effect becomes a data quality improvement program, and that then becomes a more strategic initiative that you can take around the business.

Having done the improvements, the last phase is the evaluation phase. As Donna said earlier, A-to-E is a cycle, not a linear process. And this is where what Ryan was talking about in terms of data observability is really important as well.
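Sustaining a gain is the part that lends itself most directly to automation: track the KPI run after run and flag when it regresses. A hedged sketch — the KPI, target, and region names are invented for illustration:

```python
# Hedged sketch of sustaining an improvement: track a KPI (duplicate-account
# rate per region) over successive runs and flag when it starts rising again.
from collections import defaultdict

history = defaultdict(list)  # region -> duplicate rates over successive runs

def record_and_check(region: str, dup_rate: float, target: float = 0.001) -> None:
    history[region].append(dup_rate)
    rates = history[region]
    if dup_rate > target:
        print(f"{region}: duplicate rate {dup_rate:.2%} is above target")
    elif len(rates) >= 2 and rates[-1] > rates[-2]:
        print(f"{region}: duplicate rate rising again -- rerun the A-to-E cycle")

record_and_check("EMEA", 0.0004)
record_and_check("EMEA", 0.0009)  # still under target, but rising: investigate
```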
Because having made the improvements to people, process, and technology, you then need to be sure you are sustaining those improvements, so that you're not constantly cleaning up data sources — which, as Donna mentioned earlier, is the cost of failure. A really good way of doing this, and this is where I'm sure tools like Monte Carlo can help, is, we would always say, to do two things. The first is to make sure, when you're monitoring the improvements you make, that you relate them back to business benefits, as you can see on the next slide. You're not just measuring: okay, we've got two million duplicate records at the moment, and as a business we've decided we need to reduce that very significantly, down to a thousand. The important thing is: how does that benefit the business? Because if you can't demonstrate how reducing duplicate records from two million to a thousand will benefit the business, the business will quite rightly say, well, why are we doing it? So when you come up with your KPIs and measures, think about how you can express them in terms of the actual benefit to the business, not just in terms of the number of improvements to a particular data field or record. I think that's really important. And then, of course, once you've decided those KPIs, you can create something like a data quality dashboard that looks a bit like this — and again, I'm sure what Ryan's been talking about is another good way of doing this — where you can constantly monitor what difference the data improvements you've made are having. You might then suddenly discover — take duplicate accounts by region — from running these data quality dashboards on a regular basis, say every month or every week, that in EMEA the duplicate accounts are rising again. So you pull out your A-to-E methodology, go back in, look again at the business problem, understand why that's happening, and set up another specific initiative to tackle it, until in EMEA you notice the duplicate accounts beginning to drop again. By doing it that way, you're ensuring that every gain you make is sustained, and that's really important in data quality. Back to you, Donna, just to close this down.

Sure, great. There's some great Q&A, and I do want to get to the questions, but you've heard the themes: it is a challenge, and that's why we're all here. You need that foundation and a holistic approach across people, process, and technology, and then a common methodology and approach, so that it's repeatable, with improvements over time. You can't boil the ocean, but you can start with small wins. Just quickly, if you like what you see, please join us next month for metadata management, which is closely aligned with data quality. And with that, a little bit of a blatant plug that we do this for a living, so if you need help, come find us. And with that, I will pass it over to Q&A.

Awesome. A wonderful presentation as always, very much appreciated. And we have a number of questions in the Q&A. Top of the list: trusted data is a commonly used phrase, and the data mesh often talks about trustworthy data. Is it just semantics, or is there something different about trustworthy data?
I have my thoughts, but I'll let Nigel answer that one first, because he's been talking a lot about this. Go ahead.

Yeah, basically, to me the two terms are pretty much synonymous. I think a good thing about data mesh is that it puts primary responsibility for data quality with the data producers, which is where it belongs — in other words, at the source of the problems, rather than where the problems are sometimes felt. Maybe the only difference with trustworthy data is that, because the data mesh relies heavily on the concept of data product ownership, you can almost give it a trust mark, because you know that data is being carefully monitored, carefully managed, and carefully stewarded. But to me, the two terms are pretty much identical.

Yeah, I'd agree with that as well. Ryan, do you have any thoughts on that? Nothing to add right away; I'd say similar. I've heard them used somewhat interchangeably, which is, yeah, anecdotal from my side. Yeah, that makes sense. All right, Mark, what's next?

Yeah — we've had so many successful pilots, this questioner says, but few scale properly. If the objective is to scale, would you adapt the A-to-E process at all?

I'll chime in and then pass it to Nigel, since I know he'll have thoughts on this as well. I think the A-to-E process is designed to scale. How we do all of our projects, whether data quality or otherwise, is that you need to have the big picture — what is the overall data strategy (it's in our name), what is the overall data architecture — and then build towards that in an iterative way. You want to make sure these first quick wins sit within that bigger picture, along with the "so what," what are you going to do about it, and then keep the benefits and ROI in mind, learn from that, and move on. You want to begin with the end in mind, to use a popular phrase, tie it into a bigger framework, then continue with those improvements and don't stop. Sometimes it's easy to say, okay, we won, and then forget to fully scale it or integrate it into full organizational change management. There are probably a lot of things to do after that first pilot. But Nigel, any thoughts?

In terms of the A-to-E methodology, one of its strengths is that you can use it, as Donna said, at an organizational level. When you first start to think about how you're going to solve your organization's data quality problems, you can certainly use A-to-E to give you that broad overview and to identify, if you like, the key domains or key areas of data quality issues that you can then drill down into and focus on. But even when you get to the level of an individual data improvement project, you can still validly use the A-to-E methodology to help drive your project. You still need to understand, at that lower level and in more detail, the business context of the problem you're trying to solve, all the way through to how you then evaluate whether you've been successful or not. So an advantage of something like A-to-E, to me, is that it's inherently scalable, both up and down.

Yeah, makes sense. All right, Mark, what's next? Do you have time for a couple more? We have time for a couple more, and we've got a couple of very similar questions in here, which makes me very excited, because I love thinking about this particular problem.
What's the best way — do you have any advice — to clean customer address data? Do you use official sources to validate against, or do you have any other recommendations?

I'll chime in — go ahead, Nigel. No, I was just going to say, and Donna can hopefully elaborate on the US side: in the UK there is a data source called the Postal Address File, or PAF, and many organisations can, for a fee, access the PAF. It gives them what is supposed to be a definitive data source of all the physical addresses within the UK. It isn't 100% perfect, and a lot of tool vendors, particularly data quality tool vendors, offer the PAF as an optional extra that you can pay for, so that their tool interacts with it through an API.

Yeah, and there are similar data sets across the world. The US Postal Service is one; for organizations, something like Dun & Bradstreet. I think with address data it's tricky, right? It's easy to validate against a source like the postal service: is this a valid, mailable address? But think back to all those dimensions: is it the right person's address? And when you think of organizational data, a lot of the problems we've had, even with the data model, are: is that a mailing address, a billing address, a ship-to address? People using one field for a ship-to when there's more than one site on an order, right? So the nuance around address is: yes, you can validate, but a lot of it is the usage, the data model, and the stewardship around it. And be careful about over-validating. I told you my 911 story — the emergency number here in the US. My parents were trying to get a washer-dryer delivered the other day, and the company couldn't validate their address; they were at their wits' end, being 90 years old and not able to solve it, and the company lost the sale because they were over-validating on address. So: can I validate against the postal service? Yes. But generally what we see is that a lot of it is the bigger-picture stuff around address — is it up to date? — so think through the dimensions of data quality before you just go and buy from a third party; the third party only solves part of the problem, right? But Ryan, did you have any thoughts? Are addresses near and dear to your heart? Not particularly — I haven't had the pleasure of doing that type of validation myself. Sorry, you're missing out on all the fun.

It is the classic problem that we all like talking about. And with that, we are at the top of the hour again. So thank you very much to Donna, Nigel, and Ryan for the webinar today. We will be sending out the recording and slides within a couple of business days. Everybody have a wonderful rest of your day. Thank you. Thank you, Mark. Thank you, everybody. Thank you all. Bye.