Hello and welcome. My name is Mark Horseman and I am a data evangelist with DATAVERSITY. We would like to thank you for joining today's DATAVERSITY webinar, Getting Data Quality Right, sponsored by Reltio. It is the latest installment in a monthly webinar series called DataEd Online with Dr. Peter Aiken. Just a couple of points to get us started: due to the large number of people attending these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A section. If you would like to chat with us or with each other, we certainly encourage you to do so. Just note that the Zoom chat defaults to sending to just the panelists, but you may absolutely switch that to network with everyone. To open the Q&A or the chat panel, you will find the icons for those features in the bottom middle of your screen. To answer the most commonly asked questions: as always, we will send a follow-up email to all registrants within two business days containing links to the slides. And yes, we are recording, and we will likewise send a link to the recording of this session, as well as any additional information requested throughout the webinar. Now let me pass it to our sponsor, Reltio, and Mike Frasca. Thank you.

Thanks, Mark. Hey, everybody. I'm Mike Frasca. I am Field CTO here at Reltio, and we're going to talk a little bit today about data quality: some of the data quality challenges that we see with our customers and some of the things they're looking toward in order to improve their data quality from now into the future. Today, data silos and application complexity really accelerate data quality challenges for our customers. Gartner estimates that the average enterprise has 446 unique applications that are disparate and don't talk to each other, right? That's 446 unique data silos, 446 different entry points for bad-quality data. And that becomes more and more of a challenge as more and more SaaS applications enter the enterprise. At Reltio, our focus is on simplifying that problem and delivering unified, trusted core data in real time, both to analytic stores and databases and back into SaaS applications and marketing applications, to make it easier for data quality professionals to access that data and put it to use to actually drive business outcomes. Reltio is involved in a bunch of different market segments: life sciences and healthcare, retail, financial services. But the common thread across all of those is that customers need to understand their data and make better use of it. So things like data quality, data governance, and entity resolution are table stakes, right? Those are core capabilities that our customers need to get out of their data in order to make sure their businesses are performing in today's marketplace. As we talk to these customers and to people in the industry, there are three major pillars that a lot of them are focused on around data quality. The first is assessing data quality: struggling to understand even what data quality gaps they have, and being able to understand how those compare to industry benchmarks. If I have an 80% fill rate for my addresses, is that good? Is that bad? Who knows? Understanding how that compares inside the industry is really important.
The second is the ability to manage that data in real time: to measure it in real time, identify anomalies, and reduce manual effort. So how do we automate a lot of that management? And then the third is enriching that data. This is not only enriching data with third-party data sources like ZoomInfo and reference data sources like Dun & Bradstreet, but being able to enrich your data with other applications in your enterprise. Your customer service application may have information about a customer, such as an address or a phone number, that your shipping system doesn't know about. Sharing and enriching that data is really valuable for making a difference in how you treat your customers.

So today we are really focused on solving some of these challenges. This is where we are making our largest investments and where we think the market is going. One, real-time integrated access to data quality. This isn't just a real-time view into profiling stats and quality stats; it is also real-time access to the end result of that data, because it's no longer acceptable to see this data as it stood a month ago or a week ago. We want to see the quality of our data right now, at this very second. The second piece is understanding industry benchmarks. Once again, just knowing how many addresses you have, or how many unique phone numbers, or how many customers, is not good enough anymore. Understanding how that compares against your competitors and against other people in your industry is key to focusing on where you can make the most impact across your business. And the third piece is automated recommendations: automated cleansing of data, and automated recommendations of third-party data sources that could clean up or enrich your data. We really believe that AI and ML are key to doing this in a highly automated manner in order to reduce manual effort.

The last piece I want to talk about is one of our customers who has made a large investment in data quality; that has been their key focus. This is Schneider Electric. They are an electric distribution company that was founded in 1842, so they've been around the block for a while. They have a lot of information, and they work in a very highly regulated field. They focused on unifying data across 20 different systems, comprising about 5 million organizations and about 13 million individuals, along with enriching the organization information through integration with Dun & Bradstreet data. And they are really focused on improving that data quality to gain sales effectiveness and save on shipping costs. Some of the outcomes they drove out of that: they discovered millions of new sales opportunities, they saved hundreds of thousands of dollars in shipping costs by having better-cleansed address data, and they reduced the time it takes to onboard a new customer into their systems by 50%. Think about how much faster you can engage with your customers and what a better experience that is for them. All of that comes from better data quality across the organization and making more effective use of that data. So that's what I have today.
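To make Mike's fill-rate question concrete: a fill rate is just the share of non-null values in a column, and it is something you can start measuring today. Here is a minimal profiling sketch in pandas; the column names and toy records are invented for illustration and are not Reltio's.

```python
import pandas as pd

# Toy customer extract; in practice this comes from one of your 446 silos.
customers = pd.DataFrame({
    "name":    ["Joan E. Smith", "J. Doe", None, "A. Patel"],
    "address": ["12 Main St", None, None, "9 Elm Ave"],
    "phone":   ["555-0100", "555-0101", "555-0102", None],
})

# Fill rate per column: the share of non-null values.
fill_rates = customers.notna().mean().sort_values()
print(fill_rates)
# address    0.50
# name       0.75
# phone      0.75
```

The number alone means little; as Mike says, the benchmarking question is whether 50% address coverage is normal for your industry or a sign of trouble.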
So I'm hoping to talk more when we get to questions. Mark, I'll turn it back over to you.

Thank you so much for that wonderful welcome and slide deck, Mike. Now let me introduce our speaker for today, Dr. Peter Aiken. Peter is an acknowledged data management authority, an associate professor at Virginia Commonwealth University, past president of DAMA International, and associate director of the MIT International Society of Chief Data Officers. For more than 35 years, Peter has learned from working with hundreds of data management practices in 30 countries. Among his 10 books are the first on CDOs and the case for data leadership, the first describing the monetization of data for profit and good, and the first on modern strategic data thinking. International recognition has resulted in an intensive schedule of events worldwide. Peter has also hosted the longest-running data management webinar series, hosted here at dataversity.net since 1999, before Google was big and before data science. He founded Data Blueprint, a consulting firm that helped more than 150 organizations leverage data for profit improvement, competitive advantage, and operational efficiencies. His latest venture is Anything Awesome. And with that, let me turn everything over to Peter to get today's webinar started.

Hello and welcome, Mark, and welcome to everybody. Thanks for joining in today. Our topic today is going to cover three major areas. The first is approaching data quality, and we'll talk about a number of considerations in that. Then we're going to talk about what we need to do to get better at data quality, and then how we get better at data quality. And at the end of the hour, we'll bring Mike back on and see whether you can stump us with some of these good questions y'all are already raising in the chat. Again, we love the dialogue, so thanks for tuning in. In the 90s, as Mark mentioned, I wrote a book with Clive Finkelstein, which still actually has some really good stuff in it. In the book, we both put in some words about why organizations haven't taken a more proactive approach to data quality. And the answer turns out to be a little bit scary. Fixing data quality problems is not easy. Mike was just describing situations they have run into, and these are simply non-trivial challenges, because data has been under-resourced for many, many years. In addition, as soon as you start talking about projects and data quality and things like that, the problems turn around and come after you. Your efforts are likely to be misunderstood, because management thinks this stuff is easier than it actually is. And quite frankly, we've seen some organizations have their challenges get worse before they get better. And now, congratulations, you get to fix it. A single data quality issue can grow into a significant unexpected investment or cost, depending on how things are working in those areas. I will say that of all the things I've written, this page has been up on more cubicle walls than any other page, and Clive was also quite pleased with it in the long run. So let's talk about something, and I'm going to change the title of this slide in real time here for you. It originally said making warehousing successful, but nobody cares about warehouses. Today, we all want to do cloud. So let's do cloud.
Unfortunately, though, I think you'll see the answer is the same whether you're moving data to a warehouse or to the cloud or wherever. So when we look at what's going on in this context, I have three rules, because what happens in most of the projects I've observed over the last 20 years of people moving into the cloud is something called forklifting the data. You can see it illustrated right here on the screen: we just take everything that we have and throw it in the cloud. Now, there are some problems with that. First of all, there is no basis for decisions about whether this data should be included or not. There's no inclusion of architectural and engineering concepts in this movement of data. And there's no awareness that these concepts are missing from the process. What does that leave us with? Well, too big a bill from whatever cloud provider it is that you're trying to work with. Especially when you consider that all the research in the world shows that 80% of organizational data is ROT. ROT is an acronym standing for data that is redundant, obsolete, or trivial. And the question should come to your mind: why then would I want to spend money storing it in the cloud, or anywhere, at that point? A better approach is to look at the cloud as an inflection point, just the same as a warehouse is a transformation point. We should take that data, but instead of just forklifting it in, we should actually transform it, because data in the cloud should have three attributes that data outside the cloud does not possess. The first is that it's less in volume. The second is that it should be cleaner. And I hope you agree with us that it should be at least as clean; if you're moving things, you're likely to break something, so you need to do something proactive to make sure it is cleaner. And the third attribute of data in the cloud is that it should be, by definition, more shareable architecturally than other types of data out there. This also, of course, gives us the opportunity to do something we call data branding, which is the idea that we can now say data coming from a certain source is trustworthy because we know its characteristics, whereas data with unknown quality characteristics should be treated as more suspect. And of course, the real catch is that once you move your data to the cloud, now you get to fix your data in the cloud. That's like using a glove box: you can see how this is clearly safer, because there are things on the other side of the glass that we don't want in contact with us, but at the same time it is more difficult to work that way. In fact, data quality problems, I tell most groups, are really kind of like sitting at the bottom of Niagara Falls and having somebody tell you there's a water quality problem upstream. What are our chances of fixing it from down here at Niagara Falls? Absolutely none. Let's talk about what those attributes of data quality should be. We break them up into four categories. The architecture has quality attributes; I've listed them for you here, and I have an appendix to this presentation, which you'll get at the end, with the details as well. In addition to data architecture quality, you can see that each data architecture is going to be comprised of one or more models.
And of course, the conceptual quality of those models is absolutely crucial. Data value quality is another aspect of it. And then there's data representation quality. Those of you headed out this week to Enterprise Data World: one of my good friends, Michael Scofield, is going to be doing a talk on visualization that covers that representation quality very nicely. Wonderful recommendation; he's a great presenter as well. If you notice the way I've set up this chart, though, the things on the left-hand side tend to be practice-related data challenges. There are many of them, and they are closer to the user, which means that things on the other end of the scale tend to be structure-related data quality challenges. There are fewer of them, and they are more remote from users and management. Boy, I see a typo right there that I have to fix, but you guys get the picture in terms of what we're looking at. So when we talk about data quality, we're really talking about four different dimensions: again, in blue, structure-related, and in brown, practice-related. And unfortunately, most organizations aren't aware that this is the way they need to attack these kinds of problems. Our definition of data quality has for many years been data that is fit for purpose, which means we don't know until we look at the data in context where it's actually going to show up. The practice-related issues are failures in capturing or maintaining data, allowing incorrect data to be collected when requirements specify otherwise, or presenting data out of sequence. Whereas our structure-related ones are problems where data is imperfectly arranged, where there is a difference in specification, where the data may have been captured but is not accessible, or where we've provided incorrect data as the correct response. So knowing that you have to separate data quality into these two pieces is actually very critical. Let me give you an example of a practice-related data quality item. If you've ever had knee surgery (I have not, but I know several people who have), they actually make you write on the correct knee on the day of surgery: do this knee, not the other one. This is data quality remediation. And by the way, the doctors don't mind at all, because we have managed to reduce the number of times a physician will actually operate on the incorrect knee. So that's a practice-related data problem. Now let me talk about structure-related data problems for just a quick second. New York City had a big tree problem they were trying to figure out. In the 11 months between 2009 and 2010, four people were killed or seriously injured by tree limbs in Central Park alone. So the arborists on the staff of New York City said, we believe that doing maintenance to the trees can make them healthier and more likely to withstand a storm, but they had no data to back it up. So you sort of see the problem. They're trying to figure out what they can do, and the question was: does pruning trees in a year reduce the number of hazard conditions? Well, there was lots of data, but there were granularity challenges. Pruning data is recorded block ID by block ID, whereas cleanup data is recorded at the street address level. And gosh darn it, trees don't have unique identifiers on them, even in New York City. So they pulled all this together, downloading and merging it, doing a lot of really careful munging.
And I'm going to come back to that word munging in a bit. They were able to look and see that, wow, addressing certain types of hazards could produce a 22% reduction in the number of times the department has to send crews out for cleanups. The best data analysis is always going to generate new questions. New York City can't prune every block every year, but with this more correct, integrated data they can now specify block profiles and decide which blocks should be looked at. Another component of all this, on the structure-related side, has to do with deciding who is who: the whole concept of identity management. And gosh, if you've never had the fun of figuring out who somebody like Joan Smith actually is, it can be a real challenging exercise. This is an exercise from SAS, and I love it because it exemplifies the problem very nicely. J. E. Smith came in from a website because it has an autocomplete; thank you, Microsoft, for putting some stuff in there for you. The call center may have had wax in their ears and heard Joni Smith instead of Joan Smith, so Joni Smith gets another entry in the customer database. She's already in the customer database, but as a Joan rather than a Joni or a J. E. And finally, somebody on the third-party mailing list has spelled her name differently again, Joan E. Smith, and that gets in there too. We don't know, of course, that it's all the same person on the other side of this. And this is the biggest barrier at the moment to going digital, because everybody has this on their agenda: we've got to go digital. But it isn't possible to go digital without cleaning data and adding at least some discipline to the process. For example, every digital exercise I've gone through with an organization has at least one of these; the cartoon picks on Nebraska here, where somewhere in the project scope some random person in Nebraska is maintaining something that you can see is key to that entire digital initiative. Unfortunately, we're just not as aware as a society as we should be here. So we need to do better and figure out how we can actually get data to help out with our digital efforts. And again, I tell you this to show you that the level of understanding in society is kind of like Jimmy Kimmel going out on the streets and handing people maps and saying, show me where this thing is or where that thing is. This was literally on LinkedIn within the last couple of months. Wow. Okay. Garbage in, plus anything awesome, is still going to give you garbage results. Now, that's not new to you all; I know you understand that. But here's the sad part. What happened here was that this individual then posted: oh my gosh, I just realized this is true without blockchain. And it's true with blockchain. And of course the answer is yes, because anything in the middle is going to be problematic. So this is presented as a recent technology realization, but it really should be a fundamental technology realization that we start out with at the very beginning, letting people know that garbage in means garbage out. You can see here in the center that it doesn't matter whether it's blockchain or AI or MDM or data governance or analytics or whatever it is we're putting in the middle; it's all subject to garbage in, garbage out. And this is always true. So one of the key takeaways from this is: don't start with the technology. Let's start by making the data actually work, because the data can teach us things about it.
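The practical fix for the Joan Smith problem is some form of entity resolution: comparing records across sources instead of trusting exact string matches. Here is a minimal fuzzy-matching sketch using only the Python standard library; the similarity threshold and the records are illustrative, not from the SAS exercise.

```python
from difflib import SequenceMatcher

# Four "different" customers that are really one Joan E. Smith.
records = ["J. E. Smith", "Joni Smith", "Joan Smith", "Joan E. Smith"]

def similarity(a: str, b: str) -> float:
    """Crude name similarity; real MDM tools add phonetics, addresses, and more."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.7  # tuned per data set; too low merges strangers, too high misses dupes
for i, a in enumerate(records):
    for b in records[i + 1:]:
        score = similarity(a, b)
        if score >= THRESHOLD:
            print(f"possible match ({score:.2f}): {a!r} ~ {b!r}")
```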
In fact, one of the easy things to look at from a data perspective is consolidating data flows. I have saved organizations hundreds of millions of dollars by not sending the same data around twice or three times; I've found many instances of this. So we consolidate the data pieces. That sounds good. Let's replace the poor-quality data with good-quality data. Then we can feed that into these various technologies that we have, and only now can we actually evaluate the results and see what happens. One more quick piece on machine learning, which is, again, something Mike referenced, and he said correctly that it should really be pointed at your existing data rather than going out and getting new data, because you know your data much better than everybody else does. And yet for machine learning, we see three years of lost productivity while the data scientists try to learn how to do data quality, which is not something they've been trained to do. This is one of the reasons that machine learning efforts are failing so miserably in today's environment. If anybody wants a good data and AI story on this, hit me up next week at EDW and I'll tell you a little bit about it. So let's get to a couple of definitions. I've already told you that data quality means fit for purpose. It tends to be synonymous with information quality, and most organizations set up a management program around it. Now, hopefully some of you are at least curious about the Popeye story. It turns out that the Popeye story came from somebody working in a bureau of statistics who made a decimal point error. So spinach looked like it had a lot more iron in it, and therefore it was more valuable to all the people eating spinach. It was never true, it has never been true, and it's never going to be true. And unfortunately we've just taken the entire myth away from Popeye, but it does show you how these things become problematic. Finally, the last point on the definitions page is that we need to understand that data quality needs to be an engineering program. It can't just be managed; it must be engineered, because these concepts are not generally known throughout the rest of the business. Again, they measured 3.5 milligrams of iron in 100 grams of spinach, but it was recorded as 35 milligrams of iron, and it wasn't corrected until 50 years after the publication. So people read things in reputable journals, and the things they read were incorrect. I love that story, and I love this one as well. Data is very much a princess-and-the-pea type of story, because data is this sort of thing at the bottom that causes a lot of other things to go wrong. In this case, the princess is sleepless, but we'll go ahead and move on to leveraging, which is probably not something the princess, or management generally, understands, but all of you do. You can see that if I have a 1 kilogram round ball and it's trying to balance out this 100 kilograms, it's not going to work unless I have a very, very long lever. But if I put a 10 kilogram ball on it and do some math to figure out where that should be, I can figure out how to lift it. And if I want the load to go down, I move up to 11 kilograms. Again, a human can lift a load that weighs much more than a human does. And leverage in data is exactly the same thing.
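The lever arithmetic is worth seeing once with numbers. The balance condition is mass times arm length on one side equals mass times arm length on the other; a quick sketch of my own, not from Peter's slide:

```python
# Lever balance: effort_kg at distance d balances load_kg at load_arm_m
# when load_kg * load_arm_m == effort_kg * d.

def balancing_distance(load_kg: float, load_arm_m: float, effort_kg: float) -> float:
    """Arm length at which the effort mass balances the load."""
    return load_kg * load_arm_m / effort_kg

print(balancing_distance(100, 0.1, 1))   # 10.0 m: a 1 kg ball needs an absurdly long lever
print(balancing_distance(100, 0.1, 10))  # 1.0 m: a 10 kg ball balances at a practical arm
# Nudge the effort to 11 kg at that same 1.0 m arm and the 100 kg load lifts.
```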
When you apply data quality to these things, you have your organizational data instead of the weight. We have some technology that we use; in most cases, people understand it as a rule or a lever that we can use. This is actually a two-part piece of technology: we have the lever and the fulcrum. And of course, we have our people. These are our knowledge workers in the organization, oftentimes supplemented by very talented data professionals, though I've also seen them come up with their own kinds of things. Finally, we need a process. Three-legged stools are better than two-legged stools, I hope you agree: so people, process, and technology, guided by some form of strategy. And if we take the organization's data and reduce the ROT by even a small amount, that increases our data leverage. So this leverage concept permits organizations to manage their data better within the organization and with their exchange partners and, most importantly, to focus it on the organizational mission. It also allows us to multiply our leveraging capabilities by focusing on non-redundant, non-obsolete, non-trivial data. The bigger the organization, the greater the potential leverage that exists. Treating data more like an asset, as in improving data quality, lowers IT costs and increases knowledge worker productivity. Both of those are things that you absolutely want in your organization. Here's a very concrete example of this as well. This is an example that we use in the area of metadata, excuse me, master data management. We start out with reference data. That's that little tiny blue dot up there, almost in the right-hand corner. Reference data says things like what countries we do business in, what types of accounts are available, and what the controlled vocabulary items are. If we don't get the reference data right, we might not, for example, be able to do business overseas unless we significantly reengineer the original app. That reference data guides something called our master data, or main data, or golden data. This is things like: are you a member of our premium club? Do you have authorization to do these kinds of things? Are you using common or standard data structures? And then, of course, we have all of the other data that comes with it in the world, the transaction-oriented data. This is the five bucks that you pay for a sandwich, or whether you're authorized, or even just liking something on a social media site. All of these layers are wonderful ways of controlling this, and you can see the leverage here: if I don't get it right at the reference data level, I have the same exact situation as the princess and the pea. My thanks to my friend Chris Bradley, who really did come up with the concept behind this and found a really good way of expressing it. Let's do a couple of other things that are not always recognized as data quality challenges. The first one is the IRS, our wonderful tax authority here in the US. We had a great hit piece, if you will, that came out in the Washington Post, where it said: oh my God, the Treasury has sent more than one million coronavirus payments to dead people. And that sounds like a horrible thing. Well, let's actually look at the numbers, and again, thanks to Slate for doing this. We sent 160 million payments. The ones that went to dead taxpayers: 1.1 million. The error rate on this: 0.4%. Yay, government, right?
This is actually a really big success story: the IRS was able to do this quickly, and there was real money in it. We pushed a lot of money out to people fast because we had an economic crisis. And lawyers had determined that if you happened to be eligible for the tax subsidy, or the refund, and you happened to die, the IRS was not allowed to take that payment away from you; it belongs to your estate. So once again, of course we sent payments to dead people. So again: Washington Post, do a better job on this next time, please, and take a look. The headline should have been that the government did its job really well. Speed was the priority, and within two weeks the IRS delivered half of these payments electronically. That is an unbelievable thing, and that speed was part of keeping spending going in the middle of this recession. Wonderful, wonderful story; a really interesting data quality story. Another one here. This is one that was sent to me at a company I was at at the time. The bank didn't know it made an error. Tools couldn't have prevented this. People lost confidence in the ability of the bank to manage funds. We actually got on the phone with them and said, where can I spend this wonderful gift card that you've given us? And they said, oh, you can spend it anywhere. And I said, well, could I go buy a car with it? And eventually the customer service agent went, whoa, did we really send you a gift card for $0? And by golly, you can see here: yes, they had in fact done exactly that. I've been a member of the IEEE for more than 30 years. I signed on the other day and it said I've been a member for three minutes and four seconds. Yeah, these are the data quality errors that make people have less confidence in your ability to do things. Here's a great one. At the Port of Seattle, they needed a trench for electrical cables to be specified at 2.52 inches. Instead, it got rounded to 2.5, which required a million dollars to go rent other facilities while the new cable was sorted out. How are we ever going to go back and catch something like that? These are very dear costs that occur from data quality. And here's a wonderful example that I love to talk about a lot, which is: are you going to tell us the chocolate story, Peter? Again? When they say "again," you've actually made a very significant breakthrough, because they remember the story. And the story, as many of you know, is that if you're trying to sell a lot of chocolate at the times of year when people buy lots of chocolate, it makes sense to have your systems available to sell chocolate; you don't want IT problems when you're trying to sell. Big organizations will lock down their entire facility and say, you simply can't make any changes here, because it's so important that we sell during our peak season. One last quick example here. In Britain, they've got 17,000 pregnant men on the books. This is a coding error: somebody made a misplaced keystroke, and now we have pregnant men in Britain. Well, of course, we know that doesn't work. Another example here with Microsoft: if you keep track of stuff in Excel and you're using the older version of the spreadsheet format that ends in .xls, you start dropping data after you've put 65,536 rows into it.
Well, they traced this and found that all sorts of things that were being called IT errors were really data quality errors, arising from what should have been reasonably normal, unproblematic tools. Once again, another quick one here. I'm a guy. I've got a microwave. I heat up stuff, right? Everybody understands the microwave turntable piece. That microwave oven cost me $40. Being clumsy, I dropped the tray and wanted to fix it before I could actually get accused of breaking it. I'm not trying to hide anything; I just really wanted to fix it. So I get on the GE Appliances website and I look for it. What am I looking for? A repair part. Good, for microwaves. Again, everybody's got it here, right? Yes, a countertop microwave. So I put in what I'm looking for, in this case a turntable, right? They don't have any turntables in there. Can't find them. You don't want them. Back to zero results. Well, you know what? When I finally looked around a little bit, I found a schematic for this thing, and it turned out it was not called a turntable. It was called a "tray-cooking." Okay, just what I would need, right? Not where I would have gone originally. Clearly a discrepancy between the consumer end of things and the engineering end of things. But also look at this: when I go to buy the cooking tray, it costs $48, which is $8 more than the microwave cost in the first place. Add $9 in delivery charges on top of that. Of course, you know exactly what I'm going to do. I'm going to throw the old microwave out, which seems like a really sad thing, and buy a brand new one, just to gain a new cooking tray. Is this a data quality problem? Yes. Absolutely, it is a data quality problem. And what this gets us to as a conclusion is that many people really do not understand data, because they approach it from a wide variety of different perspectives, and it looks like different things from different perspectives. This was a commercial that we saw in the airports before the pandemic. Whoops, I'm so sorry, went too fast. There we go. It's the idea that in certain parts of the country, people love white cats or hate white cats, right? And it goes through each of these various pieces. All of these data quality challenges are unique and context-specific. And yet our data knowledge out there in the world is too little and too ill-informed. It works pretty well at the workgroup level, because there it actually turns out to be a defining characteristic of a workgroup. But without guidance, what are the chances you're going to pull this stuff together? That's a guy named Wally Easton, in case you can't read the little fine blue print over there. And I use this clip a lot, because this is what happens in your organization when people who do not have the requisite knowledge and skills try to go about changing data. They might actually get something done, but if our goal for Wally was quality, Wally really failed that particular piece, right? He's got the speed part of it, perhaps. But bad data becomes chaff in our organizational machinery. This morning I was down at one of the local military bases near here, and they have some very explicit things: a piece goes off, and 18 hours later, things have to happen. That's how tight the planning is. And gosh, we sure want our military to be supported and to do the right kind of planning; we don't want sand in their gears. It just doesn't work.
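One aside on the Excel story from a moment ago: the legacy .xls format holds at most 65,536 rows per sheet, and anything beyond that is silently lost on save. A minimal sketch of a cheap guard, my own illustration rather than anything from the webinar:

```python
# Guard against silent .xls truncation: compare row counts before trusting a file.
XLS_MAX_ROWS = 65_536

def check_for_truncation(source_rows: int, file_rows: int) -> None:
    if file_rows < source_rows:
        raise ValueError(f"Export dropped {source_rows - file_rows} of {source_rows} rows.")
    if file_rows == XLS_MAX_ROWS:
        # Landing exactly on the format limit is suspicious even when counts "match".
        print("Warning: row count equals the .xls limit; the data may be truncated.")

check_for_truncation(source_rows=70_000, file_rows=65_536)  # raises: 4,464 rows lost
```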
One final note on this: data is not the new oil, because most organizations don't have the knowledge and skills. Instead, I like to say data is the new soil, because that gives us a little bit better focus. Again, you put data out there, but we now need to clean the data, organize it, and group it in order to get things done: changing the order of fields, sequencing, correcting factual errors, whatever it is that we want to do. And we'd all like to do better data analysis. So you're all agreed, I hope, that some data preparation is required in order to do good-quality data analysis. What would a good ratio be? Maybe people say 50-50. Most people actually say, I'd much rather have 80% of my expensive data scientists' time focused on doing data science things, and not really working on data quality. Turns out the actual ratio is 20/80, and everybody knows it. Everybody knows it, so why don't we understand that all organizational challenges have data root causes? It helps if your organization has a burning bridge. Don't ever let a crisis go unanticipated. The star forward went down hard, and so did Nike; the company's stock slid on the news, closing the day down nearly a point and losing an estimated billion dollars in value. You didn't get to see that little clip, it went by quickly. That was Zion Williamson, the basketball star, wearing a brand new pair of Nike sneakers that broke within minutes of the game starting. Not a good place to be. And what that means is Nike had a burning bridge issue. There was something happening there that shouldn't have been happening. Somebody needed to fix it, and now we're back to the place where we started a half hour ago. You've got management's attention. You really need to make sure they understand things, because too often "do something" turns into "buy something," mostly technology-based. And, oh, by the way, get data quality in there, whatever that means. Well, a fool with a tool is still a fool. We do accomplish something: we use all the project budget up. But in this case, we don't actually fix the problem, because it's not the case that a tool alone will fix it. We need a programmatic approach. Let's do some math really quickly. If I invest X in Y, then outcome Z should result, and Z should be greater than X. If I invest $100 in fixing some data quality problems, I had better get more than $100 worth of value back. At the beginning of the project, when we're just trying to figure out what's going on, we don't have definitions or agreement on any of these things. And if I go to people and say, so what, we cleaned 100 data elements or 100 items or whatever, that doesn't land. But as soon as I can turn this into, we spent $100 cleaning one data set and got $1,000 of value back, management cares at that point. You've got to put it in a way they will understand. And again, when will it be done is another question that comes up often as you start your data quality journey. This means that you're going to have to explain to them the difference between programs and projects. Your data quality program is going to last at least as long as your HR program. That is what you have to drive into their minds, because: is data going to be more important in the future, or less? Well, I predict more. Let's see if we can get everybody on that same page. It's not a project. Data really is a durable asset with a valuable life of much more than one year.
And while it's reasonable in development to specify two-week or 90-day increments, data evolution really should be measured in years. It evolves over time, it's significantly more stable, and it's a prerequisite to any sort of agile development process. So to have a data quality program, you need an ongoing commitment that permits evolutionary improvement. You need some governance around it, with senior-level involvement, so that they are paying attention to the money. You need executives who are educated about this. And you need budget priorities, senior attention, and reasonable time frames. Let me give you an example of how New York City does it when they're repairing bridges. Those of you up in New York City understand this. They hire a crew and they paint continuously. Why? They know how many people it takes to paint a bridge, so when the crew finishes at one end of the bridge, it's time to start over at the other end. They are also aware that the rest of the maintenance goes on alongside. They just keep repeating this process, ensuring that they have a knowledgeable, bridge-literate workforce keeping track of it all. Well, let's talk a little bit about what I call the data sandwich. The idea is that if you're going to do anything with data, it's generally going to require high-performance automation and some combination of literacy, data standards, and data supplies. And our goal is to even these things out so that they work in conjunction with each other, because if they were all raggedy, like I showed, I couldn't put them together like this and get them to work the way they should. And this cannot happen without engineering and architecture. Quality engineered products do not happen accidentally. If you need to illustrate that to somebody, pick up any iPhone and ask: how long did it take Apple to get rid of the home button? The answer was 10 years. The same thing, of course, is true if we add the word data to these phrases: quality data engineering products cannot happen accidentally. By themselves, the words are likely to conjure up some really confusing things, so let's give a quick definition of engineering here. This is something you may or may not have heard about: Texas A&M had an awful bonfire collapse, and they went in with a forensic analysis to see what had actually happened. What you can see here is that some of the logs broke on one side, which caused the entire structure to lose structural integrity and collapse. By the way, a person is half the size of one of those logs, so this is a very large bonfire; think Burning Man. What the post-tragedy analysis found was that originally the Texas A&M students were told to tie those logs together with a specific, structurally strong type of rope and binding, and over the years the instruction degraded from tie them together with this type of rope to simply tie them together. That is clearly not an example of quality engineering, and we need to move to data quality engineering if we're going to get these things to work correctly. So that's a bit about approaching data quality. Let's now talk about what we need to do to get better. And the first thing is to understand that it's practically impossible to understand almost any system from just its pieces. We need to look at the context and the relationship to the whole.
So again, you need to see the forest and the trees simultaneously if you want to do a proper job of this. Let's take a very simple example that will be easy for you, and you can use these examples, please do. Some of you have come up with better ones, by the way, and donated them back; I try to give credit where that happens. So here's an IPO (input-process-output) diagram. The process is called Make Pizza, but my inputs are just dough and water. Problem? Yes, absolutely. And what can I do about it? If my output is supposed to be pizza, I have insufficient inputs for this process to produce pizza. So maybe I shouldn't make pizza; maybe I'll say I'm just going to make a crust. That's okay: we can rescope and say make a crust, or we can add the other ingredients. The point is that anybody in your organization can follow this type of thinking, and as they start to understand it, you can then educate them on the other parts: if I'm working in a process in a business context, what are the levels of quality required by the processes that I govern? What are the processes that govern them? I have some fiduciary responsibilities in there. And what quality attributes are required downstream by further consumers along the way? Now, we call this root cause analysis, and that makes perfectly good sense, but let me just give you an example of how this plays out. This was actually a Walmart-related project. They wanted to know why infant mortality was high. The answer turned out to be malnourished mothers. And why were the mothers malnourished? Substandard biology education in high school. Ouch. And why were the biology programs substandard? Because it turned out that, at least in this part of the United States, they were giving poor-quality education to high school biology teachers. Why? Because the biology profession was unaware of the consequences: basically, the biology teachers were washouts from PhD programs, bitter about the process and certainly not serving the students. What units should we measure this in? Likely decades, but minimally years. There are also a series of interdependencies in data quality, so don't keep things isolated on a permanent basis. Here's an example with governance being related to a CRM initiative. If we're going to do this, we can go back to our DMBOK, and instead of looking at data quality alone, which is right up there at 11 o'clock, we can see that data warehousing, governance, and quality together will probably produce the kind of result we're trying to get to. Our process should be a forming process: we allow the form of the problem to guide the solution. That gives us a means of decomposing the problem, provides some understanding of where tools can be useful, and gives us a strategy for a programmatic solution, evaluating the solution as to whether we succeeded within the business context. And most importantly, we keep this as a growing body of knowledge internal to the organization, so that we can continue to have people work on this. Because once again: do we think data is going to be more important in the future, or less? I hope you agree with me, it'll be more.
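Peter's Make Pizza IPO diagram translates almost directly into a check you can run on any process: a process is only viable if its declared inputs cover what the output requires. A minimal sketch, with the ingredient lists as my own stand-ins:

```python
# IPO (input-process-output) check: flag processes whose inputs can't produce the output.
REQUIRED_INPUTS = {
    "make pizza": {"dough", "water", "sauce", "cheese"},
    "make crust": {"dough", "water"},
}

def viable(process: str, available: set[str]) -> bool:
    missing = REQUIRED_INPUTS[process] - available
    if missing:
        print(f"{process}: insufficient inputs, missing {sorted(missing)}")
        return False
    return True

on_hand = {"dough", "water"}
viable("make pizza", on_hand)  # False: dough and water alone can't make pizza
viable("make crust", on_hand)  # True: rescoping the process matches the inputs
```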
If you're not familiar with the book The Goal, it's a wonderful book that provides inspiration and really taught me a pretty big lesson, which is that a chain is no stronger than its weakest link. We have to understand that if our data quality is performing poorly, then no matter how good our products are, we're still going to have problems. And as I mentioned, this theory of constraints rolls up into something that hopefully a lot of you are familiar with as well: the standard Deming cycle of Plan, Do, Check, Act. Let's see how that works as a data quality cycle. Take the steps I just showed you before: let's plan something, let's do something, let's study the effects of it, and then let's make another change, and the cycle continues to repeat. So once again, in planning: how long is it going to take, what should it cost, and what sort of results should we get? Then we put it in place. Did what we wanted to do actually stick? Are they making the changes on the assembly line, or whatever it is we're looking for? Are we monitoring to see what happened with that deployment? And then can we take some action to come back and take it to the next step? There's a little bit of confusion about this, because when we look at a data quality life cycle model, there are, unfortunately, two different starting points. The first starting point, for new systems development, is right there, and we go all the way around in that direction: metadata, data creation, utility, etc. Don't worry, there are whole book chapters if you want to get into this at much greater depth. But if we've got existing systems, we have a very different process, where we start there and work our way around the cycle in that sense. Very fundamental, but very subtle changes that we need to make for this to work. Finally, a quick section on how we actually get better at this. First of all, what is strategy? Most people think strategy is a product; that's the way management consultants have made us think about it. It's a hundred-slide PowerPoint deck or a thousand pages or whatever. Back to the military, though: the military thinks of strategy as more of a process, because they have a saying that as soon as you get into battle, all your plans go out the window, right? And they don't say it that way; they say it in a much stronger way, but you get the picture. So when we're looking at data quality strategy, one of the things I want you to understand is this. Many of you remember Seven of Nine from Star Trek, sort of a strange character. My version of it is one to four. If somebody comes to me and says, I'm going to invest a million dollars in a tool for this, then I say to them, the only way that's going to work is if you invest significantly in the people and process support. So with a million-dollar tool investment, should I invest a hundred thousand dollars in the people and process? Five hundred thousand? A million? No: the answer actually is four million dollars. And if you've only got a million dollars in the first place, then you need to scale back your initial investment, so that you make a two-hundred-thousand-dollar tool investment and invest eight hundred thousand dollars in making sure people and processes are able to do what you're trying to do with the tool.
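That one-to-four rule works out to simple arithmetic; here it is as a sketch of my own, not a formula from the slides:

```python
# Peter's rule of thumb: every dollar of tool spend needs about four dollars
# of people-and-process spend behind it to pay off.

def split_budget(total: float, tool_parts: int = 1, support_parts: int = 4) -> dict:
    unit = total / (tool_parts + support_parts)
    return {"tool": unit * tool_parts, "people_and_process": unit * support_parts}

print(split_budget(1_000_000))
# {'tool': 200000.0, 'people_and_process': 800000.0}
# A $1M tool purchase would really call for $4M of support; with only $1M
# total, scale the tool down to $200K and fund the remaining $800K properly.
```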
Finally, a piece on conversation topics. Our engineers say, I'm going to go clean some data. But what does the business want to hear? I want to decrease the number of undeliverable targeted marketing ads. I want to reorganize the database, says the engineer. The business wants to hear: I want to increase the ability of the sales force to perform their own analyses. Yes, that's what we'd really like, as they are the smart cookies. I want to develop a taxonomy, says the engineer. No, the business doesn't want to hear that. They want to hear: I need a common vocabulary so that we'll have fewer miscommunications. I want to optimize a query, says the engineer. Well, how about: I'm going to shave one second off a task that runs a billion times a day? Yeah, that gets us into some very significant savings. Or, golly gee, I want to reverse engineer that legacy system. No: I want to understand what's good about the old system so it can be formally preserved, and what was bad so we can improve it. Too many of our data people still don't have this kind of vocabulary, but our data leaders are focused on raising the value of our data assets as an inventory component. We're going to decrease the amount of ROT, we're going to uncover the first version of a data strategy, and we're going to monetize our organization's data, and we can't do that without quality data. Everything the CDO, the data leader in your organization, does requires quality data. Are you going to be able to fix all of it? No. But find the stuff that's important, get working on it, and let's make some progress. There's another decision that has to be made in terms of the how. There are only two kinds of strategic initiative: improve existing things, or do something new. There are no other options in terms of how that works out. We've got quadrant one here, organizations without a formal data quality focus; we don't want that, we want y'all to have a formal focus. We can try to improve efficiency and effectiveness, and Walmart has always been recognized as a genius in that area, whereas Apple tends to get credit for being an innovative company. Now, whether you agree with those characterizations or not, let's just take them for the purpose of argument, and say: don't try to do both of those things at the same time. Instead, start out by looking at your organizational efficiency and effectiveness, save money, and then use that money to go focus on your strategic initiatives. Again, my own personal total here is more than a billion and a half dollars saved in 35 years. That's a lot, but it shows that there's a lot of opportunity out there in data, and we want to hear your stories here as well; this is the whole reason for DAMA International and its chapters. Let me give you a very specific example of money saved. This is a Defense Logistics Agency challenge; again, I mentioned I was just down there this morning. We were looking at millions of NSNs, national stock numbers, essentially the military's stock keeping units, that were in a catalog, but somehow, in a previous reorganization, all that data had been moved into comment fields. So we needed to extract this data from the comment fields and pull it out. Originally they were just going to put a lot of brute force into this, but we developed what would now be called a text analytics process.
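To give a flavor of that extraction step, here is a minimal sketch of pulling structured fields out of free-text comments with regular expressions. The comment layout and field names are invented for illustration; the DLA's actual data surely looked nothing this tidy.

```python
import re

# Hypothetical comment holding structured facts as free text,
# e.g. "NSN 5310-00-123-4567; qty 40; loc WAREHOUSE 7".
PATTERN = re.compile(
    r"NSN\s+(?P<nsn>\d{4}-\d{2}-\d{3}-\d{4});\s*qty\s+(?P<qty>\d+);\s*loc\s+(?P<loc>.+)"
)

def extract(comment: str) -> dict | None:
    """Turn one comment into a tabular row, or None if it needs manual review."""
    m = PATTERN.search(comment)
    return m.groupdict() if m else None

for c in ["NSN 5310-00-123-4567; qty 40; loc WAREHOUSE 7", "see Bob about this one"]:
    print(extract(c))
# {'nsn': '5310-00-123-4567', 'qty': '40', 'loc': 'WAREHOUSE 7'}
# None  <- lands in the manual-review pile
```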
What we're really doing is converting non-tabular data into tabular data. There's no miracle here, but we did manage to save $5 million for the Defense Logistics Agency, and it was my first person-century of work saved. Now, we tend to start out here by looking at the word munging, which is a very strange word, but it means everything we have to do to make the data capable of being moved or analyzed. And the problem is that it looks too much like alchemy. So we have all these hidden data factories out there, where work products are delivered to Department B, and B makes corrections because it's easier than going back and yelling at the people in Department A, and Department C corrects B's work, and then the customer catches C's errors. These are all problems: any weak link in the chain causes everything to slow down and, more importantly, introduces these hidden data factories, and I'll give Tom Redman credit for coming up with that wonderful term. These hidden data factories are all over our organizations, and single data quality issues raise multi-faceted organizational challenges. In fact, all data quality challenges are filtered through some combination of business challenge and IT system, and if we don't have people looking at this in a root-cause sense, we won't know that the one data quality challenge at the center of this diagram is causing all of these business challenges on the outside. We need to take this root cause analysis, understand the data components that are involved, and also reverse the flow of the analysis so that we have a team that is getting better, because if I just have Wally out there doing things, that's not going to work. All right, I said I was going to give you the numbers on this. This is the diminishing-returns exercise that we did. We went four weeks into the project and ended up solving 56 percent of the problem, and also discovering that 12 percent of their data was completely ignorable: rubbish, no way to check on it at all, no reason to check. So our problem space had gone from 100 percent down to 32 percent of the original problem. We worked on it for another 14 weeks, and eventually decided that one more four-week sprint would be worth it; if we could get a certain type of airplane part right, the customer said, that would be really well worth it. You'll notice the numbers go up and down a little bit. We finally settled on 7.46 percent of the problem not being solvable automatically, throwing away 22.62 percent of the data, and automatically handling 70 percent of the data. I don't know about you all, but compare that to the problem before: two million records that we had to individually look at, and now we're down to only 150 thousand. That's a good balance. Let's see how the numbers worked out. If I'm looking at two million NSNs, and let's just say it takes five minutes to fix each one, there's my total time in minutes. How many weeks of work do we get out of people, five days a week, seven and a half hours a day? Move some minutes around, come up with some numbers: I need 10 million minutes to go off and do this. That gives me my person-century: 92.6 person-years of work. A $60,000 pay rate gives us a total cost of about $5.5 million. And this is just the planning. Here's now the revised plan; I'm going to put the numbers in brown. Now I don't need two million records cleaned.
I only need 150,000 records cleaned, which means my time comes down significantly, and my overall cost comes down to less than half a million dollars. There's my $5 million in savings right there in that very quick exercise. And of course, all of you are smarter than this: you know that it doesn't really take just five minutes to solve a problem. If I double it to 10 minutes, I've saved $10 million. The key to this is kind of like musicians trying to get to Carnegie Hall; it applies both to the practice of data quality and to telling the stories we need to tell about data. Of course the question is, how do we get to Carnegie Hall? And the answer from a musician is practice, practice, practice. But you need to have both good music and a common vocabulary to do this. The key is developing your own set of data quality components internally, which will allow your organization to do a much better job, because instead of everybody trying to fix all the problems all the time, you can throw data quality to a group that becomes smarter as a result of doing it. Getting smarter means that when the next data quality thing comes along, you have the previous learning, and you have a group that's dedicated to it, so they're not thinking about other things or trying to do all this off the side of their desks. We've spent the last 50 minutes looking at this. We approached data quality, and I gave you some considerations for the cloud, but as you can see from the examples, it's not just the cloud; it's the warehouse; it's any kind of data that you're moving around. I gave you a very definitive set of data quality attributes; I did mention that the actual details are in the appendix of this presentation. I talked about the fact that most people don't understand structure versus practice challenges, and that they require different sorts of approaches, and that digitization requires quality data in order to be successful. I gave you definitions of data quality: it must be fit for purpose. I talked about what a data quality engineering program must have in terms of building on leverage, the right examples, et cetera, et cetera. All of these require high-quality data to operate in a high-performance, automated setting that is going very, very quickly. What do we need to get better? We need to teach some systems thinking. We need to look at data quality not in isolation, but as an integrated part of what's going on. We need to develop repeatable capabilities. And don't forget, everybody's going to come out there and sell you something called a "methodology"; I'm putting air quotes around that. It's a method. Methodology actually means the study of methods. How do we get better? We refocus around business outcomes. We get good at munging the data, with a group that becomes good at it, and don't make it your data scientists, please, whatever you do. Look at data strategy and find out what things are important, bubbling to the top, that are going to make a difference in the measures people look at. We have to understand that data is an investment, so we need to treat it with investment characteristics. It's going to take us years to clean some of these things up, but we don't want it to take forever; remember, it took 50 years to change Popeye back from a superhero to just a normal person. We need to have conversations, storytelling time, and leadership.
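Pulling the Defense Logistics Agency numbers together, the back-of-the-envelope math goes like this. The five-minutes-per-record figure is the planning assumption Peter quoted; the 48 working weeks per year is my assumption, chosen so his 92.6 figure reproduces.

```python
# DLA cleanup cost model: minutes per record -> person-years -> dollars.
MIN_PER_WEEK = 7.5 * 60 * 5   # 7.5 hours a day, 5 days a week = 2,250 minutes
WEEKS_PER_YEAR = 48

def cleanup_cost(records: int, minutes_each: float = 5, salary: float = 60_000):
    person_years = records * minutes_each / MIN_PER_WEEK / WEEKS_PER_YEAR
    return person_years, person_years * salary

years, cost = cleanup_cost(2_000_000)  # original plan: every record by hand
print(f"{years:.1f} person-years, ${cost:,.0f}")  # 92.6 person-years, $5,555,556

years, cost = cleanup_cost(150_000)    # revised plan: only the residue needing humans
print(f"{years:.1f} person-years, ${cost:,.0f}")  # 6.9 person-years, $416,667
```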
We need to have a programmatic focus, but most importantly, you need to have tangible ROI. We're getting to the top of the hour here, and I've just got a couple of quick slides left that I want to fill in while you're thinking of questions for both Mike and me. The first is some pretty good career advice that I've gotten over the years and that you should absolutely pass on to others: if they don't understand what you do, you are perceived as a cost; if they understand what you do, you're perceived as value. So investing a significant amount of money in a data quality program will help your organization get better in the long run. We want organizations to get better, but too often what we see in data is a lot of rah-rah stuff with no real meat on how to do it. So what are the winning program characteristics that you're going to need? The first is that the project needs to be small; again, you can't start out by boiling the ocean. Any project, especially a development project, should not be allowed to begin until you have the data requirements verified for the entire project. The second characteristic is that the project owner must have some sort of data skills. Few in IT and in business have the requisite data knowledge and experience to do this, and that's a problem. Find them, though: every organization has somebody who actually does this stuff and can teach other people. Third, the project must be agile-ready, because agile is a technique that actually requires more planning than the construction work around it. Fourth, teams must be highly skilled in both data and processing technologies; again, remember the four-to-one investment in technologies needed to make that work. Very few teams start out with that skill, but it's attainable: this stuff is not rocket science, folks. And finally, the organization must be emotionally mature and able to understand what's actually happening here. So again, the approach only works if we know where the data resides, we can communicate precisely, we're adept at technology support, and we get better, when data things happen, at making sure people understand that the outcome is an organizational result. Again, here are our data quality values and representations; call this reference material that you're going to get with the deck. With that, we'll turn it back over to Mark and Mike, and we hope to see you at Enterprise Data World coming up real soon. And we're at the Q&A section now, right? All right, Mark, back over to you and Mike. We do have some wonderful questions here in chat and in the Q&A section, and one is a bit of a segue from what you were just talking about; I've been thinking about it ever since this person posted it. Why do we keep talking about managing data for quality these days? 98% of data is produced digitally through some software. Really, how do we fix the input, the development process, and the developers who write the software that generates bad data? Developers who, in my history, I've had a lot of scraps with. Good friendly scraps, though. I mean, no, friendly scraps. Yeah, professional disagreements. Because I'm talking to you all today from Virginia Commonwealth University, I'll put on my university hat and say we're not doing a good enough job of educating people about the cost of the lack of data quality that's out there; we're simply not telling them these things.
So they look at a picture like this and they say, oh my god, that's exactly what I want to do, and you've got software that'll do it. Let's do it, right? Well, as you well know, Mark, it is not that easy. Mike, you want to weigh in on this? How do we look at this process and get people to start thinking about it correctly? Yeah, I would say that when we talk about developers and software and trying to fix that development process, this is really a feature, not a bug, in the development process. Oftentimes software is written explicitly to make it easy to get data into systems and move people through a process, where the ancillary benefit is using that data elsewhere. But we don't want to stop the process of a sale because someone didn't put in the right name or didn't give us the right phone number. So we often build processes on the back end to clean up and fix this without impacting that sale up front. While we could go to developers and drive stronger controls on the inbound side, oftentimes the goal is the opposite: fewer barriers on the data entry side. Absolutely. And I would also suggest having a data person on your development team, just as the voice of data. You've probably all heard the story of Jeff Bezos as he was creating Amazon: he always pulled a chair up to the table and gave that chair a slice of pizza, and people asked, who is at that chair, and why are they missing the meeting? He said, it's the customer. It's the voice of the customer, and we need to make sure that is represented throughout all of these things. Similarly, in a development environment, a development shop, we absolutely should have data people in there, and those data people are hopefully going to be versed in data quality. Let's add one other tease to this, Mike, and maybe you want to comment on it as well. We now consider it best practice in DAMA that if somebody is purchasing some software, you should actually ask the vendor to give you a logical model of the data as part of the deal. That's a good quality measure right away, because if they say, what's a data model, you know there's a problem. If they say, well, here's the data model, and it's not exactly the right one, but it's close enough that you can work with it, you're talking to somebody who knows something about data quality. And most importantly, you can evaluate whether the package you're proposing to buy complements your existing practices or is going to require more work to make it fit. Any thoughts on that, Mike? I think you're right on. One of the largest changes we've made at Reltio as part of our implementations is driving data structures and data models up front, before we even sign contracts with customers, because we were finding that customers would buy our software and then we would say, okay, let's talk about your data, and it would take a month to track down all the data models, where they are, and what the data residency is. We've moved that way up front to shorten time to value for our customers, and that's an absolute key thing. The earlier you can get that, the earlier you can understand what it's actually going to take. What are we working with? Where are our gaps? So getting that as far to the left as you can is great. Fantastic, certainly agree. All right, Mark, what's our next question? I just wanted to chime in on that a little more, because it's a fascinating topic.
This one time, at band camp: I was working at an organization with an enterprise system that was a little long in the tooth, and there was a need to capture some data. The developer said, well, I can just add a free-format text field, and we'll train people to type the data we want to collect into that free-format text field. And you can imagine the imaginative things that ended up going into that field. Well, sometimes it can be a relief to capture things you couldn't capture before. But in some cases it creates something our friend John Ladley calls data debt, where you're piling all of this poor-quality data into a system somewhere, and your expectation is that some poor person down the line is going to have to deal with it. And as I mentioned on the DLA example in particular, somebody in the government had literally paid a contractor to take data out of an IDMS database and move all of it into comment fields, which meant that our military was slowed down by people looking up data in the comment field and saying, if there's an asterisk in column seven on a Thursday, that means it's an Air Force part, versus an asterisk in column eight. Well, right, this is not the way we want to run our military, and of course the military didn't put up with it for long. I'm certain everybody has stories out there, and again, we'd love to get your stories, because we love to tell these stories and we want people not to make the same mistake twice. There's a wonderful question in chat: what one thing must you have in order to get data quality right, without which all other efforts will be for naught? And who has the primary responsibility to do or obtain that one thing? Well, what you're describing is, of course, a key piece, and I suppose it's key performance indicators and things like that. But I think the question is probably a little more focused, and maybe the area we should look at is the quality attributes that I showed at the very beginning. So let me pop that slide up and ask: is this, in fact, a project that's going to require a new architecture, or a new model, or a new representation, or different values? And again, where is the problem? I gave you the dichotomy that the things on the left are probably the things people have more demand to fix, because we're closer to them, but the things that are more costly are the model quality and the architecture quality. Anybody remember SAP BW? I'm sure there are a bunch of people on the call laughing. It was a product SAP put out, and it was well known that it was buggy: you could use it well for certain things, and not so well for others. That's an architectural quality issue, and that's going to be a big, big challenge. Mike, you want to jump in? Yeah, I would go in a little different direction. I would say the biggest cause of failures in data quality programs that I see is not tying them to a business outcome. You called this out in your presentation, Peter, but in the end, the people who are signing checks, signing contracts, and allocating budget want to understand how that quality is going to impact the budget and the bottom line. Oftentimes we jump right into logical models and data architectures, and we don't really understand what business problem we're trying to solve. Understanding that helps us drill down: we can get rid of half the models because they don't apply to the specific business problem. And we often skip that step.
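The asterisk-in-column-seven story deserves a concrete illustration. Here is a caricature of the kind of hidden positional encoding the speaker describes; the specific rules and sample values are invented, but the shape of the logic is the point: undocumented meaning buried in a free-text field that every downstream consumer must rediscover.

```python
# A caricature of the DLA comment-field encoding described above: meaning
# hidden in character positions of a free-text field. The specific rules
# and sample values are invented, but this is the shape of the logic that
# "hidden data factories" end up maintaining by hand.

def service_branch(comment: str) -> str:
    """Decode the branch flag buried in a legacy comment field."""
    if len(comment) > 6 and comment[6] == "*":  # an asterisk in column 7
        return "AIR_FORCE"
    if len(comment) > 7 and comment[7] == "*":  # hypothetical second rule
        return "NAVY"
    return "UNKNOWN"

# The durable fix is promoting the hidden flag to a real, named column:
record = {"nsn": "1560-00-123-4567", "comment": "RT4Q2 *SPARE WING RIB"}
record["service_branch"] = service_branch(record["comment"])
print(record["service_branch"])  # AIR_FORCE
```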
Absolutely; I would endorse your suggestion there a hundred percent. Colleagues of mine have done this kind of work, and they've gone into some organizations and started working at the A's, right? They come back in three years and say, I've cleaned all the A's, and everybody goes: nothing in our systems is any better. That's because they haven't hit anything material yet; they've hit the trivial, the ROT (redundant, obsolete, trivial) components of it. So you need those business analysis skills to go in and figure out what the problem actually is. And again, Mark said "at band camp," right? We all have these stories, and that's our code word for: we can't tell you which customer it is, but if we told you enough about it, you'd be able to figure it out, so we don't violate any agreements. And let's learn some lessons from this if we can. I mean, the city of Seattle, simply by asking for a cable of 2.52 inches and having it get rounded to 2.5 inches, took a million-dollar delay on a project. That's tangible, and that's something where everybody says: we've got to make sure this doesn't happen again. We've got a couple more fantastic questions. You mentioned the example of turning unstructured data into structured data as part of a data quality initiative. How else are you thinking about data quality in terms of unstructured data? I'll make a little correction here, just to make sure everybody understands the point I was making: if somebody comes to you and says, I can turn unstructured data into structured data, I would hand them a glass of water and say, please turn it into wine, because we're clearly in the wrong business. What I actually said was turning non-tabular data into tabular data, and that's a slightly trickier proposition. If something were truly unstructured, we really couldn't structure it; that's the definition. But yes, we had a very nice non-tabular data problem with all of this data supporting operations. And let's be real clear about the mission of the Defense Logistics Agency: get beans and bullets to the people who are paid to break things and kill people. That's their mission, that's their job, and we want them to do it as effectively as they possibly can. So if somebody comes to you and says, I can turn non-tabular data into tabular data, that's something they might actually be able to do. But if they tell you they can turn unstructured data into structured data, they're pulling your leg: either it's not unstructured data, or they're adding more structure to data that previously had less structure than you would like. For example, most people are not aware that there is an actual structure to a Word document; they tend to think of it as truly unstructured. Well, it turns out that by moving to an XML-based format many years ago and standardizing on it, almost all word-processing documents now have some structure in them. The structure might be just document start and document end, which isn't terribly useful, but most documents use much more of the structure the package provides, so you can chop the package into sections, and each section into subsections, and all the rest of that.
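To make the Word-document point concrete: a .docx file is literally a ZIP archive of XML parts, so its structure can be inspected with nothing but the Python standard library. A minimal sketch follows; "policy.docx" is a hypothetical local file.

```python
# A .docx file is a ZIP archive of XML parts, so the structure discussed
# above can be inspected with the standard library alone. "policy.docx"
# is a hypothetical local file.
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

with zipfile.ZipFile("policy.docx") as z:
    root = ET.fromstring(z.read("word/document.xml"))  # main document part

# Every <w:p> element is a paragraph: structure, even if it is shallow.
paragraphs = [
    "".join(t.text or "" for t in p.iter(f"{W}t"))  # concatenate text runs
    for p in root.iter(f"{W}p")
]
print(f"{len(paragraphs)} paragraphs found")
```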
And the process of trying to understand and better utilize our less-than-perfect tabular data, our non-tabular data, is something we can approach. I have a couple of papers on the website about this; Mark, remind me and I'll try to put one at the end of this so you can add it to the PDF you send out to folks. The key is to be realistic about it. If I've got a Word document that's all digital, it has a structure. It may not be a good structure, but it has one, and if I understand what that structure is, I can decide whether to use it. But would it be valuable to the organization to take 100,000 Word documents and add additional structure to them for the purpose of getting $5 of value? Probably not, right? So again, we want to use the right measures here. If somebody says, if you could capture structure on this, the information in here might help us do better sales, or whatever else we're trying to do, then we can get something going and start to move toward that. Mike, what has the structured-versus-unstructured conversation been like with your customers? Yes, I actually have a great story about a customer I recently worked with. For those of you in the financial services or insurance space: in insurance especially, interaction with customers has long been via policies. So you may have an underwriter who works with a brokerage; the broker knows the customer, but the underwriter really just knows the policy. Well, that is rapidly changing, and these underwriters and managers of policies want to start viewing people as customers, but all they have is a big document, the policy document, and they want to serve you better when you call into their customer support line. We have done some work with them using natural language processing to strip out sections and impose some structure on those policy documents, knowing that names tend to be in certain places and beneficiaries tend to be in certain places, and to use that data to cut in half the time it takes to move a caller from the IVR automated call center to the right department. So they're getting savings on their IVR customer service center, but what they're really doing is taking relatively unstructured data inside policy documents and making customers out of it. Excellent, excellent. And you can also take the work you've done there and start to develop controlled vocabularies for use within the organization. I have another slide, and it's in the DMBOK, that says: for this insurance company, these are the definitions of the four major concepts we're going to use. It's another little piece of wall art that goes up so that when I say customer, these are the things I mean, and it means the same thing to everybody in the organization. So you can see from Mike's example that it's not just a matter of going in and fixing a field; it's a matter of understanding what value that document, or that sort of data, has to the organization, and then leveraging it as well as you can.
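A toy version of the section stripping Mike describes can make the idea tangible. The real project used natural language processing; this regex sketch only illustrates imposing structure on a loosely structured document, and the section headings and sample text are invented.

```python
# A toy version of the policy "section stripping" Mike describes. The real
# project used natural language processing; this regex sketch only shows
# the idea of imposing structure on a loosely structured document. The
# section headings and sample text are invented.
import re

policy_text = """
INSURED: Jane Q. Example
BENEFICIARIES: John Example; Mary Example
COVERAGE: Term life, $250,000
"""

SECTIONS = ("INSURED", "BENEFICIARIES", "COVERAGE")

def extract_sections(text: str) -> dict[str, str]:
    """Pull labeled sections out of a policy into name -> value pairs."""
    found = {}
    for name in SECTIONS:
        match = re.search(rf"^{name}:\s*(.+)$", text, re.MULTILINE)
        if match:
            found[name.lower()] = match.group(1).strip()
    return found

print(extract_sections(policy_text))
# {'insured': 'Jane Q. Example', 'beneficiaries': 'John Example; Mary Example',
#  'coverage': 'Term life, $250,000'}
```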
Mark, you got a band camp story for us on that? I don't this time, but I do have a wonderful question from chat. It always sparks a discussion in my head and makes me think about architecture a little; it sounds like an innocent question, but it always leads to discussion. Should data be cleaned before it gets integrated into a CRM? Well, I would say yes. Mike? I would say yes, if you can. There you go. It gets into whether it's a bug or a feature to have bad-quality data on the inbound side. I know that for the sales teams I work with, if you slow them down on data entry, or on cleaning data as they're getting it into the CRM in order to make a sale, you're going to be in for a bad time. So we tend to do things on the back end to help clean stuff up, but it really depends on your organization, and it's key to find out what the business purpose is, something Mike referenced earlier. I'll tell the story of my cousin Fred, who's retired now and out at Smith Mountain Lake enjoying his retirement; he might be listening. He worked for Kraft Foods for many, many years, working four days a week, and he was one of their most successful sales associates. But every Friday he would spend the entire day converting his own system for managing sales into whatever CRM Kraft was using that particular year, and he would complain to me about it over and over: if I don't fill this out, they'll think I'm not working. I'm working, I'm selling, but I'm not filling out the data they think they need, because they don't understand the data needs of their salespeople. And I think that hits the nail on the head: if you can show that your salespeople are going to be better prepared to meet customer needs, that's an improvement in data quality, and it doesn't necessarily have anything to do with numbers being wrong or numbers being in the wrong place. It's a question of what sort of organizational support we're going to provide. So quality goes all the way into design thinking, if you get that flavor of things. Absolutely; a really, really good example. The reason I've had some interesting conversations about this, and I agree with both of you that yes, you generally should clean before you go into the CRM, is that my time in higher ed has seen CRM implementations used for recruiting, admissions, and prospecting students. So usually the CRM is the first point of entry for our data, and the true enterprise system is further down the line, whether that's something like PeopleSoft or Ellucian, or even Workday. So it becomes an interesting question: is that your first point of ingestion? Maybe that's the first point where you have an opportunity to clean data. But yeah, I largely agree that the vast majority of the time, it's where the action has already happened. And I gave a glib answer by saying yes, but Mike absolutely corrected me by saying yes, if possible, because it is absolutely the case that we've seen many instances where it's simply not feasible. But if you can get involved in the planning process earlier, you can say: look, I know you want to go live on January 1st, but if we delay to February 1st, we can spend the month of January improving the quality of the data in our new CRM before we turn it on. Otherwise everybody looks at it and says, oh, Salesforce sucks. And we know that Salesforce does not suck; it's the data that's in Salesforce that sucks. But most customers can't parse that difference.
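Mike's point about not blocking the sale suggests a familiar pattern: accept the record as entered, attach quality flags, and let a back-end process work the flagged queue. A minimal sketch follows; the field names and validation rules are hypothetical.

```python
# A sketch of the "don't block the sale" pattern discussed above: accept
# the record as entered, attach data quality flags, and let a back-end
# process work the flagged queue later. Field names and rules are
# hypothetical.
import re

def quality_flags(record: dict) -> list[str]:
    """Return data quality flags without rejecting the record."""
    flags = []
    if "@" not in record.get("email", ""):
        flags.append("BAD_EMAIL")
    if not re.fullmatch(r"\+?[\d\-\s()]{7,15}", record.get("phone", "")):
        flags.append("BAD_PHONE")
    if not record.get("postal_code"):
        flags.append("MISSING_POSTAL_CODE")
    return flags

lead = {"name": "Acme Corp", "email": "sales-at-acme", "phone": "555-0100"}
lead["dq_flags"] = quality_flags(lead)  # the record is saved either way
print(lead["dq_flags"])  # ['BAD_EMAIL', 'MISSING_POSTAL_CODE']
```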
I keep thinking Salesforce ought to be the sponsor of these things, Mark, because then they could sell a whole lot more. But they deal with government contracts; they don't have to worry about quality data. A little dig at the process there. To that end, I also think it's a good opportunity to look at what data you're trying to collect. I can't count the number of times a customer has come to me and said, we have really bad data, horrible, horrible data. And I ask, well, what are you using it for? In your CRM, do sales teams use this? Do marketing teams use this? And we find out that nobody's using it. They're measuring data and saying it's really bad and horrible, but no one even wants it. So let's stop trying to add to it, and let's stop trying to fix data that no one cares about. Absolutely; that's non-value-added work. Well, I think we have time for just one more question, so we'll get into this one. How can we prioritize data quality issues within an organization when there tend to be differing opinions among the relevant stakeholders? Are there best practices, or particular questions we need to be asking? Have you ever implemented a ranking or scoring matrix to determine which data quality issues should be focused on first? Absolutely, and there are a lot of different approaches, but I like Plan-Do-Check-Act as the most basic one. You can go study all sorts of PhD dissertations on the right ordering of these various pieces. What it really comes down to, and Mike, I'm pretty sure you'll agree with me on this, is that you've got to find the person who's in the know about how this data is actually used, and then find out from them what they look at. Most models are over-complicated, and most data quality initiatives are more broadly focused than they need to be in order to actually produce the numbers. On the other hand, you don't want to turn around and say, okay, all we need is different measurements, or more granular levels of measurement. There are a lot of simple solutions that people reach for very easily that correctly solve the wrong problem, and that's something we want to avoid as best we can. Again, Mike, Mark, any other questions or comments? What have you seen in terms of ranking? I agree with everything you said, and that's a good answer, but the real question is: what saves you the most money or reduces the most risk for your organization? Inevitably, when you can go to your CDO, or to the CMO who's funding a program, and say, I can increase your sales conversions by 10 percent, or go to your logistics team and say, I can save you 10 percent on shipping costs, that thing gets prioritized, right? If you can tie dollars to it, it gets funded; if you can't, it doesn't. That's the reality of the world. Let's be real clear. Exactly.
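One hypothetical way to turn "tie dollars to it" into the ranking matrix the questioner asked about is to score each issue by estimated annual dollar impact relative to remediation cost. The issues and figures below are invented purely for illustration.

```python
# One hypothetical way to operationalize "tie dollars to it": rank each
# data quality issue by estimated annual impact divided by remediation
# cost. The issues and numbers below are invented for illustration.
issues = [
    # (issue, estimated annual impact $, estimated remediation cost $)
    ("duplicate customer records",  400_000, 80_000),
    ("bad shipping addresses",      250_000, 30_000),
    ("stale contact phone numbers",  60_000, 50_000),
]

for name, impact, cost in sorted(issues, key=lambda i: i[1] / i[2], reverse=True):
    print(f"{name}: ${impact:,}/yr impact, ${cost:,} to fix, ratio {impact / cost:.1f}")
# bad shipping addresses: ratio 8.3   <- fund this first
# duplicate customer records: ratio 5.0
# stale contact phone numbers: ratio 1.2
```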
Sorry, Mark, go ahead. I totally agree with what Mike said, and this is something I've been doing for about 20 years now: it's always that connection to the business that's the most important thing. You can't just say, I'm going to fix data quality for data quality's sake. If you're fixing data that nobody cares about, nobody's going to care about you after a while. If you're connected to something the business is doing, actually connected to a strategic goal in somebody's strategic plan, or to an important part of how the business runs and makes money, then not only are you going to get funding to solve that problem, you're also going to have stewards out in the wild who go out and fix data quality problems. You're going to have that momentum and action, and that's what's going to make your program work. In my opinion. Sorry, you got me on my soapbox. Absolutely, and we love it. This is why we like our community so much: you all come up with all sorts of great ideas. All right, that's all we have time for today. Thank you so much, Peter and Mike, for this great presentation and Q&A. Just to remind everyone, we will be posting the recorded webinar and slides to dataversity.net within two business days, and we will send out a follow-up email with the links and other requested information. Thank you again for attending today's webinar, and I hope everyone has a wonderful day. Mike, thanks so much. Mark, thanks for hosting us, and thank you all for tuning in. We really appreciate it. Yeah, thanks, everybody.