My name is Victoria Soslasky and I'm the Webinar Production Assistant for DataVersity. We would like to thank you for joining today's DataVersity webinar, Data Quality Engineering. Today's April edition of our monthly series, DataEd Online with Dr. Peter Aiken, is brought to you in partnership with Data Blueprint. Now we'll turn the floor over to Megan Jacobs, the Webinar Coordinator from Data Blueprint, to introduce our speaker for today's webinar. Megan? Thanks, Victoria. Hello everyone and welcome. My name is Megan Jacobs and I'm the Webinar Coordinator here at Data Blueprint. We are pleased that you have taken the time to join us for today's webinar on Data Quality Engineering. As always, a big thank you goes out to DataVersity for hosting us. We will get started in just a few moments; first, let me cover a few housekeeping items and introduce your presenter. We have a one-hour presentation followed by 30 minutes of Q&A. We're going to answer as many questions as time allows, but feel free to submit questions as they come up throughout the session. To the two most commonly asked questions: yes, you will receive an email with links to download today's materials and any other information you request during the session within the next two business days. You can find us on Twitter, Facebook, and LinkedIn. We use the hashtag DataEd on Twitter, so if you're logged on, feel free to use it in your tweets and submit your questions and comments that way. I will keep an eye on the Twitter feed, and we'll include answers to those questions in our post-session email. Now, let me introduce you to our presenter. Peter Aiken is an internationally recognized data management thought leader. Many of you already know him or have seen him at conferences worldwide. He has more than 30 years of experience and has received many awards for his outstanding contributions to the profession. He's also the founding director of Data Blueprint. He has written dozens of articles and eight books; the most recent is Monetizing Data Management. Peter has experience with more than 500 data management practices in 20 countries and is consistently named a top data management expert. Some of the most important and largest organizations in the world have sought out his and Data Blueprint's expertise. Peter has worked with groups as diverse as the U.S. Department of Defense, Deutsche Bank, Nokia, Wells Fargo, the Commonwealth of Virginia, and Walmart. He often appears at conferences and is constantly traveling. Peter, what are you up to this week? I am the guest of the Federal Communications Commission this week; and when we talk about dirty data, we don't mean the really dirty stuff. I'm just in town for a little bit of work. I'm sitting in a beautiful office at FCC headquarters looking out over the Potomac, watching the airplanes take off from the Washington airport. It's kind of fun, because everybody ends up with some interesting data quality challenges, and the FCC is no different; they have data quality challenges, as everybody does. What we're going to talk about today: we'll start out with a data management overview, and I'm going to give you a definition of data quality engineering with a very specific example so that you understand contextually what we're talking about here. Then we're going to go through a data quality engineering cycle, and we'll talk about some complications that occur with that. The main one is, of course, that data quality problems manifest themselves in a myriad of ways.
Because they manifest themselves in a bunch of different ways, it requires a bunch of different methods to address them. From there, we'll dive into the causes and dimensions of data quality, because there is a cause-and-effect relationship that we need to understand in order to know which group of techniques to apply as we're looking at this. We'll then talk about quality and the data lifecycle, and about how our thinking on that has evolved over time, and finish up at the top of the hour with a discussion of data quality engineering tools and techniques. And then, of course, the part I look forward to is your questions and answers, because that's how we take the pulse of what's going on out there in the world, based on the types of questions you bring to these sessions. Whenever you do, we enhance the presentation. So thank you for giving us feedback and telling us the things that you're interested in. Let's move right in. First of all, I always start with this, because most people don't get this particular picture when they're talking about data. Remember, data management is the process of grabbing data at the source and getting it all the way to the destination as quickly and effectively as possible. This rather complicated diagram, which we tend not to share with the managers, is what it means to do data management right. There's a lot more detail that goes on around this, and I'm going to simplify it for all of us on the call here into five areas. The first one is making sure that we're all literally singing off the same sheet of paper. The idea here is, of course, that you have lots of good data management initiatives going on in your respective organizations, and these people are very good at what they do and they try really, really hard to do it well. However, one person can only accomplish so much. Everybody pulling together in the organization and focusing on a very specific goal can accomplish much, much more. So it's getting everybody on that same sheet of paper and trying to attack the same goals in whatever budget cycle we're working on. The second area of data management is organizational data integration. The idea here is, of course, that we're sharing data across boundaries, whether it's from one program, as in a software program, to another program; whether we're moving data from transactional systems to a warehouse; whether we're sharing data between different departments of the organization; or whether we're sharing data between our organization and our business partners. We're sharing data across boundaries, and if that process is happening unmanaged, there's no way it can be optimized. And if it can't be optimized, it means you are losing resources. Our research shows that you're spending 20% to 40% of your IT budget working in these two areas at the top. This doesn't necessarily mean new investment; in fact, if you look at it from the positive angle, if you save some of that money by doing this, you will actually end up with your organization saying thank you. The question of who to say thank you to is an interesting one as well, and what we typically think of in today's environment are the data stewards. They are the people who are assigned responsibility for some portion of the data. Unfortunately, in most organizations, people think of data quality as being taken care of by these data stewards. This is not really true in reality.
And if we don't reward people for fixing things, then they will not fix things. What we would need in this situation is for someone to say, Peter, your next raise, your next promotion, your next whatever is going to depend on you measurably increasing the quality of this portion of the data. This never happens. So these data stewardship functions, while they are relatively new, are now becoming more commonplace; more organizations have them now than do not. The fourth area is data engineering, which is the process of understanding how to build data systems. In the past, this has always meant building relational database management systems. In today's environment, we have a lot of other things that we can bring into play: different data delivery systems, all kinds of alternative portals, XML, different things to work with there. And finally, data systems, just like our automobile and transportation systems, need constant maintenance. If we don't maintain them, they lose the ability to continue to deliver; they become out of tune, they become bloated, they become inefficient. Again, this is what we mean by integrating these five data management practice areas. There is an important concept here as well, which is that most people think data management is what goes in this green triangle that I have on the screen. The triangle represents, really, in our mind, the tip of the iceberg. We hear about the silver bullet technologies, things like cloud and master data management, being the answer to whatever it is that we're talking about. Unfortunately, though, these things really are just the tip of the iceberg, or, as we also talk about them, the top of Maslow's hierarchy of needs. If your food, clothing, and shelter needs are unmet, it is very unlikely that you will turn around and self-actualize at the top of the hierarchy. And data management is exactly the same way. Without the basic data management practices that I've taken off that previous chart and put on the bottom here, really the bottom of the iceberg, the part that most people don't appreciate, there's absolutely no way that you can succeed at the top within the constraints that most organizations have. The question from a lot of organizations, after we explain this to them, is: why can't we skip that bottom part? Can I just do the stuff in the green triangle? And the answer is, absolutely, yes, you can do it, but it will take longer, cost more, deliver less, and present greater risk to the organization than if instead you learn to crawl, walk, and run your way to the top of that pyramid. Some of you are also familiar with the DAMA Body of Knowledge, the DMBOK as we call it. It's analogous to the PMBOK, and it has in fact become the de facto standard for talking about this. And I have to tell you a funny little story on this, because it's noted in my diagram here. The data quality piece is actually the last chapter in the DMBOK. And if there's one thing we shouldn't have done, it was to put data quality as the last chapter. You can see it right up there at 11 o'clock on your screen, talking about data quality management. Right now we are taking the DMBOK through a revision, and that chapter is the last chapter; we're going to fix that, because we simply can't let the data quality piece be last. It's just one of those ironic things that is hilarious for those of us who put it together. There are about 2,000 people worldwide who have passed our CDMP exam. It doesn't yet compare to the
tens of thousands who have gone through and passed their PMP certification, the Project Management Professional certification. But we are starting to see places like the Federal Government here in Washington, D.C., where they are asking for people who are qualified, who are certified, to be able to work in this area. So this is our attempt to start working on this professionalism perspective. Now, the DMBOK, the data management body of knowledge, has a chapter on data quality that in fact starts out exactly like this, with what you're looking at as an input-output diagram. There are the inputs, you can see them on the left-hand side; the activities, which are in the teal box right there; the outputs, the deliverables; and of course we have goals and participants on the top and the bottom. This basically comprises the subject material that we're going to talk about. And I've also superimposed those same data management practice areas in there, so hopefully that gives you an idea of the connection all the way around on these things. Let's move on a little bit now and get to a definition. First of all, we always ask the question: if we're going to talk about data quality, what do we mean by data, just for starters? And the answer that I give to most people is the number 42, in which case people usually look at me the way a dog looks at you when you try to explain something to it in English. But 42, for those of you who don't remember, is the answer to life, the universe, and everything. That's from The Hitchhiker's Guide to the Galaxy. It's a really fun little book. Some of you on this call, undoubtedly, are smiling with fond memories of a really good read. The rest of you are thinking, Peter's crazy. Well, Peter may be crazy, but what this does is give a little bit of meaning to the number 42, which, before you saw this slide, was just a fact. Now you know, at least in Peter's mind, and for those of us who have read The Hitchhiker's Guide to the Galaxy, that 42 is the answer to life, the universe, and everything. And that is what we mean by data: a fact by itself is kind of useless. It's also my age 30 years ago, not that that would have anything to do with anything you're doing from a business perspective, but it's the same fact with a different meaning. When we pair facts and meanings with a request, we get what we call information. And here is a big mistake we make in IT: we think that that first understanding, the initial understanding that we get from most of our requirements exercises, actually represents what it is that we're doing with the data in the business. It turns out it's actually a lot more complicated than that. We have to go to a third layer, which is intelligence, the strategic use of the data. Once we've gone through that iteration, we understand that it's not what we think we need to use it for; it's how we observe the business, the organization, actually using this particular data. Now, you'll also notice here that there is a structure to this. It is impossible to make intelligent use of data without first understanding how the data is structured, which is a bit about getting facts and meanings together into information, and how that information is managed internally in your organization. Again, hopefully this chart will help you. I've had a lot of people ask about it. I did not create it; if you look at the bottom right-hand side, a fellow named Dan Appleton, whom I worked with in the 1980s, did a great job of explaining this, and everybody loves this particular chart.
So now you know what data is. Let's talk about what data quality is. The simple definition that we use, and that most everybody I know who works in this space uses, is data that is fit for use. Fit for use means that it's okay to use when you need to use it. Again, we're talking about a lifecycle: from where the data is captured until the data is used. Everything in between, we call data management. I'm going to tell you a little story about spinach here that I find absolutely hilarious. You know, there were people for years and years who said eat your spinach, because spinach is high in certain vitamins. You know what? That's actually based on a data quality error. The error was made here in Washington by a group of federal agencies: they made a mistake when they were calculating, and they thought that spinach was 10 times more potent in its vitamin and mineral content than other vegetables. It turned out it's not. So all that talk about spinach for years and years and years was absolutely based on a data quality error. They corrected it, but of course the message was already out there. It is also appropriate to talk about information quality, and really the two are synonymous. I would prefer people choose one or the other and not try to split hairs between them. One of the messages of the diagram from the previous page is that if you try to manage them separately, you create more work for yourself than if you manage them as an integrated whole. Now we know what data quality is; let's talk about data quality management, which is the set of activities that apply between the time we collect data and the time we deliver data, and that allow us to improve, assess, measure, and ensure data quality. It also means we have some roles and responsibilities that go with it. It is absolutely critical in this context to make sure that you have support for change management, and also to understand that this is not something you do once. You do different levels of it, but data quality management is a continuous process. Let's move on to the term that I really like, which is where most of this understanding stops, and that is that data quality must be engineered. There is a discipline that applies to it, just the same way the building I'm sitting in, overlooking the Potomac, has some engineering components that go with it. If it were not properly engineered, this building would slide into the Potomac River. Again, we have definitions of engineering, but these concepts of engineering are not really understood either within IT or the business, and that means our job as data quality engineers is a little bit harder; we have to do a little bit of education around this. I'll give you a very specific example. This is one we did across the river here for the Defense Logistics Agency, and the challenge they had was that there were millions of NSNs, national stock numbers, or, as non-government folks know them, SKUs, stock keeping units, that were maintained in a catalog. They were migrating this system to a big ERP, and they needed to get the data out of this catalog, but they couldn't just export it because of the way the data was stored. Millions of NSNs were stored in clear-text content fields, blobs, if you will, that were out there. So they made a decision that they should approach this manually and have individuals sit down and go through each of the 2 million NSNs and SKUs to pull this information out. Now, that would have extracted the information, but it wouldn't have helped them with the structuring problem.
What I want you to understand is that what we did in this case is what we would now call text analytics. Many people are very fond of that particular term, and it's a very good term; we didn't call it that at the time. What we were really doing was converting this non-tabular data stored in the content fields into tabular data that we could put back into the database and query. And I was really proud of this, not just because it saved the government $5 million; as a taxpayer I'm, of course, all about saving the government money, but it was the first time, in all the time I had done this work, where we saved people a person-century of work. Now, I'm going to show you this chart here, which is a little complicated; it's got a bunch of numbers on it. But I'm also going to show you how we figured out how to identify the point of diminishing returns, where the problem ceases to be an engineering problem and can be turned back over to manual intervention. The numbers are pretty easy. We held the left-hand column constant: it was two of our data engineers' time, working 20 hours a week on the problem. So it was a fixed cost. And the first week, we were pretty bad; we didn't match anything. Of course, this means we have to manage the expectations game, because we don't want anybody to think we're going to solve the problem in the absolute first week we approach it. On the other hand, by week four, we had identified 55% of the data that needed to be extracted and been able to demonstrate that we could extract it. So after four weeks, we had solved half the problem, a little more than half of the problem. In addition to that, we'd also determined that a set of data in there was ignorable, and you can see that represents 12% of the problem. And we were then tracking our unmatched data, what's left over. You can see it went up and down, which meant we were still fine-tuning our algorithms for data quality. Now, the question was, by week four we had solved half the problem; when do we get to the point of diminishing returns? When could we identify specifically that we were getting less out of it than we were putting into it? That's the point at which we wanted to stop. It turned out for us that was week 18. You can see that by week 18, we had actually gotten the unmatched problem down to 7.46%; that portion of the data was simply unrecognizable by our text-parsing algorithms. Our ignorable items, however, had gone very flat. We had gotten them to 22.6%, so we could ignore about one-fifth of the problem right away. And you can see our proprietary improvement algorithm got to 70% of the problem. So if I can ignore 22% of the data and solve 70% of it, my original problem here is much smaller than it was when it started out. For the cost side of this project, we used a measure of five minutes per item to clean manually. You can see we went through all of the calculations here, but the important one is the one that's on the screen: the total came to 93.6 person-years. We multiplied that by a typical salary and came up with $5.5 million. And one of the things you do with these numbers is present this to management as a plan, and they say, oh, great, that's a wonderful idea. But what's the most important number on this chart? It turns out the answer is five minutes. I don't have any data quality engineers who can fix these problems in five minutes apiece. So if I double that from five minutes to 10 minutes, I have two person-centuries and $10 million.
If I triple it to 15 minutes, again, that's three person-centuries and $15 million. (A quick sketch of that arithmetic appears below, after the polling results.) Now, I told you a very simple story, but I want to be careful and not leave you with misconceptions about this. This is a wonderful segue to one of the typical things that you see in the professional trade magazines, where it says, hey, you can fix the problem: data quality is an IT problem; the problem is with the sources of data entry; a data warehouse will give you a single version of the truth; the new system will give you a single version of the truth; standardization will eliminate these problems. All of these things have some truth to them, but they are not going to get to the root of the problem. And the reason for that is because data quality manifests itself in a number of different ways. You've all probably heard the parable about the blind men and the elephant. Somebody touches the side of the elephant and says an elephant is a wall. Somebody else grabs the tusk and says, no, it's a spear. Different people from different backgrounds correctly understand data quality in very, very different ways. This leads to confusion and disputes over perspectives, which means we can't address these challenges holistically within the organization. Remember, data is our sole non-depletable, non-degrading, durable strategic asset, and it needs to be measured and managed as such. The solution is, of course, data quality engineering, which can help us get a more complete picture and help to ground our communications. The focus here is on allowing the form of the problem to guide the form of the solution and to provide some means of decomposing the problem. It features a variety of tools for simplifying understanding of the system, it offers a set of strategies for evolving the design solution, and it provides criteria for evaluating the work. I won't read these; you get the picture. But we're going to ask you guys a question. We do have polling questions in here, and we want to know, because we track these things year over year: does your organization plan to address, or is it already addressing, this particular challenge? We'd like you to take just 30 seconds here and give us your feedback so we can help to orient our future programs around this. So take a minute and just say, hey, did we do it last year? Are we doing it this year? Will we do it next year? Do we hope to next year? We'd love to get this information. Victoria, it looks like I can vote on this one. Maybe I'll do that. Probably not a good idea. I'll leave the question up here. I'd play the Jeopardy theme music, but you guys know what that sounds like, so I won't. We will pop the results up here, of course, as soon as we finish. I can't see the timer on this, but we should be at 30 seconds here. All right, it looks like we have all the results in. Let me get the results popped up for you. Give me one second. Okay, hard at work behind the scenes here; somebody's counting the votes manually, right, Megan? Oh, yeah. Yep. There we go. There we go. Okay. So again: we did it last year, 15% of you; we are doing it this year, 33%; we will next year, 4%; we hope to next year, 12%. And 36% of you are keeping your heads in the sand because you don't want to get in trouble with your boss, so we certainly appreciate that. Again, thank you for the input there; that's very interesting. Actually, it seems a little down from last year, so hopefully we won't see that continue.
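To make the arithmetic from the DLA example concrete, here is a minimal sketch in Python. The work-hours-per-year and loaded-salary figures are assumptions chosen so the output roughly reproduces the numbers quoted in the talk (about 93.6 person-years and $5.5 million at five minutes per item); they are not figures from the actual project.

```python
# Minimal sketch of the manual-remediation cost arithmetic: items x minutes-per-item,
# converted to person-years and dollars. The constants below are illustrative assumptions.

WORK_HOURS_PER_YEAR = 1_780   # assumed productive hours per person per year
LOADED_SALARY = 58_750        # assumed annual cost per person, in dollars

def manual_remediation_cost(items: int, minutes_per_item: float) -> tuple[float, float]:
    """Return (person_years, dollars) to fix `items` records by hand."""
    hours = items * minutes_per_item / 60
    person_years = hours / WORK_HOURS_PER_YEAR
    return person_years, person_years * LOADED_SALARY

if __name__ == "__main__":
    for minutes in (5, 10, 15):          # the sensitivity discussed in the talk
        years, dollars = manual_remediation_cost(2_000_000, minutes)
        print(f"{minutes:>2} min/item -> {years:7.1f} person-years, ${dollars:,.0f}")
```

The sensitivity loop makes the point Peter is driving at: the minutes-per-item assumption dominates the business case, so doubling or tripling it roughly doubles or triples the cost of the manual alternative.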
Here's another example of data quality, an absolutely infamous example. It made the rounds in Japan and, actually, worldwide in the financial community. A company called Mizuho Securities had a trade that they wanted to execute, and the trade was to sell one share for 600,000 yen; yen come many to the dollar, so that's about 3,000 English pounds in this particular example. However, the trader had a bad day and instead sold 600,000 shares of the stock, J-Com, for one yen each. The result was roughly a $350 million loss. The in-house system did not do limit checking, which is data quality problem number one. The Tokyo Stock Exchange did not have limit checking either. Limit checking would have prevented anybody from trying to sell the stock for as little as one yen, which, as you can tell from the math, was a tiny fraction of its value. However, what they also didn't have was the ability to cancel orders. They tried to cancel the order, there was simply no way to cancel it, they were not able to do so, and the trading firm, in this case Mizuho Securities, ended up eating the whole thing. Now, back to the trade magazines, where people don't really get this stuff. You see these articles again: well, here's four ways to make your data sparkle, right? Prioritize the task, great. Involve the data owners, absolutely. Keep future data clean, easier said than done. And align your staff with the business units. All of which are important, but let's look at it in a little more detailed way here. The data quality engineering cycle follows the Deming cycle of plan, do, check, act, or plan, do, study, act, depending on which way you like to do it. The idea is, of course, that we're going to, first of all, identify the issues, define the business requirements for quality data, identify the key data quality dimensions, and define the business rules that are crucial. Again, the idea is that you don't just jump into these things; you plan an assessment of the current state so that we understand what we're attempting to do and why, in very precise terms, because these data quality challenges can get out of hand very, very quickly. And when they do get out of hand, you end up spending a lot more money on something than you intended to, when you really just needed to fix something very quickly. This helps us scope the problem and determine costs and impacts, and I think this is a good way of evaluating the alternatives that come up. As we deploy or plan this thing, we then use a series of data quality tools, which we'll talk about at the end of the hour, to inspect and monitor what's happening. But we can't look at just the data. We also have to be holistic about it and find out what processes are causing the data to, in fact, become incorrect on the way in. If we don't understand those, there's not a chance we will ever do anything other than continue to bail water out of a leaky canoe. I'll get to the lake analogy in two more slides. Again, most people think of data as a big lake that's out there that they're using. We need to put in place things that allow us to keep track of the data quality so that we understand it as it's moving through our systems and can double-check it. We've been very successful in our processes of helping organizations maintain better quality data, which provides better fuel for their economic engines.
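The missing safeguard in the Mizuho story is easy to illustrate. Below is a minimal, hypothetical sketch of a pre-trade limit check in Python; the thresholds, the Order shape, and the figures used in the example call are illustrative assumptions, not the exchange's actual rules.

```python
# Hypothetical pre-trade limit check of the kind that was missing in the
# Mizuho/J-Com incident: reject orders whose price or size is implausible.

from dataclasses import dataclass

@dataclass
class Order:
    symbol: str
    side: str        # "buy" or "sell"
    quantity: int    # number of shares
    price: float     # price per share, in yen

def limit_check(order: Order, last_trade_price: float, shares_outstanding: int) -> list[str]:
    """Return a list of violations; an empty list means the order passes."""
    violations = []
    # Price sanity: reject prices wildly far from the last traded price.
    if not (0.5 * last_trade_price <= order.price <= 2.0 * last_trade_price):
        violations.append(f"price {order.price} outside 50%-200% of last trade {last_trade_price}")
    # Size sanity: reject orders for more shares than actually exist.
    if order.quantity > shares_outstanding:
        violations.append(f"quantity {order.quantity} exceeds shares outstanding {shares_outstanding}")
    return violations

# The fat-finger order: 600,000 shares at 1 yen instead of 1 share at 600,000 yen.
bad_order = Order("J-COM", "sell", quantity=600_000, price=1.0)
print(limit_check(bad_order, last_trade_price=600_000, shares_outstanding=15_000))
```

Either check alone would have flagged the order before it reached the market, which is exactly the kind of inexpensive, practice-oriented control the talk is arguing for.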
And once we've identified what these problems are, we have to act, not just on the data quality portion, but with the process re-engineering pieces as well, along with any management, education, and other types of issues that come into play. Now, these concepts, as I mentioned before, and I want to give full credit here to our good friend Tom Redman, who as far as we know is the person who came up with this lake analogy. Data is, in fact, a lake in many organizations, and while there may be water quality problems in the lake, in many cases it's more effective to go upstream and find the source of the pollution than it is to continue to correct them at the lake. The other thing is, all of the data does not have to be perfect. We talk about Pareto analysis in many cases, which is the 80-20 rule, and it turns out that something like 80% of the data in most organizations is ROT: it is redundant, it is obsolete, or it is trivial. And if that's the case, why would we spend any time managing it at all? This allows us to bring scientific, economic, and social as well as practical knowledge to bear. And we do see organizations that understand the various risks. This is a document from a project that we were brought in to work on, and notice here that they identified the quality of the conversion data as a very, very high risk. So people are starting to get it, starting to understand this as they go into it. Let's move on a little bit to causes and dimensions here. The real key is that there are two distinct sets of activities that support this. First, data quality depends on practice-oriented activities, where you are working with the functional areas of the business to capture and manipulate the data. But there is another source of hidden problems, and if I hadn't already used the iceberg analogy, I would use it here as well: structure-oriented activities. If you do not understand that, you are missing an enormous part of what goes on in data quality; quality data can only come as a result of both of these. Let's take a look really quickly at practice-oriented activities here, focusing on the quality of the data values. These are things like edit masking, range checking of the input data, and CRC checking to make sure complete data has come across to you. Now, again, let me dive in here with an example. We have some organizations that have input data quality problems, and instead of having a well-engineered input system, what they have instead is a free-form text box. One of our favorite things that we observe in many organizations is that they'll put in a new system, and then they'll say, okay, now we're going to open it up to the users and ask the users to update their own data. That's probably the fastest way of getting to good quality data, isn't it? Doesn't it seem reasonable? It absolutely does, but measurement after measurement after measurement, some of which are shown in the table here, shows us that when you open up and tell people to update their own data, the data ends up of lower quality. It's generally not a very good idea to do this; it allows imprecise, incorrect data to be introduced. I heard an example last night that I was just fascinated with, from a colleague who works with Medicare and Medicaid data, a very, very interesting person in the healthcare world. He told me a story that in Richmond, Virginia, there are three women with the name Mary Washington who were born on the exact same day.
Now, most of the time when people look at this, they say, well, there can't possibly be three people in the city with the same name and the same birthday, but statistically it bears out quite well that a metropolitan area of about two million people would have three women with the same name who were born on the exact same day. They're talking to each other now because, of course, Medicare and Medicaid keep calling them, saying, hey, about your diabetes, and they go, no, no, that's the other Mary Washington you're trying to reach. Well, again, practice-oriented activities would certainly help us here. One of the things we would clearly not want to do under those circumstances is set it up so that Mary Washington's identifier is built from her name and her birthday. You know, that sounds silly, but the country of Brazil actually did that with a presidential election a decade ago, which meant that if you had a set of identical twins, they were going to have a problem unless their names were significantly different. Let's move on and talk about structure-oriented activities. Again, half of it is practice, half of it is structure. The structure side talks to us about the quality of the models that we're putting data in, the data structures that we're putting it in, and the quality of the data architecture that supports this particular organization. You cannot be complete in your data quality engineering activities unless you incorporate both of these. Symptoms that you'll see occasionally are: the data is in the system, but we can't access it; or the system does not provide the correct data value in response to a given query, because the right data is simply inaccessible. This is a challenge, and this gets to development methods as well. Developers tend to focus within the boundaries of the system instead of across organizational boundaries. That's a very natural thing for them to do, and it encourages additional data silos. This affects the model quality and the architecture quality, and I'm going to go one step further on this as well: we do a very poor job in general, and I'm speaking of the vast majority of the academic programs out there, and I'm a university professor, so I get to critique ourselves as well. We do a terrible job in most colleges and universities of talking about enterprise-wide data management and its implications. Instead, what we do is teach your future IT people how to build databases. So that's what they do when they get into the real world: they build new databases. Believe me, if there's a skill on the planet that we do not need more of, it's how to build databases. If instead we taught students how to integrate existing databases, then they would come to you all, the business folks, and say, hey, where can I go get some data to reuse in this particular piece of work? And I'll go a step further. I'd say that you shouldn't spend a dime on any sort of software development at all if you cannot answer the following question: which items are you going to reuse from our existing inventory, and which, because we can't find them in our existing inventory, do we need to create? That's a very simple question. And if you can't answer that question, you do not understand your data requirements and you should not spend a dime on software. I've had many, many people come to me after giving this talk and say, thank you.
I was able to stop IT projects that were clearly headed off in the wrong direction by simply pointing out to them that they didn't fully understand the data requirements, and that they would probably have to spend a couple of extra weeks understanding those requirements before diving in. Imagine you're rolling along in a scrum, all the fun that goes on in a scrum, the way people work with Agile, and Agile works really well, and then somebody says, oh, wait, we don't have the data. And you have to go find the data architect, and then you have forms to fill out because the governance process is involved. Imagine if that had been done up front. The way to speed up your development efforts is to take the data questions out of them: do not let projects start until they know exactly, up front, what data they need and how long it's going to take to get it. I've beaten that horse to death. You now see that data quality is a combination of four data quality dimensions. The quality of the data values and the quality of the representation are practice-related causes of quality. We also need to look at the quality of the data models and the quality of the data architecture; these are structure-related. And there's an analog in the real world. When you are building a house, the inspector comes by the house before you put the walls up, before you put the roof on, and does what is called a foundation inspection. If a foundation doesn't pass inspection, it means you should not spend any more money putting up walls and a roof, because the foundation is incapable of supporting the load that you're going to put on it. Similarly, we need a foundation inspection for these systems, because if the foundation is not sufficient, we will build and spend a lot of money on these things but not, in fact, deliver the value that the business is looking for. Let me give an example of this, and not to pick on SunTrust; they are our bank, they do our payroll every week, and I'm very pleased with SunTrust Bank, but I have to pull out a little error that occurred a couple of years ago. One of our engineers got a letter from SunTrust that gave him a gift card, and he opened it up and brought it around to me, knowing how much fun I would have with it, because the bank didn't know that they had made an error. We got on the phone with one of the bank's customer service representatives and asked, tell us where we can spend this gift card: can we go to the movies, can we buy food with it? And finally, at the end of the conversation, the customer service representative said, did we really send you a gift card for zero dollars? And we said, by God, that's exactly what happened here; you sent us a card for zero dollars. We were hoping it was an overrun and there was actually a billion dollars on the card, but no such luck. Of course, tools alone could not have prevented this error, although certainly some checking, harking back to the Mizuho example, would have made sense as well. And some people could lose confidence in a bank's ability to manage customer funds over something like this. We did not; with SunTrust, we were very pleased with them as a bank. It was an aberration, I'm absolutely certain of that. We've talked about these four dimensions of data quality, and again divided them up into the practice-oriented versus the structure-oriented qualities here. Again, the value quality is the quality of the data as it is stored and maintained in the system. The data representation quality, however, is different. It's how the data is shown.
So in the previous example, with that zero-dollar SunTrust card, I was trying to get them to say, oh, yeah, there really should have been a hundred dollars or a thousand dollars or a million dollars on that card. The value just doesn't show up very well when the card says zero dollars, in this particular instance. And then, of course, there is the model quality and the architecture quality, and we need to know how people understand these things. Again, if we only give them one course in the area, it's very unlikely that they will, in fact, be able to become expert at it. So we're reliant on organizations such as DAMA to, in fact, raise the quality of the profession and to help people understand these areas. If we're going to get into effective data quality engineering, again knowing that we have to understand all four of these dimensions, most organizations tend to focus in on what's happening close to the user. You can see the data representation quality because it shows up on your tablets, your smartphones, your web browsers, whatever it is that you're looking at. That's what people see, so that tends to be where most people respond to data quality problems. We can also say that these are related to data value problems, that is, how the data is maintained in the system. But again, the model quality and the architecture quality, which we've already talked about, matter just as much in terms of how the data can be used. Now, this is the same chart, but I've expanded it to show, in this case, the actual attributes. The attributes of data quality with respect to representation are listed there: complete, correct, and so on. I'm not going to read them for you; you can read them yourself. But here, again, you'll see on the left-hand side of this chart the things that are closer to the user, whereas the things on the right-hand side of the chart are closer to the architect; they are more abstract in nature. But notice the relationship between these various dimensions. A data architecture spawns one or more data models. Again, the quality of those data models is variable. Each data model spawns multiple data values, and each data value can then spawn multiple data representations. So if I'm working down here at the close-to-the-user end of things, on the left-hand side of this chart, it's kind of like being at the bottom of Niagara Falls and complaining about the quality of the water. Goodness, what am I going to do at the bottom of Niagara Falls about a data quality problem? I can't wait until the falls freeze over, although interestingly enough, this year they did for a long time. Global warming, climate change, whatever we're going to call it, is making things very, very interesting. Let's look at a particular structure-oriented problem just to illustrate this section. It turns out that the City of New York has an interesting problem in relation to their trees. There are about 2.5 million trees in New York City, and in the 11 months from 2009 to 2010, four people were seriously injured by trees falling in Central Park alone. Of course, every time that happens, it's in the New York Post. And, by the way, there are lawsuits against the city and all sorts of other things, never mind the fact that people are actually hurt. So we can put this in a risk management context very easily.
So the arborists, the people who work with these trees, believe that pruning and otherwise maintaining these trees can keep them healthy and make them more likely to withstand storms, wind, et cetera, decreasing the likelihood of damage. Until recently, though, they had no data to back that up. They took a big data approach to it, interestingly enough, and they called it the big tree problem, which was good. The question was: does pruning the trees in one year reduce the number of hazardous tree conditions in the following year? The first answer to that question was, we can't answer the question, because they had lots and lots of data, but the data was structured incorrectly. The pruning data was recorded by block. The cleanup data was recorded by address. And each tree, of course, had its own unique serial number; no, I'm joking. Trees, first of all, don't have serial numbers; there were no unique identifiers in there. And I imagine if the people in the City of New York tried to run around and label every tree, they would get all sorts of objections from the local residents. However, after downloading, cleaning, working with, analyzing, and intensively modeling what was going on with the system, the group was able to come back and say that pruning trees for certain types of hazards produced a noticeable 22% reduction in the number of times the department had to send out a crew on emergency overtime to clean up. Now, the analysis generates another series of questions; of course, that's what it did for this group. They went back and said, hey, we can't prune every block every year, so what can we do to do this well? As part of their data quality initiative in this area, they started building block profiles, excuse me, block risk profiles, so they could say this block is more at risk than that block, based on the number of trees, the type of trees, and whether the block is in a storm zone or other sorts of zones. So having the data at the right level of granularity, addressing these structure-oriented problems, helps New York City reduce the cost to taxpayers of having all the beautiful trees that they have in New York City. Now, we're going to talk about quality and the data lifecycle. I'll give you another example; this is from my own bank. The letter from my bank comes in and, notice, by the way, it is undated, just a little sketchy for starters. And if you look real carefully at that letter, you'll see it actually looks kind of wrinkly. The reason for that is that I took the letter and crumpled it up and threw it away. I said, I am not a Bank One customer; why am I getting a Bank One envelope? However, when I opened the mail, notice it says, by the way, you may have heard about a recent merger between Bank One and JPMorgan Chase. I actually hadn't, which is why I threw the letter away in the first place. Then I saw something on the news, and I was able to go back and retrieve the letter. Yes, that's how long I've been doing PowerPoint. So I pulled the letter out of the trash can, and you can look here and see it's got a little bit of a problem. Maybe they were just really, really interested in trying to use up the rather sophisticated Bank One envelopes and Bank One letterhead, but I don't think that was the case. Twice in the letter they repeat, please be on the lookout for any upcoming communications from either Chase or Bank One regarding your Bank One credit card, or any other Bank One product that you may have.
In other words, we'll put the onus on the customer. Not a good idea. So the problems are, of course, that I initially discarded the letter rather than reading it, and it seems to me that it's a burden on the customer to say, hey, we're merging, so be on the lookout; why not give it a little time and switch over before you write to me? Well, I'm going to blame the data: Chase Bank had some very significant data quality problems associated with the merger that have actually been well documented and addressed since that time. Chase is a very good organization from that perspective. Now let's get to the second polling question that we have here for you. Does your organization use what we call a structured or formal approach to information quality? Your three choices are: yes, they do; they sort of do, but not really; or simply no. So again, take 30 seconds here if you will, and Megan, let us know when we've got the answers to this, so we can then start to dive a little more closely into the quality cycles and lifecycles. All right, we've got some answers coming in. We've got about 20 seconds left until the poll closes, so I'll reveal the answers at that point. Okay, thanks for participating here. Some of you can't actually do it because you're driving in your cars and watching the webinar on your iPhone while you're moving down the highway; no, I'm just kidding, I would be appalled if you did that. Everybody's safely parked behind their cubicles somewhere, I'm sure, working on this stuff. All right. So about 20% of you say that you're working in a structured environment; that really mirrors our findings as well. About 22% of you are working on it, but it's kind of ad hoc, and the rest said no, not really, in the franker cases. So again, thank you for your input on that. Let's move on and dive into the data quality cycle. Now, I mentioned Tom Redman before. Tom is notable for a number of things. He's the first person to use the term data-driven, and the first person to get the Harvard Business Review to start talking specifically about data quality, and he's done some wonderful work there. This is his original concept, from his book on data quality, and it seems natural: you have data acquisition cycles, a storage component, and data usage cycles. But upon reflection, we've discovered that it's really a little bit more complicated than that. I'm going to give you our updated model on this and then finish it off in a way that I think will sit a little better with you, because this version is kind of complex. First of all, before you can create data, you have to create metadata. So metadata creation happens first, where you create data architectures and models. That moves us down into an activity we call metadata structuring, and then data creation can occur, where we populate the models and storage locations with actual data. That data can be stored in our organizations, and it can be utilized; when we utilize it, we may or may not manipulate it. If we manipulate the data values, they can be put back into storage, restored as values that are stored again. Then we can look at data assessment: we may want to evaluate the data, and that may lead us to manipulate it, so it can go down that path. We may also refine the data, where we correct the value defects and fix them going forward.
However, in many cases, and in our experience more often than not, it's the metadata itself that needs to be refined, and if we don't understand how that is refined and don't refine the models, we have no ability to address the structural causes that I stress so much in this webinar. Similarly, we need to refine the architecture as well in many cases. So again, be careful about all of that. This is our newer data quality cycle, showing the same diagram but now talking specifically about the quality issues around the cycle. The original drawing didn't end up looking quite right, if you will, so they had me redraw it as a nice circle this way. This is not quite as precise as before, but notice the two entry points in this new data quality cycle model. The upper left-hand corner is the starting point for new systems development: we create metadata, define the architecture and the models, and work our way through metadata structuring, data creation, data utilization, and manipulation. For existing systems, we instead start in the bottom right-hand corner with assessment, refinement, and then metadata restructuring. Now, this process is what we call data re-engineering. Re-engineering got to be kind of a bad term because it became a term of art for any kind of change in most organizations. That was really a bad idea, but what we did discover was that we needed to re-engineer this data. The full definition of re-engineering means first understanding the existing strengths and weaknesses of your data before you go back out and use it again, because you can't just blow up the old system. You need to look at it and say, what are the good things about it? Let's preserve them. And what are the bad things about it? Let's avoid them as well. That is what we mean by data re-engineering: first reverse-engineering the data and understanding it thoroughly before we go back out and put new systems in place. We'll have to keep an eye on the top of the hour, which we are 10 minutes away from, as we move on. So let's get to our third polling question. Do you use metamodels, modeling tools, or profiling to support your information quality efforts? We're just trying to get an idea of the penetration of these tools and vendors in general. I'm not sure why I have a picture of a person drinking on this slide; did I do that one? I believe you did. I have to watch my clip art sometimes. I'll tell you how it happens, by the way: while you guys are filling these in, I go out to Google Images and type in the keywords that are on here. So I probably put in something like information quality and got this guy drinking back. So that's clever. What have we got in the way of responses, Megan? They're still coming in; there's a little time left. Okay, let me get the whole poll up for you in just a second, and you can see what it is; it's a simple question again. Okay. So, 33% of you are using tools, which is great. We don't know about 30% of you, and about a third of you are not quite using tools yet. So let's dive in and look specifically at the tool sets that we have in this area. I have to tell you that the key was work that we did when I was at the Pentagon in the early 90s. We funded some research that went on to develop the basic algorithms around data profiling.
We did this because we had done a study at the Department of Defense that said we had some quality problems, as do most organizations, but the DoD was spending a lot more money than most organizations to deal with them. The work was done by a Ph.D. out of Columbia named Nina Bitten. She's a wonderful colleague, and we owe her our gratitude for helping to develop this particular approach. Notice I'm showing on the chart here that you can do this top-down or bottom-up, or, if you want to do both of them, that's great. Top-down, of course, comes from management; bottom-up comes from you all, the people who are really concerned with the data. It's easier to do it top-down because you have a lot of management support behind you, but many times you have to start out bottom-up. Now, what we want to do is look at measures around this strategy. Again, find a slice of the organization; don't try to be comprehensive. Pick an aspect of the business, one that's critical in terms of impact, and evaluate the data elements and the processes that are associated with it. Remember, you can't look at just the data; you have to look at how the data gets through these systems. You need to be able to list any associated data requirements that are there, which people probably don't know about as they're inputting the data. Then you want to associate each with a dimension of data quality, which will include or preclude certain specific approaches. You want to derive the process for measuring conformance and specify some sort of acceptability threshold. I'll tell you a quick story here about a hospital in the middle part of the country whose director came down and called an all-hands meeting and said to everybody, hey, you guys, we're going to do some really interesting work here at this hospital. We're going to do knee surgery. You've been doing so much knee surgery that it's just phenomenal; we're going to have tons and tons of it; we'll be the knee surgery capital of the Midwest. At which point the physicians in the room kind of looked around and said, I'm not doing that much knee surgery; are you doing knee surgery? No, I'm not doing that much knee surgery either. I wonder where that information came from. Well, guess what? The default hospital admission code was knee surgery. Oh, my goodness. Here's a very smart administrator who looked at the data they were getting and said, we're doing so much knee surgery, we need to expand our efforts, not realizing at all that this was a practice issue that could have been corrected very simply by allowing no default admission code, forcing people to pick something from the list, and probably producing higher quality data as a result. Now, an important part is evaluating the various data quality service levels, and I'm speaking in formal terms here about using service level agreements. If you can set these up, they put in place a standard for judging whether these things are correct or not. Otherwise you just say the data sucks. I was working with a good friend of mine this week who works at DMAS down in Richmond, and her story was very, very nice. She said, well, when I got here, they told me this data sucked, and I asked, what does the data sucking mean? They said, we don't know; our users don't like it; they tell us it sucks. So she went in and measured it and found out their data was only 2% in error.
It went from the data sucks to the data is 2% in error, and with 2% they can now make a business case as to whether they should in fact fix it or not. Again, you can put together the picture that you need here. If I've got an agreement in place, we can measure against that agreement. If we're not reaching our agreement, we can decide whether we need to relax our standards or whether, in fact, we need to increase the effort being made in these areas. Excuse me. Again, in measuring and monitoring data quality, we can look at it in the context of business rules. We can apply them at the granularity of data element values, instances or records, or the data set as a whole. Excuse me; sorry about the coughing, you guys. So here's our overview of data quality tools. They really fall into four categories: analysis tools, cleansing tools, enhancement tools, and monitoring tools. The principal technologies are profiling; parsing and standardization, and I've already given you a parsing example; data transformation; identity matching and resolution, which turns out to be a huge component of big data, so we'll probably get around to doing a big data and data quality webinar one of these days; and enhancement and reporting of data. Let's look at data profiling first of all here. These are the algorithms I described to you earlier, on which many tools are based. You can take these tools, look at any structure that contains data, and come away with a logical, third-normal-form model, which gives you an understanding of what the structure should be; you can compare it to the structure that is and decide what the right structure is. In addition to that, of course, you can look at the various values. If you're not familiar with these tools, they are worth checking out; they're often called discovery or data analysis tools. If you're a SAS programmer, think of PROC FREQ on steroids. Those are really, really good ways of describing this, or just think of using some of the functions of Excel in a pivot table. By the way, you do not need to spend a lot of money on these tools. Again, Excel will do a tremendous job for you if you're able to use it, even down to the iPad implementation of Excel, which is a really, really beautiful piece of work. Note to Microsoft: sorry, you blew it on PowerPoint and Word, but the Excel port to the iPad is a really, really nice piece of work. Again, we can understand the context and the semantic layers, we can actually do different types of modeling with it, and we come away with factual information from our profiling tools. One of our favorite tools here is from a company called Global IDs, which does everything that you see here. Interestingly enough, the company has put it together using big data technology, so we're using big data to address data quality problems, which I just love. It's one of the tools we've featured on our Tech Fridays here at Data Blueprint. The second technology is the parsing and standardization component. The idea is that we're trying to get to the point where we understand which of these things doesn't belong in this list, right? Take simple things like streets: there are four different ways of representing the word street out there in datasets, so if you're parsing for streets, you need to have four different queries, or at least four options that can be returned by that query.
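Peter's street example is easy to sketch in a few lines of Python. This is a minimal illustration of parsing and standardization, not any particular vendor's tool; the suffix map below is a hypothetical starting point, not a complete postal dictionary.

```python
# Minimal sketch of address standardization: map the several spellings of a
# street suffix onto one standard token so that parsing and matching see a
# single consistent value. The SUFFIX_MAP is illustrative only.

SUFFIX_MAP = {
    "st": "ST", "st.": "ST", "str": "ST", "street": "ST",
    "ave": "AVE", "ave.": "AVE", "av": "AVE", "avenue": "AVE",
}

def standardize_address(address: str) -> str:
    """Uppercase an address and replace known suffix variants with a standard form."""
    tokens = address.replace(",", " ").split()
    return " ".join(SUFFIX_MAP.get(t.lower(), t.upper()) for t in tokens)

# All four spellings of "street" normalize to the same value.
for raw in ("123 Main Street", "123 Main St.", "123 Main str", "123 MAIN ST"):
    print(standardize_address(raw))   # -> "123 MAIN ST" each time
```

Once the values are standardized, a single query for ST finds all four variants instead of needing four separate queries.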
Data transformation, again, is the idea of automated changes applied to the data as it moves. I think of this as ETL, but an ETL focused not solely on moving the data, but in fact on transforming it. The identity matching and resolution component is the idea of working out where things are associated with other things across all of these sources — it's getting us toward the structural side. You can see there's a deterministic and a probabilistic component to matching; there's a small sketch of that distinction a little further on. I won't go through all of these in detail, but the references are here. Then there's the enhancement component, where we ask whether we can enrich the data. The data may come in just saying "42," but if I can tag that 42 with its context — the answer to life, the universe, and everything — we now have a much better idea of what it means, how we're supposed to use it, and where it should reside. And finally, reporting tools, which people have been using for years and years: show me all of the trades that took place at one yen in value. It's an absurd proposition — if you're in Japanese trading, that's way below penny stocks — but if you put checks like that in early, you will catch a lot of errors. And the really interesting thing about the Japanese firm in that story is that when I visited the securities firm a full ten years afterwards, they were still cognizant of the fact that their company's name had been associated with this incident, and they had really taken it to heart and changed the culture of the company. A magnificent story: yes, they put a new policy in place, but they really understand how to do this for the long run and have done a tremendous job ever since.

So, to start: the key is you really do want to promote data quality awareness. You want to understand the various requirements we've been putting in place. You really do want to figure out how to start using some tools, and you do not have to spend a lot of money on data quality tools — Excel doesn't cost much, and it's a very viable data quality tool, particularly for diagnosing the problems that are in there. You can start to identify the metrics and the business rules that go with them. You can start to test and validate specific data cleansing requirements. You can start to implement service levels informally within groups; you don't need to ask for permission — it's better to ask for forgiveness in this case. You can set up things that start to measure and monitor data quality. In fact, what I like to do with organizations is say: take a sign that reads "I am responsible for all data quality in my organization" and put it on your desk, or your cube, or at the bottom of your email signature if you work virtually. Sooner or later, somebody's going to come along and say, hey, you can't be responsible for data quality for this whole organization. And you say, well, if it's not me, then who? Then you pick up the sign and hand it to them: you can give this to somebody else, but it's a tough job, it requires a lot of work, and it's nontrivial in terms of its impact on the bottom line. If that's the case, here's your sign. Now, what do you do with that?
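Here is the promised sketch of the deterministic-versus-probabilistic distinction in identity matching. It is my own illustration, with made-up records and an arbitrary similarity threshold, not the algorithm of any specific matching tool:

```python
# Illustrative sketch only: deterministic vs. probabilistic identity matching.
# Deterministic matching compares a shared key exactly; probabilistic matching
# scores similarity and applies a threshold. Records and threshold are made up.
from difflib import SequenceMatcher

crm_record = {"source": "CRM", "customer_id": "C-1001", "name": "John Smith"}
billing_record = {"source": "Billing", "customer_id": "B-7733", "name": "Jon Smith"}

# Deterministic: the keys either match or they don't. Here they don't,
# because each system assigns its own identifier.
deterministic_match = crm_record["customer_id"] == billing_record["customer_id"]

# Probabilistic: score how similar the names are and accept above a threshold.
score = SequenceMatcher(None,
                        crm_record["name"].lower(),
                        billing_record["name"].lower()).ratio()
probabilistic_match = score >= 0.85

print(f"deterministic: {deterministic_match}")
print(f"name similarity: {score:.2f} -> probabilistic match: {probabilistic_match}")
```

The point of the sketch is simply that when your customer data is spread across systems with different keys, exact-key matching alone will miss records that a scored, threshold-based comparison can still resolve to the same identity.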
So, back to where we started at the top of the hour: data quality engineering is about applying an engineering discipline to the process of improving our organization's data quality. And now we get to the part of the program that I really enjoy, which is questions and answers. Megan, back over to you.

Thank you. That was a great presentation. It's time for Q&A — time for you to ask your questions. Just click on the Q&A feature at the top of your screen; you can submit your questions through that window, and we've had a few come in already, so we can jump right in. The first question is: how would you counter Mr. Scott Ambler's contention that we data people are nothing but obstacles to rolling out application systems as quickly as agile programmers want to?

First of all, I don't think they're quoting Scott correctly on that. Scott has presented at our conference in the past. He's a wonderful speaker, and if you haven't had a chance to see him, I do recommend it; I think he understands data a lot more than the question implies. But he is correct about one aspect of things: if you simply sit around and whine about what's wrong instead of actually trying to do something about it, that will not work. I'm literally just across the river from the Pentagon, where I worked for a number of years on data quality problems for the Department of Defense, and we were able to solve things. I've never met a group that couldn't go in and make some impact on data quality in these areas. So I don't think Scott actually characterizes it that way, although I do understand how, in the middle of an agile process — and agile is a good process that works very well — it can look that way. But if you have to develop a new database in the middle of that, the only possible outcome is that you will end up with yet another pile of redundant, obsolete, or trivial data in your organization. The average organization has its customer data spread across 17 different databases — that's the average organization. So while Scott is, I think, correct to focus on the good things about agile, and also to say that we need to do more than just whine about the problem, he has actually done some very deep thinking around this. I don't think his position is quite what the questioner suggests, but I understand where the question comes from. And we data people do have a reputation for complaining without actually doing. The groups I've seen succeed take those complaints and turn around and do something about them, and when they do, it becomes the bottom-up approach to their data quality efforts, and they can in fact achieve some very significant results. Scott has written extensively on this subject and has a really nice website, so if you want to carry on that particular dialogue, give him a ring; I'm sure he'd be glad to respond. He's a terrific individual.

The next question asks what a CRC is. It's a cyclic redundancy check — thank you for catching that; we do use too many TLAs, that is, three-letter acronyms, in these talks. A CRC check tells you whether a batch of data came through correctly or not.
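As a quick aside before the explanation continues, here is a minimal sketch of the idea using Python's standard-library zlib.crc32 — an illustration only, not how any particular network stack actually implements it:

```python
# Illustrative sketch only: a CRC detects that a block of data was corrupted
# in transit. The sender ships the block plus its CRC; the receiver recomputes
# the CRC and compares. Real protocols do this in the network stack or hardware.
import zlib

block = b"batch of trade records ..."
sent_crc = zlib.crc32(block)

# Simulate a transmission error by flipping the bits of one byte.
corrupted = bytearray(block)
corrupted[5] ^= 0xFF

received_crc = zlib.crc32(bytes(corrupted))
if received_crc != sent_crc:
    print("CRC mismatch -- ask the sender to retransmit the block")
else:
    print("block arrived intact")
```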
So if a frame comes in and a transmission error has introduced some noise into it, the way the TCP/IP protocol works on the Internet is that the receiver recognizes the problem because the CRC check doesn't match. It says, I didn't get a good block of data, and that block of data gets retransmitted. That's not the only use for CRCs, but that's how it works on the Internet. And it actually mirrors the way the physical world works in many, many ways; there's a terrific book by Dawkins, The Selfish Gene, where he gets into this kind of error-checking — a fascinating subject, but it would get too esoteric for us here.

The next question is: how has the situation with the NSA had an impact on the way people think about data governance? There are a couple of different angles to that question, and it's a really good one. Of course, talking here as I am in Washington, it's a question that's on a lot of people's minds, so let me give two perspectives. First of all, the word metadata has now come into common use. We used to say, oh, don't ever talk to managers about metadata. But when the President of the United States holds press conference after press conference talking about the fact that they're only gathering the metadata — they're not actually listening to your phone calls — it starts to enter the vernacular. (I don't know anything about what the NSA is actually doing; let me make that perfectly clear.) People start to understand it, and it becomes more of a popular topic. So I think that, in general, the awareness of data as an asset has been helped a great deal.

Let me talk about another aspect of this whole affair that many of you may not be aware of. Glenn Greenwald and some of the others who dealt with the source have talked about the way he established his credibility. You can imagine getting a call out of the blue from a guy who says, hey, I've got some secret material from inside the federal government that they don't want you to know about, and I'd like to give it to you and become a whistle-blower. They said he had given them flash drives that were unbelievably well organized. There was an absolute taxonomy to the information, and he had carefully — I'm going to use the word curated, because it's absolutely the appropriate term — curated what he had. He put together probably one of the best taxonomies of what was going on in the intelligence community; frankly, I would pay money to have it just so I don't have to recreate it the next time I work in that space. And that's how he established his credibility. Greenwald said that when you went down into a directory, it would say this is about subject X, and subject X has three components — X1, X2, and X3 — and when you go down into X1, there are seven components of X1, all laid out methodically and carefully from a metadata perspective. Greenwald said there was absolutely no question, first, that this guy had the information he was claiming to have, and second, that it was really significant material. As we all know, a great deal has followed from that.

Now let's take another part of it, though: if the NSA is monitoring phone-call metadata — and they've pretty much admitted to doing some version of that — then they can see that I placed a phone call to Megan.
You know, in another context that could turn into something hazardous to Megan's health — Megan, I'm certainly not threatening you, please don't take it that way — but you know what I'm talking about: somebody decides they want to track me through my phone, sees who I've been calling, and that could become a problem for those people. So, again, it opens up whole other worlds of possibilities.

I'll say one more thing about smartphones, too: they can actually identify the gait we use as we walk. Our gait is as unique as our fingerprints, our voiceprints, and our retina scans. This is really part of the basis for wearable computing. Everybody keeps talking about the iWatch and things like that, which are good things, and we look forward to the technology developments to come. But more importantly, when Megan can have something that says, I know this device is on Peter's wrist, authentication becomes a much easier task, and a much less onerous one. We won't need the 64-character passwords we have to put in place now — and, of course, a lot of what passes for security right now is defeatable anyway.

And by the way, data quality has huge security implications. If your customer data is in 17 places, how would Target secure it? I'm not saying Target had their data in 17 places, but they clearly missed the fact that their main customer database — an awful lot of data on their best customers — was leaking. It would be kind of like parking the car at night with a full gas tank and finding it empty the next morning. Of course, it wasn't really a matter of the whole database being emptied, and it wasn't all of their crucial customer data, but it was still a very, very large volume of data that leaked out — through the air conditioning ducts, if you will — at Target. I'm really sorry to see that their CIO took the hit on that one. Somebody, you know, had to take the fall, but I don't know that the CIO was the right person to blame in that particular instance. But we're getting off topic — what's the next question, Megan?

I had someone asking who the author was that you were talking about from the Harvard Business School. Yes, absolutely: Tom Redman is a friend and colleague. You can see his profile on our website, datablueprint.com. Tom has been a wonderful friend for many years and was actually inspirational in getting people to think about this. The original book was Levitin and Redman in the 1990s; if you Google that, it will come up. But Tom has a much better book on the topic called Data Driven, and in fact, if I remember right, a couple of weeks ago Tom gave a really nice talk on this same webinar channel. He's a great speaker.

The next question is: how do you reconcile an agile development approach with the need for data design and analysis, which has its own lifecycle and, in most cases, is not agile? Oh, I'm glad somebody asked that question. We've been perpetuating a fiction in the academic community for many, many years, and the fiction is simply that data and software can be developed at the same time. It's absolutely incorrect. It can't be done, and it never should have been attempted in the first place. The only time it can work is when your data does not cross the boundaries of the system being built. So let me switch back to the presentation and pull up another chart to illustrate this.
I hope I can do this fast enough that you guys don't get bored in the process, and we'll be sure to include this slide in the copy we send out. So hold on while I disturb you by flipping through a thousand slides here real quickly... let me find the one I'm looking for. Getting close. There we go. So the question, of course, is how do we reconcile the fact that we teach everybody in college and university who takes IT courses that data and process can be developed at the same time, using the same cycle? It is wrong. It shouldn't be taught, and it never worked in the first place. The only reason it was written that way is that somebody grabbed an article that was incorrect and reproduced it, and that happens a lot in the academic world; it's a problem. When we develop systems, we are fundamentally engaged in an activity that creates things: we identify a business gap — a capability the business doesn't have that we'd like the organization to have — and we create that capability. That is the fundamental focus of systems development. Data doesn't work that way. Data exists. Data evolves. Data is measured in years; systems development activities are measured in two-week sprints in some organizations. They are fundamentally incompatible. The only possible result of simultaneous development of data and software — unless the data doesn't cross the system boundary — is that you will create more disparate pools of data. Data evolution is separate from, external to, and must precede system development life cycle activities. If you do anything different, you will end up with a mess, and that's in fact what we've been doing; look at the mess we have. So I'm going to grab that slide and, since the question came up, stick it in the appendix of these slides so you can have a copy of it. I hope that answers the question. It's a little bit strong, but we do need to recognize that if we continue to teach people the wrong things, they will continue to go out and do what we teach, which in this case turns out to be absolutely wrong. There's some other reference material in here as well, by the way, just so you all know: recommended readings, and the full set of data quality dimensions. So please use this deck as a reference text — data quality dimensions, data value, representation, model quality, architecture quality, et cetera.

Next question. The next question is: what would you say is the largest roadblock to data quality? An interesting question. I think the largest roadblock is the fundamental perception that IT is responsible for data quality — and it really isn't. Data quality is not an IT function; it's a business function. IT is not capable of, or positioned for, demonstrating the business value that's lost when the data is of poor quality. IT doesn't know how to do this because we've taught them incorrectly for years and years. And finally, IT insists that all of its work be managed as projects. Data is not a project. Data persists. Projects have a beginning and an end. Data, by definition, has a lifecycle, as you've seen, but not a beginning and an end — certainly not one that fits within a development context. So we have the business thinking that IT is taking care of the data, and IT saying, I've loaded it onto the servers, my job is done.
That's going to continue to be a major, fundamental hurdle, because the business then turns around to the CIOs, who are already slammed with tons of other things to do, and tells them, by the way, you need to fix data quality. But they don't have the knowledge, skills, and abilities to do it, and the engineering discipline and the architecture are, in many cases, not understood at the data level, where they need to be in order to make a difference. I have a strong opinion on that, as you can tell; feel free to disagree with me.

Great. And the next question: is data quality really necessary in an integration project? Well, if you're integrating bricks, no. But integrating systems generally means integrating their associated data, and while it's relatively easy to connect systems from a technical perspective, the data by definition must be correct at its most granular level. If you move a number into a text field, no problem; but if you move text into a number field, it doesn't work. If you remember the old S0C4 or S0C7 abends we used to get when the mainframes blew up — that hasn't changed. If you put the wrong type of data in the wrong type of field, it doesn't work. Unexpected things happen; abnormal ends occur. And when that occurs, it's a big problem for everybody. Next question.

Okay, we're actually out of questions for today, with 15 minutes to spare. We knocked them out quickly. Yeah, we ran right through them. Thank you, everyone, for participating in today's event; we hope you've enjoyed it. Thanks again to DataVersity for hosting us. Once again, you will receive today's materials within the next two business days. Our webinar next month will be Data Architecture Requirements; hopefully you'll be able to join us for that as well. As always, feel free to contact us if you have any questions. Thanks, everyone, and have a great day.

Well, let's not forget, too, that a bunch of us are going to be meeting in Austin at the end of this month at Enterprise Data World. We'll be there with our business partner, and I'll be talking about developing a data-centric strategy and roadmap, along with — gosh, how many other presentations are there? Probably a hundred. It's one of those conferences where you never feel like you have enough time to see everything, so what most organizations do is send a couple of people so they can at least multi-track. Again, it's our big conference of the year, in Austin, coming up next month. We hope to see everybody there. Thanks for listening.