My name is Shannon, and I'm an editor for DATAVERSITY. We would like to thank you for joining today's DATAVERSITY webinar, Unlocking Business Value through Data Quality Engineering. This is the June edition of a monthly series called Data Ed Online with Dr. Peter Aiken, brought to you in partnership with Data Blueprint. Now we have Eileen Berkowitz, the webinar organizer from Data Blueprint, to introduce our speaker for today's webinar. Eileen.

Great. Thank you so much, Shannon. Hi, welcome, everyone. We're so excited, as Shannon mentioned, that you found the time to join us today. Thank you, as always, to Shannon and DATAVERSITY for hosting our webinar series every month. We'll start in just a few moments, after I introduce your speaker and go over some housekeeping items so you know what to expect. As usual, we're planning one hour for the presentation, and then you will have 30 minutes to ask your questions and make general comments. We'll try to answer as many questions as time allows at the end. Think of questions as you watch Peter present; feel free to submit them through the chat as you think of them throughout the session, and I'll keep track of them. To answer the questions we get asked most often: you will receive an email with a link to download today's materials, along with any other information you request, within the next two business days, and the session is being recorded, so you'll get everything from today's session. You can also find us on Twitter and Facebook if you have any questions. We use the hashtag #DataEd on Twitter, so if you're logged on, feel free to use it in your tweets and to submit any questions or comments that way.

Now, let me introduce our speaker. Many of you probably already know of, or have met, Dr. Peter Aiken. He's an internationally recognized thought leader in the data management field, and he speaks and presents at conferences nationally and worldwide.
He has more than 30 years of experience and has received many awards for his outstanding contributions. He is also the founding director of our company, Data Blueprint, as well as the current president of DAMA International. To date, Peter has written eight books and many articles. The most recent book is just hot off the presses; it was released in April, and its topic is the need for a Chief Data Officer, a very interesting discussion going on right now. The best way to get yourself a copy is through DATAVERSITY's new bookstore; you'll receive more information on how to do that in the follow-up emails. Back to Peter: he has experience with more than 500 data management practices in 20 countries and is consistently named one of the top 10 data management experts in the world. He has spent more than 10 years working with organizations such as the US Department of Defense, Deutsche Bank, Nokia, Wells Fargo, the Commonwealth of Virginia, and many more. He is often requested for conferences and workshops and is always traveling to numerous speaking engagements. So, Peter, usually we have to ask you where you are, but I know where you are today.

That's right, we're very happy to be in the new office space for Data Blueprint. We've doubled our square footage, so I guess that's a good thing coming out of the recession. Welcome, everybody. Today's topic: Unlocking Business Value through Data Quality Engineering. As always, I want to acknowledge the Virginia Commonwealth University School of Business, where I'm a professor as well. I'll go right into the material and take a look at what we're going to do today. First of all, I always start these seminars with a data management overview to give you some context on where we fit in. Then we're going to talk about data quality engineering definitions, and I'll give you a very specific example. We'll get to the data quality engineering cycle and cover some of the complications we have with data quality.
We'll then look at data quality causes and dimensions, then quality and the data life cycle, and we'll finish up with a quick romp through some data quality tools. Then we'll close out the hour with some takeaways and, as I said, the part we look forward to the most, which is opening things up and talking with you about your questions, which are always a lot of fun for us. Now, we start out with this very complicated diagram; if anybody ever really wants me to go through it sometime, I'd be very happy to do it. This is the non-management version of what we mean by data management, and I'll simplify it for us into five specific, integrated data management practice areas. The first one is data program coordination. That's the idea that we need to sing from the same sheet of music, because when we look at organization after organization, we find heroics: people down in the bowels of IT who are trying to do data management, but who don't know that somebody else is trying to do the same thing, and so their efforts are simply not coordinated. Of course, if they're both working toward the same goal, coordinated efforts would be more efficient. The second data management practice area is sharing data across organizational boundaries. Your organization is very likely doing data transfer from one part of the organization to another, from one program to another, or between your organization and a partner organization. We want to make those things as efficient and effective as possible, because if we don't, first of all, the process becomes much more difficult, and second, it just isn't a good expenditure of organizational resources. The third area of practice in data management is stewardship. This is assigning responsibility for the data.
If no one is personally responsible, if it is not Peter's responsibility to fix the customer data, then it's everybody's responsibility, which means nothing happens. That's been the case in many areas. You'll see data stewardship as a growing area now, and we have a session on that topic later in the year. Development is the fourth area. This is the ability to engineer data delivery systems; engineering and architecture come up time and time again in these areas, but they are woefully missing from our educational offerings. And the fifth area is data support operations. This is the idea of maintaining the availability of data within these organizations. If you were lucky in IT, or more importantly if your boss or your boss's boss was lucky, they may have had a single course in data development, but most people do not have any education in these areas at all. So those five areas are a basis, a foundation, for what we want to do with data, which is leverage it. Now, I always use this picture of Maslow's hierarchy, because data works very much the same way. You remember Maslow from high school: the idea that if your food, clothing, and shelter needs are unmet, you're unlikely to sit down at night and create the great American novel. Without a good foundation, it's unlikely that the rest of these things will happen. And our five data management practice areas are critical to doing this. What everybody learns about data is what gets written about in the press: advanced, or self-actualizing, data management practices, whether they are cloud, master data management, data mining, analytics, warehousing, big data, it doesn't matter. If you do those things in the green area of the triangle without first providing a good foundation of data management practices, they will take longer, cost more, deliver less, and present greater risk to the organization.
All of these, of course, are wrapped up in our data management body of knowledge. As Eileen mentioned, I am the president of DAMA International, and this is our definition of what we mean by data management. DAMA is an international organization, and these areas were developed by a group of dedicated volunteers who really made phenomenal efforts. In addition to just knowing what the body of knowledge is, we can also talk about becoming certified as a data management professional. I've got a couple of links up there on the slide. We would like you to join the growing list of about 1,200 professionals worldwide who are now certified as data management professionals. And more important than the number of certifications, we're also seeing a lot of job postings come along that state CDMP preferred, so we know we're making a good impact there. Those of you who are familiar with the DMBOK understand that each of its chapters has a context diagram, and that's what we're looking at on the screen: inputs, activities or processes, and outputs on the other side, to define the scope of what we mean by data quality engineering. We can also throw our five data management practice areas into the middle there, with the same kinds of results. Let's dive into the material now: data quality definitions. First of all, I'm going to describe a model of what we mean by data. Data, of course, is a raw fact, and useless unless we pair it with a meaning. If I give you the number 42 and tell you that was my age 12 years ago, you have a meaning that you can pair with that particular fact. That is not really information yet, though, until we understand it in the context of a request that an organization might make. And what we really want to do, with those self-actualizing things we talked about a minute ago, is to get to what we call intelligence, which, depending on the decade, we have also called knowledge and wisdom.
They're interesting terms; which one we use really just depends on the decade. But the idea is that data, information, and intelligence need to be built carefully, with an architecture and an engineering discipline, and that if we don't have a good foundation, everything else is built on a foundation of sand. So, a big definition of data quality. Martin Kepler was the first I heard use the term fit for use: it meets the requirements of the authors, the users, and the administrators. I have a quick little Popeye story on this as well, because many people are familiar with the character Popeye from the cartoons, always eating his spinach. As I was doing a little research for this, I found out that the reason people thought spinach was so important was a data quality error. It turned out that the researcher who was studying the nutritional value of spinach, iron in particular, made a data error and was off by a factor of ten, an order of magnitude. So for years and years, including the whole development of Popeye, people thought that spinach was especially good for you. Spinach is good for you, but it's no more good for you than kale or any other leafy green out there. A little data quality story. We also have to understand that data quality is synonymous with information quality, because of the slide I showed you before and the very key relationship between data and information. They are synonymous: if we have poor data quality, we will have inaccurate information and, more importantly, poor business performance. Talking about data quality management, you can see the definitions we've got up, which cover planning, implementation, and control activities, as well as the establishment of roles and responsibilities from a personnel perspective.
The real key pieces here are that there's a change management component relative to data quality, as in which version of the truth am I looking at at this point in time, and that it's a continuous process of defining acceptable levels of data quality to meet business needs. Most people don't understand this. They think it's something that can be fixed once, whereas it's not; it's a continuous, ongoing process for organizations. So let's move to data quality engineering. This is the recognition that data quality solutions can't simply be managed; they must be engineered. If we try to merely manage data quality, it gets away from us very, very quickly. Engineering is the application of scientific, economic, social, and practical knowledge in order to design, build, and maintain solutions to our various data quality challenges. Engineering concepts, however, are generally not well known or practiced by IT or the business. So let me give you a very specific example. This is a customer we worked with a couple of years back that had catalog information with lots and lots, millions literally, of SKUs. A SKU is a stock keeping unit or, if you want to look at it in another context, an NSN, a national stock number, maintained in a catalog. The key for this information, and other data about the data, was stored in clear-text comment fields. That is not a very useful, and certainly not a very accessible, way of doing things. The approach that people suggested for cleaning these was a manual approach. And while that was a fine approach, it did not apply engineering techniques. The question was: how much could we do with an engineering approach, and how much should be left to a manual approach? The manual approach also left the data structuring problem relatively unsolved. So the solution was to develop a proprietary, improvable text extraction process. We would convert the non-tabular data into tabular data.
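A minimal sketch of that kind of rule-based extraction, turning free-text comment fields into tabular rows, might look like the following. The field names and regex patterns here are hypothetical illustrations, not the client's actual rules:

```python
import re

# Hypothetical extraction rules: each maps a named field to a regex
# pattern that might appear inside a free-text comment field.
RULES = {
    "nsn":    re.compile(r"\bNSN[:\s]*(\d{4}-\d{2}-\d{3}-\d{4})"),
    "qty":    re.compile(r"\bQTY[:\s]*(\d+)\b", re.IGNORECASE),
    "vendor": re.compile(r"\bVENDOR[:\s]*([A-Z][A-Za-z ]+)"),
}

def extract(comment: str) -> dict:
    """Convert one non-tabular comment field into a tabular row.
    Fields the rules cannot find come back as None."""
    row = {}
    for field, pattern in RULES.items():
        m = pattern.search(comment)
        row[field] = m.group(1).strip() if m else None
    return row

def match_rate(comments: list[str]) -> float:
    """Fraction of records where every field was extracted; this is
    the number you would watch week over week for diminishing returns."""
    full = sum(1 for c in comments if all(extract(c).values()))
    return full / len(comments) if comments else 0.0
```

The idea is that each week you refine the rule set, rerun the match rate over the catalog, and stop when the improvement no longer justifies the fixed weekly engineering cost.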
And by the way, those are the preferred terms to unstructured and structured data. When you hear people talking about structured and unstructured: if data were truly unstructured, you couldn't structure it; that's the definition of unstructured. So I don't like those terms. Here's the business value: we were able to save this organization $5 million, and I'll show you why. More importantly, though, it was the first time I've ever worked on a project where we saved a person-century. A person-century is kind of an interesting concept: you hear of person-weeks and person-months and person-years; we got some person-centuries out of this one. So let me put up some numbers, and I'll show you the engineering approach. Now, the key for this process was to determine when we reached diminishing returns with the algorithms we were putting in place. The first week we did this, we didn't do very well; we didn't really match anything. That's what I'm showing here. But by the fourth week, we had actually matched 50% of the problem, 55% in fact. We were also able to improve the number we could ignore. In other words, of all the millions of items we were looking at, by the fourth week we had determined that we could ignore 12% of them. That was pretty good news, too, because we didn't have to spend any effort cleaning those items. And the number of unmatched items varied a little as we worked with our algorithm. So we've got value. That's great. When do we stop? Now, the interesting part about this is that we were holding certain costs fixed, and that was the weekly cost of software engineering, and data engineering I should add as well, that we put into this each week as a fixed cost on this particular project. Notice that the unmatched items by week 14 had dropped down to 9%, and in fact by week 18 we'd dropped down to 7.5%. Looking at that, we were seeing that they were getting lower.
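The back-of-the-envelope value calculation behind the person-century numbers is easy to reproduce. The inputs below (item count, minutes per item, salary, working hours) are illustrative stand-ins, not the figures from the slide:

```python
def cleanup_value(items: int, minutes_per_item: float,
                  annual_salary: float, hours_per_year: float = 2000) -> tuple:
    """Translate 'N items x M minutes each' into person-years and dollars."""
    hours = items * minutes_per_item / 60
    person_years = hours / hours_per_year
    dollars = person_years * annual_salary
    return person_years, dollars

# Illustrative inputs: ~2.2 million items avoided, 5 minutes each,
# and a $54k/year subject-matter expert (all assumed values).
py, usd = cleanup_value(2_220_000, 5, 54_000)
```

Note how linear the arithmetic is: doubling the minutes per item doubles both the person-years and the dollars, which is exactly how a five-minute estimate that yields about one person-century becomes two person-centuries at ten minutes.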
And the question was how much lower they would go. Ignorable items pretty much got to 22.62%; we could ignore those entirely, meaning more than one-fifth of the problem could be completely discarded. Our improvable items, the number we were matching, went from 68% to 69%. So you can see we were clearly approaching a point of diminishing returns. In fact, the original problem space had shrunk to where we were able to match 70% of the items and ignore 22.5%, which left only 7.46% of the original problem. Now, let's look at this from a quantitative perspective. We took the number of NSNs and allotted five minutes to cleanse and review each one. We multiplied that by the number of work weeks in a year, the hours in a day, and the rest of the factors we put in there, and by the average salary for a SME. And that's where we got our $5 million in savings. The person-years, which you can see a little above that, in the fifth or sixth line from the bottom, came to 92.6 person-years. Of course, one of the important figures on this chart, where we did a little bit of sensitivity analysis, was that five minutes to cleanse and review. I don't know about you all, but you usually can't fix a problem in five minutes; it doesn't work that way. Consequently, if we double that to ten minutes, we end up with two person-centuries and $10 million; at 15 minutes, three person-centuries and $15 million. So there is some very tangible value here from applying a combination of engineering and automation to a very, very intractable problem. Oh, and one of the things we see here is that error we were looking for. Look at that, right there on that slide. All right, we'll make a note of that: slide 20. Now, we see these things out there when you look around on the web. You've got six misconceptions about data quality: You can fix it. Data quality is an IT problem. Data quality is a source entry or data entry problem.
A data warehouse will give you a single version of the truth. A new system will give you a single version of the truth. And standardization will fix all of these things. Those are wonderful things to think about, but they are much more aspirational than reality-based. In reality, it's much more like the story of the blind people and the elephant. You've all heard this story: the first gets up to the elephant, feels the broad side, and says, oh, the elephant is like a wall. The second one feels a tusk and says, no, no, it's like a spear. The third one takes the trunk in his hands and says, no, no, it's like a snake. And the fourth reaches out, feels the elephant's leg, and says, oh my gosh, it's like a tree. Everybody has their own perspective, and yes, this is true about data quality too. Most organizations approach data quality the same way the blind people approach the elephant: they only tend to see the data that's in front of them, relative to one particular process. And that is a problem. There's little cooperation across boundaries; it leads to confusion and disputes. I was on a project one time where we literally had to wait for an individual to retire; that's a story I'll have to tell you another time, but it literally was a multi-million-dollar decision for that organization. So our solution is that data quality engineering can help us achieve a more complete picture and facilitate cross-boundary communication. Now, maybe you know that we do some polling questions, and after this slide I'm going to ask you all a question, because we're interested in how things are going; we've been taking these measurements over the past couple of years. But before we get there, let me give you a structured definition of data quality engineering: it allows the form of the problem to guide the form of the solution.
It provides a means of decomposing the problem and a variety of tools for understanding the system we're dealing with. It provides a set of strategies for evolving a solution and criteria for evaluating the various solutions, and it facilitates development of a framework of organizational knowledge. So my question for you all is: does your organization address, or plan to formally address, data quality and information issues? We're giving you four choices, because as I said, we keep these polls running for a year, and we're trying to see which way the trend is going. So we'd love you to tell us: did you do it last year? Are you doing it this year? Are you planning to do it next year, or hoping to do it next year? We've been asking these questions for a number of years, and we always give exactly the same amount of time; it says 0.7 minutes to answer this question, because that's the optimal amount of time to get responses from you. We're getting responses now. I should remember to put the Jeopardy! theme music up on this one. Yep. Okay, everybody's hitting the button and nothing seems to be happening, right? But we're getting some good responses here: 19% of you did it last year, 32% are working on it this year, 6% hope to do it next year, and we didn't get any response at all from about a third of you. So thank you for your participation, and let's move on with the presentation. We're going to get to the data quality engineering cycle and some of the expectations that go with it. And I'll give you another story here. I was in Japan last year, and I went to visit a company called Mizuho Securities, a very fine company, but they had an error that became infamous, and I'll tell you the story. They had a trader who wanted to sell one share of a company called J-Com for about 600,000 yen; that was about $5,000. It's a fairly simple transaction on a good day.
Unfortunately, this trader sold 600,000 shares for one yen each, a classic fat-finger error. This resulted in a $350 million loss, and when I visited these folks, even though this occurred years ago, back in '05, they were still literally ashamed that this had occurred, that it had been allowed to occur in their organization. The reason it occurred was that the in-house system for doing trades did not have any limit checking. Clearly, shares there are sold for fairly large amounts of money, and one yen should not be allowed as a price; it's just too low a price for anything. Similarly, the Tokyo Stock Exchange, where the trade went out, didn't have limit checking either, so the order went through, making this a very, very noted error in the industry. And because of the publicity, everybody went around and improved their systems, their data quality engineering, as a result, so that they could hopefully avoid this type of problem in the future. But we can't be totally proactive about it, and again I see a lot of things out there on the web to read, like four ways of making your data sparkle: prioritize the task, involve the data owner, keep future data clean, and align your staff with the business. Well, great advice, but it's kind of one of those tastes-great, less-filling kinds of things. There is a data quality engineering cycle, and let's walk through it very briefly. It's based on the Deming cycle: Plan, Do, Study, Act, or Plan, Do, Check, Act. We're going to identify and define the business rules that are crucial in these areas. Let's take a look specifically. In planning, we're trying to say: what's going on, what's the cost and impact, and what are the various alternatives we're going to have to look at as we go through this? Then we need a process for measuring and improving the quality of the data.
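That Plan-Do-Study-Act loop can be sketched as a small skeleton for data quality monitoring. The business rule and the target here are hypothetical stand-ins, not anything from the talk's slides:

```python
MIN_PRICE = 100  # hypothetical business rule: no trade price below 100 yen

def valid(record):
    """The 'Do' step's business rule, applied record by record."""
    return record["price"] >= MIN_PRICE

def pdsa_pass(records, target_error_rate=0.0):
    """One pass of Plan-Do-Study-Act for data quality.
    Plan is the target_error_rate you walked in with; Do applies
    the rule; Study measures the failure rate; Act quarantines
    failures for root-cause analysis instead of silently fixing them."""
    failures = [r for r in records if not valid(r)]                    # Do
    error_rate = len(failures) / len(records)                          # Study
    quarantined = failures if error_rate > target_error_rate else []   # Act
    clean = [r for r in records if valid(r)]
    return error_rate, clean, quarantined
```

Quarantining rather than auto-correcting is a deliberate choice here: the Act step is where you decide whether the fix belongs in the data, the rule, or the upstream process.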
This is where the cycle becomes so critical. Now, you might start out by doing inspections and monitoring, and fixing things as you find them. The monitoring portion says: we've got some business rules we're going to have to deal with. So in the Mizuho Securities example, we clearly would not have allowed anybody to sell a share of stock for one yen. That's like a penny stock, and on a real exchange, penny stocks don't trade there. The fourth piece is acting: now that we've looked at this and figured out what happened, how did that work, and what do we do next? Again, this is very straightforward to put together, but let's add in another complication, which is that much of what's been written in this area says things like you saw on that misconceptions slide earlier: we won't have fixed our data unless we fix all of the data. Well, my friend Edmund likes to think of data as comprising a lake. And while it's okay to clean the data in the lake, that's terrific, do you really need all of your data 100% perfect? And if you clean the lake, where's the actual source of the pollution? If it's upstream, whether it's upstream systems or something else happening externally to your organization, you probably need to fix that as well, or you'll spend all of your time cleaning the data in the lake over and over and over again. So our old friend Pareto, who gives us the 80-20 rule, tells us that not all data is of equal importance. We have to add in, again, that scientific, economic, social, and practical knowledge in order to figure this out. Now, we were at another company at one point where we were actually able to stop production, because we discovered that this particular data quality error they were dealing with was important enough for the organization to in fact stop production.
So when there was a data quality error, the system would literally lock up, and they'd get a screen popped up in front of them that said: you need to fix this now before anything else happens, because many things downstream depend on the accuracy of this data. Again, that's not necessarily something that will work for every organization, so you need to evaluate these types of decisions, rather than simply reading them off a website and saying, sounds great, let's do it; you have to understand how this is going to work in context for you. Another development in this area is that data quality is now being seen as a very significant source of risk by certified risk professionals, people whose job it is to manage risk from an organizational perspective. Here are some notes from a project we were on at one point, where they said the [organization] has four different databases, each of which may contain the same customer multiple times with active open balances. They rated the risk of the data conversion very high. So data risk management professionals are now starting to see this as a major area, and it will become more and more important going forward. Let's look at our next topic, which is data quality causes and dimensions, an area most people are not really familiar with. There are two distinct sets of activities that support data quality: data quality best practices depend on both practice-oriented activities and structure-oriented activities, and we'll talk about those for a minute, because if we don't address both of these, we will not achieve data quality. If you're in an organization, or planning to be in an organization, that is looking at this, make certain that whoever you have working on it has in fact taken on both of these activities, because if they don't, you will only fix part of your problem.
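The kind of limit checking missing in the Mizuho story, combined with the hard-stop behavior from the production example, can be sketched like this. The limit values are hypothetical, not the exchange's real reference data:

```python
# Hypothetical order limits; a real exchange would derive these
# per security from reference data, not hard-code them.
MIN_PRICE_YEN = 1_000     # a one-yen price would fail this check
MAX_SHARES = 100_000      # a 600,000-share order would fail this one

class HardStop(Exception):
    """Raised to block the transaction until someone fixes the data."""

def validate_order(shares: int, price_yen: int) -> None:
    """Limit checking: refuse implausible orders before they execute."""
    if price_yen < MIN_PRICE_YEN:
        raise HardStop(f"price {price_yen} yen is below limit {MIN_PRICE_YEN}")
    if shares > MAX_SHARES:
        raise HardStop(f"{shares} shares exceeds limit {MAX_SHARES}")

# Intended trade: 1 share at 600,000 yen -- passes.
validate_order(1, 600_000)
# Fat-fingered trade: 600,000 shares at 1 yen -- would raise HardStop.
```

Notice the check costs a few comparisons per order; the absence of it cost $350 million.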
Fixing only the data quality without correcting the upstream problem gives you exactly that. So, our practice-oriented causes: these are failures in capturing and manipulating data. They may involve things like edit masking, range checking, or CRC checking on transmitted data; if you're not familiar with these, we can certainly come back to them in the Q&A session. They affect data value quality and data representation quality; we'll get into those pieces in just a minute. They show up as things like presenting data out of sequence or giving you imprecise data. They're diagnosed in a bottom-up manner, where we find the process in error, and we address them by looking at data handling governance: when people handle the data, we need to give them some idea of what's going on. The structure-oriented causes, however, occur because the data and the metadata have been arranged imperfectly. This means that even if you do a terrific job with your practice-oriented activities, if your structure is flawed, you will not be able to get data that is fit for use. These are cases where, for example, the data is in the system but we just can't get to it, where a correct data value is provided as the wrong response to a query, or where data is simply unavailable or inaccessible. The reason we have this challenge is that developers, when they're building, focus within system boundaries instead of within organizational boundaries or across boundaries. These structure-oriented causes affect data model quality and data architecture quality. At this point, we usually get a question like: well, I've got packages; don't the packages take care of this for me? On the last three projects we've been on at Data Blueprint, our customers have purchased packages, and we've asked them about this. I'll tell you one story very briefly. The customer said, well, we've got a new package coming.
It's going to run some of our laboratory software. And we said, well, why don't you ask the vendor for a copy of the data model of the software? Now, here's the interesting part. The answer we got back was: the data model's not stable, so we can't ship it to you. I hope you're laughing at this point, because that is not a good prognosis for the software. We left them with a very careful set of instructions that said: here are your business practices; here are the architectural components that are going to be affected by this software; you need to check this stuff out before you put it in place, because it's not going to allow you to do your business if it doesn't work. And guess what? It didn't work out so well. So we then went and worked with the vendor of the software and fixed the data problem for them. Another customer came along after that, and we were able to prevent that customer from going further down the same path. The activities here include things like this: when somebody puts a package on an RFP for you and says, we think we can solve a business problem with this particular software package, go to your purchasing people and say, hey, I think it's a good idea if these people gave us a data model as part of the evaluation package, so we can see whether it fits within our existing business practices and whether it complements what we're trying to do. Again, what we're doing here with data quality is looking at a combination of practice-related and structure-related pieces. The practice-related pieces affect the data values and the representation, and the structure-related pieces affect the quality of the models and the quality of the architecture. Now for a very specific example of this. Full disclaimer: SunTrust has been Data Blueprint's bank since its inception, 14 years now. We have a terrific relationship with these folks, and we really like working with them. We got an interesting letter from them one day that said:
We've given one of your employees a gift card. So we got on the telephone with this particular person and said, oh, okay, a gift card; what can we use the gift card for? Well, you can use it to buy anything at all. We said, oh, like what? Can we buy a car with it? At that point, the person looked it up and said, oh my goodness, did we really send you a gift card with a zero balance? And why? It was just a representation problem: the card supposedly carried a billion dollars, and it was an overflow problem. You can see what happened here. The bank didn't know they had made an error. Again, a limit-checking tool would have been very, very useful under these circumstances, but tools alone would not have prevented the problem. More importantly, any other customer besides Data Blueprint might have lost confidence in the ability of the bank to manage those particular funds. So we had a little bit of fun with them on that one, but it's not an unusual or completely unknown occurrence. Addressing these things requires a more holistic solution. Now let's move into the quality dimensions. I've already mentioned them briefly: data values, data representation, data model, and data architecture, again grouped into practice- and structure-oriented pieces. You can see from this chart that most quality engineering has been focused in the operational direction, toward practice-oriented imperfections, as opposed to looking at the structures. On the left-hand side of this chart we are closer to the user, which is what the user sees, the representation quality, but we also have to pay attention to the things that are closer to the architect: the architecture quality and the model quality.
Another piece of this diagram, too, is that one data architecture, that is, an organizational asset, spawns multiple data models, which are understood by specific individual developers. Whether you're building software yourself or buying it off the street, it still is implemented that way. Each model then governs one or more data values, which are the things maintained by the system, and each value can have multiple representations when presented to the user. So this is a much more complicated picture. It's not a ridiculously hard picture, but anybody that's trying to address just one aspect of this is missing out very, very significantly. And I'm going to end on that same picture here to show you the actual dimensions and attributes of quality at each stage. So this is the full set of data quality attributes here. I'm not going to walk through each of them, and you shouldn't either. What you should be doing is saying: in order to determine what is fit for use for our organization, we are going to pick the specific attributes that have utility for us and work with those. Because if we try to do everything and go for perfect data 100% of the time, it doesn't work. And again, the other challenge with this is that most vendors, most organizations, most approaches to data quality are focused on the left-hand side there. And if you're working on that left-hand side, on just the data representation quality, without fixing the values in the context of the model and the architecture, it's kind of like sitting at the bottom of Niagara Falls in that little boat on the right side of the slide. So data quality problems are an awful waste of time and money, and we have customer after customer that we've worked with that have had to deal with this. Of course, you could wait until the falls freeze over, but that's probably not a good idea either.
With global warming and everything else that goes on, it may be a while before we see Niagara Falls freeze over again. Now I'll take you to another interesting story. This is a story from New York City. They have approximately two-and-a-half million trees in New York City, and if I remember right there are only about 8 million people, so that's a fairly good person-to-tree ratio. And trees are kind of a problem in New York City: in the 11 months from '09 to '10, three people were killed or seriously injured in Central Park alone by falling trees or limbs. Now, there's an interesting belief held by the people of New York City and arborists everywhere, which is that pruning and maintaining trees can keep them healthier and make them more likely to withstand a storm, decreasing the likelihood of property damage, injuries, and deaths. Until recently, there was virtually no research or data to back that up. It's fine to have a theory, but it helps a whole lot more if you have some data. So they took a look at this and said, let's ask a question: does pruning the trees in a single year reduce the number of hazardous tree conditions in the following year? And interestingly enough, this brought in all aspects of our problem. They had the wrong architecture to solve this, because the data was at the wrong level of granularity. The pruning data that they had in New York City was recorded block by block, but the hazard data was only recorded at the actual address level. So in other words, I might call them from 501 Park Avenue and say I have a tree down in front, and they need to come get it. And oh, by the way, trees don't grow with unique identifiers, so that's a problem as well. So after downloading, cleaning, analyzing, and intensive modeling, they found that pruning trees for certain types of hazards caused a 22% reduction in the number of times the department had to send a crew for emergency cleanup.
So there's some evidence that pruning the trees actually does result in significant savings of life and property, as well as costs, for the people of New York City. Now, the best data analysis generates further questions. And since New York can't prune every block every year, that's just an impossibility, they're now working on building what's called a block risk profile, looking at the number of trees, the types of trees, and whether the block is in a flood zone or a storm zone, in order to address this. So here you see examples of data architecture quality and model quality that get down to the value quality. I didn't find a representation quality issue in this particular example; sometimes it's hard to get them all in one. I hope you found that one useful. So let's look at data quality and the data quality life cycle. And I'm going to give you another bank example here, another good bank that we've been friends with for a while: Bank One, back a couple of years ago. Let's see what the trigger was for this one. Wow, look at this, an undated letter. Okay, interesting for starters. So I got this letter, and if you look carefully at your screen, you can see the letter is literally crumpled. And the reason it's crumpled is because when I got this letter, I looked at it and said, I'm not a Chase customer, and I threw it in the trash can. This letter had only a Chase logo on the envelope. It did not have the Bank One logo, and I was a Bank One customer. And if you look at the letter carefully, it's got some little text in there that says, please be on the lookout for any upcoming communications from either Chase or Bank One regarding your Bank One credit card and any other Bank One product you may have. In other words, I initially discarded the letter even though I was a Bank One customer. I became a little bit upset after reading it, and it did tell me that Chase had some data quality challenges.
So we're popping up another polling question here for you, but let me just talk about this for a minute. The goal here, apparently, was to use up all of the old Chase letterhead. Literally, I did take this letter and throw it in the trash can. Then I saw on the news later that week that Chase had sent these out, and I said, oh, my goodness, I think that letter was from Chase. And I went back to the trash can, smoothed the letter out, looked at it and said, oh, wow. And again, it says it twice in there: please continue to open your mail from either Chase or Bank One. Then at the very bottom you can see the P.S. as well: P.S. Be on the lookout for any upcoming communication from either Chase or Bank One regarding your Bank One credit card or any other Bank One product that you may have. In other words, they don't know their customers very well, and they were making me do work that they should be doing. If they bought my credit card, they should at least use the right logo so that I'm aware. Again, I see this as a data quality challenge. Some people like to disagree with me on this, and I'll be happy to take those on during the Q&A session, but let's go to the polling question. Does your organization utilize a structured or formal approach to information quality? The answers we're looking for here are: yes; they say they do, but they don't; or no. And again, we get some very interesting responses on this. So we'll give you the 0.7 minutes of optimal response time here, because we use data to manage these things, don't we, Eileen? That's right. I actually don't know that number, but I still like data, too. That's right; working at Data Diversity, how could she not? I'd start pulling out my iPhone, but we won't have time on this one; let's just see if I can find the Jeopardy theme music for a good time. Let's share the poll results with everyone. There we go. Okay. Yes, 17%; 22% say they do, but they don't, and thank you for that.
Obviously, there's no attribution here. 30% are not, and there's still about 30% not answering, so I wonder if we've got a little technical glitch going on there. We'll take a look at that as well. Now, moving on to the life cycle here. I mentioned Tom Redman before. Tom's been a great friend for many years, and this is just the evolution in our collective thinking. The first book that he published on data quality, in 1993, gave a very natural-feeling approach to what a life cycle looked like for data: we had data acquisition activities, we had storage, and then we had data usage activities. That's terrific for 1993, but here we are in 2013, so let's talk about how the data life cycle really works. We really need to start out with metadata creation, because we need a place to put the data; that's an acknowledgement that data architecture and models have to be used to create a structure for the data. This structure is then populated with data values in the storage location, in an activity phase called data creation, which stores the data values someplace where we need them to be. Data at that point can be utilized, so we might have a data utilization process. We might actually take some of those data values and manipulate them, and perhaps even re-store those data values. Those data values can also be assessed; we'll talk about that in just a minute. After we assess them, we can manipulate them further and refine the data values, which particularly focuses on the data value defects that occur. Finally, if we decide that there are structural defects, we need to go back up and refine the metadata. This is a more complicated data life cycle, and I'm going to show it one more time for you here so you can see the architecture refinements that have to occur, and the model quality; it gets a little bit more complicated as we go through this.
The model quality, the data value quality, the data representation quality: each of these phases requires a different approach to doing your data quality resolution. These present challenges that we need to pay attention to. So again, if you see somebody coming into your organization and simply trying to address one aspect of this, it cannot work. Now, people look at this chart and they really don't like it, it's kind of messy, so they said, make it a circle. So we did look at it and tried to make it a circle. You can notice there are two entry points to this particular circle. If you're in the upper left-hand corner, for new system development, you start out with your metadata creation. If you're in the bottom right-hand corner, that's the starting point for existing systems. It's a phased approach to looking at data quality and determining where you are and which tools you should use as you're moving through this. Now, just before we get to our third polling question: the way I get my clip art for these polling questions is I take the text of the question. The text of the question here is, do you use metadata tools, modeling tools, or profiling to support your information quality efforts? And I plug it into images.google.com, and that's the image that came up. I'm not sure what it is or who that is, but that's what came up. But we're interested here to see whether you're using these metadata tools, modeling tools, or profiling to support this, because again, we find most organizations are simply not aware of these tools, and if you don't know what you don't know, it becomes very, very difficult to do a complete and comprehensive job of paying attention to these things. Again, that 0.7 minutes to respond. Interesting fact about the Jeopardy theme: Merv Griffin said he made more money on that song than on anything else he did in his lifetime. Very, very interesting.
What do we have for results, Eileen? One second. There we go. Okay, we're seeing increases here. This is good. Again, some of you aren't answering, or we're not being clear about the questions, but more are saying yes, and that's terrific. We're glad to see that's happening. Over time, that number has been improving gradually. We'll keep doing this poll so that we can find out what's happening over time. Now let's look at some data quality approaches here. What's happening here is there are two different approaches, bottom up versus top down. They don't have to be treated as an absolute dichotomy, but it is important to at least consider whether you're going to start bottom up, which usually happens without management approval, or top down, which usually happens after something bad has occurred organizationally. Bottom up is the actual inspection of the data sets, trying to find specific issues based on automation, so things happen fast. Top down is when the business users are telling us they're unable to do the kinds of analyses that they need to do, and they need to understand how their business processes consume and produce data elements. This also helps us to identify which data elements are most crucial to business success. Once again, you want to start looking at specific measures. Most everybody calls these metrics; metrics is not a very useful word, so they're really measures. And again, look at the areas of critical business impact. Identify the specific data elements, and the create and update processes, associated with business impact. For those data elements, specify the quality dimensions or business rules that are relevant. Then describe a process for measuring conformance, and get to an acceptable threshold in each area.
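That last step, measuring conformance against a threshold, can be sketched in a few lines of Python. This is my own toy illustration, not from the slides; the zip-code rule, the sample records, and the 95% threshold are all hypothetical.

```python
# Toy records: two of the four zip values violate the business rule.
records = [
    {"customer_id": 1, "zip": "23220"},
    {"customer_id": 2, "zip": "2322"},   # too short: fails the rule
    {"customer_id": 3, "zip": "90210"},
    {"customer_id": 4, "zip": None},     # missing: fails the rule
]

def zip_is_valid(value):
    """Hypothetical business rule: a US zip code is exactly five digits."""
    return isinstance(value, str) and len(value) == 5 and value.isdigit()

def conformance(rows, field, rule):
    """Fraction of records whose field value satisfies the rule."""
    return sum(rule(r[field]) for r in rows) / len(rows)

THRESHOLD = 0.95  # acceptable level, set by the business

score = conformance(records, "zip", zip_is_valid)
print(f"zip conformance: {score:.0%}, acceptable: {score >= THRESHOLD}")
# → zip conformance: 50%, acceptable: False
```

The same `conformance` function can be pointed at any element and rule pair, which is what makes the measure-monitor-manage loop repeatable rather than a one-off cleanup.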
Doing these things leads you to the ability to set and evaluate various data quality service levels. So as you're looking at these SLAs, I'll tell you a quick little story. I was working with a company whose basic business model was to sell magazine subscriptions. And when they got the magazine subscription information, the address where they were sending the physical magazine, they'd use that information to sell other stuff to the customer. I want to say it was kind of a Time-Life-like thing. It was not Time-Life, but you know how when you buy something from Time-Life, all of a sudden they've got lots of other things they might be interested in selling you. Well, the data coming in to them, which they had outsourced, turned out to be only about 30% accurate. That was a big problem for them. We'll set aside the fact that the paper-based magazine subscription seems to be something that is not long for this world either. Now, returning to the service level agreements: they talk about which data elements are covered by the agreements, what impact is associated with the specific flaws, which data quality dimensions are associated with each element, and then what the expectations are around these, so that we can put in place measures that allow us to succeed and understand, so that we can measure, monitor, and manage these things. We can measure in-stream, while we're collecting the data as we go, or in batch mode, where we go in periodically. We also need to determine which of the three levels of granularity we're going to apply the measurements at: it might be at the element value, it might be at the instance or record level, or it might be the data set as a whole. Each of these is a very important decision for you as you're working your way through this. Let's spend the last 10 minutes or so talking about some data quality tools.
There are categories of activities that we have to do here: analyzing the data, enhancing it, and monitoring the data quality that you have in your organization. And the principal tool sets, there are six different ones: data profiling; parsing and standardization; transformation; identity resolution and matching; enhancement; and reporting. Let's take them in order. The first one is a set that I was involved in early on at the Department of Defense. As we started to do this, I spent a large part of my time working in the basement of the Pentagon, working on data architecture and data quality type problems. And we realized that these were not going to work in a simple manual mode; there was just too much reliance on human intervention. So we put out a series of grants to the universities, and it turned out that at Columbia University, a woman named Dina Bitton, a wonderful PhD, was able to develop the basics for this process called data profiling. Most people have at least heard of this at this point in time. The algorithms allow us to go through and do statistical analysis and assessment of values within a specific data set, but then also to retain that metadata knowledge as we incorporate additional data quality work and additional data sets, so we can look and see what happens. Now, the other part that we were addressing, from the DoD perspective, was the subject matter experts that we needed to have access to. If we were working in a domain like healthcare, for example, the data quality people were not familiar with healthcare, so we had to use large amounts of time from these SMEs, these subject matter experts. The algorithms that Dina created allowed us to make inferences about the data, literally an inference engine. And when we had inferences in place, we could then go forward and say, what's actually happening?
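To make the profiling idea concrete, here is a toy sketch in Python. It is my own illustration, not Bitton's actual algorithms, and the hospital-style sample data is made up: scan a small data set, compute simple per-column statistics, and turn them into hypotheses that a subject matter expert can confirm or deny.

```python
rows = [
    {"emp_id": "101", "dept": "ER",  "age": "34"},
    {"emp_id": "102", "dept": "ICU", "age": "41"},
    {"emp_id": "103", "dept": "ER",  "age": "34"},
    {"emp_id": "104", "dept": "ER",  "age": "199"},  # suspicious outlier
]

def profile(rows):
    """Infer simple facts about each column and state them as hypotheses."""
    hypotheses = []
    for col in rows[0]:
        values = [r[col] for r in rows]
        # All values distinct -> candidate unique identifier.
        if len(set(values)) == len(values):
            hypotheses.append(f"{col} looks like a unique identifier")
        # All values numeric -> report the observed range for review.
        if all(v.isdigit() for v in values):
            lo, hi = min(map(int, values)), max(map(int, values))
            hypotheses.append(f"{col} is numeric in the range {lo}..{hi}")
    return hypotheses

for h in profile(rows):
    print("We believe:", h, "- can you confirm or deny?")
```

Even this toy version surfaces the conversation Peter describes: an SME seeing "age is numeric in the range 34..199" can immediately deny that 199 is a valid age, which is far easier than describing the domain from a blank page.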
And we would go to the users and say, not "tell us about your business," which gets you a blank stare from most people, but instead: we believe this is happening, can you confirm or deny our hypothesis? And they were much happier to edit our hypotheses and our findings than they were to generate them for us. This allowed us to understand things at both the semantic and the logical levels. In fact, with data profiling tools, you can take any data set anywhere and develop a logical third normal form model of that data set. So if you have a software package and you don't understand what's inside it, you can look at the data going into it and derive these rules. By the way, you do not need to buy a tool; the tools are now rentable. You can rent them on the web, or get them in a software-as-a-service delivery, or vendors will do it for you that way. But you can also roll your own. When I teach this in class, usually a student will come through and say, you know, I could do that with SQL. And the answer is yes, you could. SQL, however, does not have the memory; you have to add some additional components so that you can work across data sets. So again, with profiling, you can look at this and say, there are some things happening here. You can give an alert to the help desk, or you can take your business analysts and focus them on a specific aspect of the problem. Here's a slide from a company we work with a lot called Global IDs, and I'll thank Arka for this particular slide, because it shows the process that his software can go through. It's one of many tools that are out there, but it's a fairly comprehensive one. And you'll notice at the start, they discover all the tables; for each table, they discover the metadata; for each schema, they discover the relationships.
And for each schema, they go in and get a metadata report; moving on to the columns, they go through and profile each column; and then for each schema, they double-check the numbers for all columns. You can see this is a lot of work. And it turns out that Arka has recently reconfigured the engine to run on Hadoop, which as you know is a big data technique, so that he can do a lot of this in parallel. I think he's got a fairly significant advantage out there. I don't want to give him too much of a commercial, but it is stuff that we use a lot, and our customers are finding a lot of value from it. Data quality tool number two is parsing and standardization. This is the idea that we're going to look for specific patterns. You may have seen this in address normalization; the post office does a terrific job of trying to identify these things, so we don't end up with things like "Street" and "St." as two different values. What do we have as the standard representation? You've probably also seen it where you place an order online and the form says, just put in your zip code and I'll figure the rest out for you. Or it may ask you which of two counties you are in, if the zip code happens to cross multiple counties. Here's an example: a blanket rule that says, for column one, make every value male or female, is the wrong thing to do. We in fact want to test each value: if "F" then female, if "I" then indeterminate, and so on. There's nothing exotic in any of this. These data transformation tools allow us to go in and map the original values onto whatever target representation we'd like to have, and we can now do this with a rules-based engine.
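A minimal rules-based standardization engine in that spirit might look like the sketch below. The specific mappings are hypothetical examples in line with the "if F then female" rule; the important design choice is testing each value and flagging anything the rules don't cover, rather than blindly overwriting a whole column.

```python
# Rules map source values onto the standard target representation.
GENDER_RULES = {"F": "female", "M": "male", "I": "indeterminate"}
STREET_RULES = {"St": "Street", "St.": "Street", "Ave": "Avenue", "Ave.": "Avenue"}

def standardize(value, rules):
    """Test each value against the rules; never guess at unmapped values."""
    if value in rules:
        return rules[value]
    return f"UNMAPPED({value})"  # route to a human for review

print(standardize("F", GENDER_RULES))    # → female
print(standardize("St.", STREET_RULES))  # → Street
print(standardize("X", GENDER_RULES))    # → UNMAPPED(X)
```

The `UNMAPPED` marker is what keeps this a quality process: each flagged value becomes a candidate new rule, so the rule set grows from evidence instead of assumptions.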
Identity matching, another set of tools here, looks at whether the matching is deterministic, relying on patterns and assigned weights and being repeatable, or probabilistic, as in: there are 52% women in the United States and 48% men, so in that particular example it's probabilistically more likely that a given record is a female rather than a male. Then there's enhancement, again, adding additional metadata onto things: we can add timestamps, auditing components. You've probably all heard that your smartphone now geocode-stamps the pictures you take. So if you take a picture of somebody and post it on Facebook, somebody else can take that picture and figure out what part of the country you were in when you took it, and whether you were in fact at home or not. That's a little bit scary, and we should be a little bit careful about what's going on in that area as well. The sixth tool is the class of reporting tools that just allow us to go in and see what's going on from an overall perspective: whether or not these service level agreements are working, and how things are working in our organization. So that's the set of tools, real quick. Let's take a couple of takeaways in the last couple of minutes that we have here.
Again, what we're trying to do with data quality engineering is to develop and promote awareness; to get people developing requirements; to learn how to process and analyze the data; to look at measures and metrics, business rules, quality requirements, and service levels; to learn how to monitor our data quality and manage the data quality challenges; to learn how to correct, clean, and document the rules that we're finding; and to implement the improvements that we find, because this feeds back into requirements for new systems, inspection policies for existing ones, and the changes that go into them. Which means we really have to understand a way of recording the expectations in the business rules, a way to measure the quality within each dimension, and what the acceptable thresholds are, so that we can get to fitness for use. So we've spent about an hour here walking through this data quality engineering portion of our area. I've included some additional information here for you, just so that you know what's coming along, and I'll show it to you so that you can see it. These are the dimensions of data quality: the value quality, the representation quality, the model quality, and the architecture quality. So that's just a little bit of reference information there for you. And with that, we are just about at the top of the hour, and I will turn it back over to Eileen so she can talk to us about what sort of questions you guys have. Great. That was a great presentation. Thank you, everyone; now it's time for our Q&A. If you can't currently see the Q&A panel, just move your mouse to the top of the screen, and you should be able to get it back that way. We did have some questions and comments coming in throughout the presentation, so I'll just start with those. Our first question actually sparked a few comments. Let me read it back to you.
The polling question, Peter, was: does your organization address or plan to address data and information quality issues? And some people said, well, one of the options that should have been included was "none of the above"; I guess we were just trying to be optimistic that something is being done, that it is being recognized. And somebody also pointed out, what about "ongoing" as an option; shouldn't this be something that we address on an ongoing basis? Absolutely valid points. And just to respond, this is an interesting component: when we did the survey the first time, we did it wrong, in exactly this fashion. Gosh, the first one we did was about 15 years ago, and we've kept that error in there all the way through so we can compare like results, but you can see the importance of getting it right. So again, thank you for pointing that out. That certainly is a data quality error, and unfortunately, we're stuck with it for a while. Okay, we had another question on the section where you talked about how there are two distinct activity sets that support quality data, and there's a slide on the practice-oriented activities. One of the bullets was CRC checking of transmitted data; what exactly is that? Sure, great. Let me get to that slide here. So, that's a practice-oriented activity, and the idea is that for a dataset being transmitted, whether it is a single file or a group of files, we can build what's called redundancy checking into it. We add across the row, and the answer should always be odd or even, for example, or something along those lines. It's a lower level of security than a signed email, but it is something that you can build in so that if somebody receives this row of data, or a collection of data, they can see whether or not it was, in fact, transmitted properly, whether a transmission error occurred along the way.
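The "add across the row" idea is essentially a parity or checksum scheme. Here is a toy sketch in Python, with the row data made up, put side by side with the standard CRC-32 from Python's standard library to show why CRC is the stronger check:

```python
import zlib

def row_parity(row):
    """Toy redundancy check: parity of the byte sum across the row."""
    return sum(",".join(row).encode()) % 2

sent = ["101", "ER", "34"]
sent_parity = row_parity(sent)
sent_crc = zlib.crc32(",".join(sent).encode())

# Two transmission errors that happen to cancel out in the sum:
# 'R' -> 'Q' shifts the sum by -1, '4' -> '5' shifts it by +1.
garbled = ["101", "EQ", "35"]
print(row_parity(garbled) == sent_parity)                  # True: parity is fooled
print(zlib.crc32(",".join(garbled).encode()) == sent_crc)  # False: CRC detects it
```

A simple parity sum misses any pair of errors that cancel out, which is exactly why CRC, sensitive to both the values and their positions, became the standard engineering practice Peter describes next.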
So, CRC checking is a very standard engineering technique used by many, many organizations, particularly as you're transferring stuff to or from the cloud, to make sure that what you get is, in fact, what you were supposed to get. It doesn't tell you about the quality of the stuff in there, but it does tell you that nothing got changed during the transmission of that data. So, I hope that's helpful. If you haven't seen that before, it's certainly something to look at, in particular if you're doing a lot of data transmission around your organization, or you happen to be working under adverse conditions where transmission errors often occur. Question: could you clarify the term metadata? When does data become metadata? So, that's a great question, and it happens to lead into the presentation we're going to do next month, which is on metadata. So you can learn a whole bunch more about it next month, but data and metadata are terms that are quite often confused because they're improperly used. I'll get into this a little bit more in the next webinar, but I'll give you the Gartner definition, which is a very good one. Gartner defines metadata as anything that adds value to data. Because it adds value to data, metadata therefore requires management and management attention. I don't have a slide on that right now, and I apologize; we will put that into the next webinar. But let's just go back to my slide on 42. When we talk about 42, some of you may think of the movie 42 that is out right now. Some of you, when you think 42, think of Douglas Adams and The Hitchhiker's Guide to the Galaxy; there it's the meaning of life. So there are several interpretations: my age 12 years ago, the meaning of life, or the baseball movie, which I think is about Jackie Robinson. The metadata tells you which one we're referring to with that 42; that's the value it adds.
In other words, if I said let's go see 42, and you don't happen to like tongue-in-cheek science fiction, you probably don't want to go see The Hitchhiker's Guide to the Galaxy movie, but you may in fact want to go see the movie about Jackie Robinson. Somebody correct me if I don't have that right about Jackie Robinson; I can't get to my screen right at the moment. But anyway, metadata is this additional piece that goes along and adds something to the data. Again, suppose I'm trying to find out whether somebody is old enough to buy alcohol. In Richmond, Virginia, when I was that age, the metadata used to decide this was whether you had facial hair or not. If you had facial hair, a mustache, a beard, a goatee, anything along those lines, you were probably over 18, which was the age at which we could buy liquor, and you were allowed to have it. So people would actually put fake mustaches on and things like that, because that was the metadata you needed in order to be able to buy beer. I hope that's an answer for you on metadata: it's something that adds value to data. Oh, by the way, let me give you one other piece on this too, as a preview of the upcoming webinar. People ask, when I'm looking at something, is it metadata or not? That's the wrong question to ask, because anybody's data can be anybody else's metadata. Metadata really involves a shift in the level of abstraction around something, which gets a little bit tricky and is not really helpful. The better question is: should we include this item in the scope of our metadata practices? That's a very useful question to ask, because there's value you can obtain from answering it. In fact, let's talk about that for another quick second here, because there's been a lot of talk of metadata this week, hasn't there?
You know, the NSA and domestic surveillance: the President came out the other day and said, look, we're not looking at what your telephone conversations were; we're simply looking at the fact that you had a conversation. And by the way, it's all legal under the law today; that's very, very clear. But again, I'll take an example that somebody at the Electronic Frontier Foundation put out just recently. If I'm making a phone call to a suicide prevention hotline and the origin of the call is the middle of the Golden Gate Bridge, we can learn some things about what that call was about, even without the content. Similarly, if I happen to call a 900 number and have an 18-minute conversation, and that 900 number sells conversations with delightful lasses, that may be something we can infer from as well. So again, metadata is a key piece to pay attention to. And by the way, if you have poor quality metadata, you're certainly not going to be able to have quality data. So that leads into another piece, but I hope that little five-minute digression on metadata was helpful to you. If not, please ask another question. Question: as someone majoring in informatics, data mining, or data engineering, what is the best piece of advice you could give me? First of all, if you're interested in this, we do find that it usually takes people about 10 years of IT work before they decide that this stuff is important, so congratulations to you for discovering it early on. What I would do is see how much practical experience you can get. And one other piece of advice too: while book learning is good, there's nothing that substitutes for the real, actual learning that occurs on the job. Again, I'll refer you back to my blind-men-and-the-elephant scenario here, where we look at this and try to figure out what is metadata, or in this case, what is data quality?
Because it ends up being different things to different people as we look at it. There we go, there's my blind men and the elephant slide. We can teach you things at college and university, and we've been working on that for a long time, but that will not make you into a person who's able to do this. The soft skills that you need come from real-world experience. And so, if you're able to volunteer, I don't imagine any operation in the world is going to object if you walk in there as a person who says, you know, I want to do a job for you guys, but I'd also like to help you improve the quality of your data. When people do that, most of the time people say, pull up a desk, get him a computer, and let him get started. I said "him." Boy, that's terrible, I apologize; it could have been a woman as well, and hopefully you'll forgive me. There's a quality problem right there. So there are some experiences you can get. Almost all universities will let you do what we call an independent study, an internship of some sort. And if you have trouble with that, reach out and give us a shout; we can maybe arrange something for you as well. Good luck to you in your career. Oh, let me give you one other piece of advice too. I know you only asked for one, but when things go badly in IT, we call them science projects. And I'm really kind of afraid that people are calling themselves data scientists without understanding the context for it. So I'm not sure I would go and label myself a data scientist in that context; I think we've got to do a little bit more work on our career fields and labels before we get into that. But certainly, if you walk into organizations and say, I'd like to work on data quality problems, most of them don't have enough time and resources to devote to that, and they will be very grateful to you for paying time and attention to it. I can also recommend a reading list to you.
You can see it at the end of the presentation here. But please do reach out; I'd love to find out who that is. Next question: given the complexity of data quality processes, what is the fastest way to become conversant with them? Can best practices be learned by reading the book, or is the devil in the details? For example, is data quality 80% of the work regardless? That's a great question. One of the best examples here is that when I was with the Department of Defense, we started in 1992 to work on the Y2K problem. Some of you may not remember or be familiar with the Y2K problem, but it was the computer problem arising from dates that were represented by only a two-digit year, as in today's year would be 13. And if we wanted to find out what happened two years ago, we would subtract 11 from 13 and get two. But of course, if you subtracted my birth year, 59, from 13, you'd end up with a very different number and, in fact, an incorrect one. Some have called it the biggest data quality problem that we've actually successfully addressed, and it had similar complexities. So one of the things we did, and I was working with a fellow named Rush Richard, who was just a terrific guy to work with and a mentor of mine for a couple of years there, was, as Rush put it, to put it all in context. We developed essentially a flow of how data was moving throughout the Department of Defense, and this model allowed us to go in and see where we could be most effective. So again, I'll refer you back to the structure that I used earlier on from Tom Redman; let's see, that's the slide here. If the data quality challenge is just in the lake, you're at least informed enough to be able to say, hey, let's work on just the lake. But if you don't know, it may in fact be something larger than that, and then we may need to go upstream and not look only at the lake, because the data doesn't stick around in the lake for very long.
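The two-digit-year arithmetic Peter describes can be sketched in a few lines. This is a hedged illustration, not any real DoD code; the function name is made up for the example:

```python
# A minimal sketch of the Y2K two-digit-year problem described above.
# Years are stored with only two digits, so 2013 is just 13 and 1959 is 59.

def years_since(current_yy: int, event_yy: int) -> int:
    """Naive elapsed-years calculation using two-digit years only."""
    return current_yy - event_yy

# Within a single century the arithmetic works: 2011 was two years before 2013.
print(years_since(13, 11))   # 2

# Across the century boundary it breaks: someone born in 1959 should be
# 54 years old in 2013, but the two-digit math gives a negative answer.
print(years_since(13, 59))   # -46

# The remediation was to widen the representation to four-digit years.
print(years_since(2013, 1959))  # 54
```

The same subtraction produces a correct or a nonsensical answer depending purely on how the data is represented, which is why Y2K is fairly called a data quality problem rather than a logic bug.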
So getting this contextual information will help you understand what type of problem you're facing, and then what type of tools you can use. And again, remember, we've covered a lot here. What dimension of data quality are we looking at? What phase of the life cycle are we looking at? What set of data quality attributes are we looking at? What type of data tools are we looking at? And finally, where in this context do we apply engineering and architectural concepts, and where should we apply human beings? All of those are complicated questions, but our customers are not confused when we walk in. They come to us and say, we've got a data quality problem, and we present these kinds of frameworks. It takes them a little time, about as long as you all have spent on this today, before they're grounded in it and can say, okay, here's an area we think would be useful, here's a tool that looks useful; let's get started and see if it actually works. Again, I hope that answers your question. Next question: if NASA has well-built preventive controls to ensure a clean river, is there still a business case to build and monitor data quality on an ongoing basis? That's a good one, isn't it? So in other words, if you have in place things that keep polluted water from going into the lake and keep it clean, it sounds to me like you've done an excellent job and should write a book about it, because we certainly can use some guidance and some success stories in those areas. And if you think you've gotten them all but you haven't tested for it, I would advise a little bit of testing before I'd start declaring success. Again, data quality problems can be quite insidious and get into things that you just don't expect, like the blind men and the elephant piece there. So congratulations to you if you've gotten there. If not, double-check it yourself every which way you can.
And then document that the controls are indeed in place and demonstrate the value of that, because it certainly cost time and money to put those preventive measures in place. If you've been successful in this area, there's clearly a demonstrated business value there, and I would put some numbers behind it so that people understand that by investing X, you are clearly saving your organization X plus something. Again, you're tracking business value; that's the core of our Data Blueprint philosophy here. The next question is around some tools: can you name some metadata tools? Metadata tools? I sure could. We could look at the Platinum Repository, or the Rochade metadata tool, or the TRUX metadata repository. But I'm not sure naming them is going to be helpful to you. Again, what you're trying to do is figure out what you're trying to accomplish, and each of those tools has its own strengths and weaknesses. We'll talk more about them next time through, but I suspect you have a more detailed question than that. Let's see if maybe they comment and provide a little more detail; in the meantime, we can move on to the next one. Next question: you've expressed that it's not possible to address everything at once. With limited resources, where should one start? What is the most important aspect to focus on? Again, let's go back to context. Great question, and nobody has the resources they need to get all the data perfect in their entire organization, unless you're dealing with a relatively confined set. The real question is, what is the biggest area of risk in your organization? And if you don't know that, you can probably have a conversation with your C-level executives in the organization. And I'll put in a little plug for my book here as well, because that is one of the reasons we don't see a lot of this knowledge existing at the C-levels of organizations.
It's not that people are bad or that people aren't smart, but people don't know what they don't know. And so, nobody is in charge of data as an asset in your organization. I guarantee you, if you go to your chief financial officer and ask, what is the biggest financial risk facing the organization, they will be able to list the top three things that keep them up at night. If you go to your CIO and ask them about technology risk, they will be able to tell you the top three things that keep them up at night. I don't think anybody in most organizations thinks of data as an asset that needs to be managed in the same manner as your cash or your HR or other types of organizational assets. So, I would actually go to the organization and say, what risks are you facing organizationally? And then backtrack from there and ask, what role does data play in those risks? For example, I'll make up something here on the spot: my university, Virginia Commonwealth University, has been building dormitories for a number of years. If we all of a sudden offer the students a package that costs beyond the financial means of most of the students, those dormitories are going to sit empty. Somebody else observed that some of the big box retailers might see their stores sit similarly idle at some point, because of the risk of customers not being able to get access to the transportation they need. So, I don't know your organization, and I don't know where you're asking from, but what are the things that are keeping the executives up at night? What are the things that cause them to worry? Then look at how data plays a role in those worries. And more importantly, grab a copy of my book. It's not for you, it's for your boss's boss, but get them to read it so that they will understand that data, as our sole non-depreciable, non-depletable, durable, strategic asset, is something that we should pay attention to and manage well. Thank you for your question.
Next: big data is on top of everyone's mind. Are data quality concerns a barrier to developing big data projects? Good question. So, big data is about trade-offs. And I don't even like to use the term big data, because I guess that means the rest of us are doing little data. But big data techniques do offer us the ability to gain insight into things faster. We gave a webinar two months ago called Demystifying Big Data, and one of the things we talked about were the types of trade-offs. If you look at big data techniques, you're looking at different types of things. If you do a quick Google search on ACID versus BASE, you will see some of these trade-offs that have papers written about them. ACID stands for Atomicity, Consistency, Isolation, and Durability, which means you're going to get a correct answer. As in, do I have enough cash in my account to withdraw $100 without overdrawing and incurring penalties and fees? That's a very precise transaction. Big data is about eventual consistency. And so it's not the technique that you would want to use to tell me whether I could, in fact, withdraw $100; rather, I would use big data techniques to say that about $100 a week seems to be the amount you can withdraw and still keep your bank account solvent. So there are different aspects of quality in big data techniques, and they are good. And again, technology is neutral on this, but if you don't understand the various trade-offs that are being made, it's going to be very, very difficult to apply. You would not expect, for example, the same precise representation quality of the data, because big data is much more about patterns and fuzziness than it is about specificity. The question is certainly a good one. The next question is regarding data governance and how you can use it for quality. Do you have standard best practices in terms of data governance? I was looking for the other half of the question on that one.
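The two styles of question can be made concrete with a small sketch. All names and numbers here are illustrative, not from any real banking system: the ACID-style question demands one authoritative, consistent answer, while the eventual-consistency-style question is a fuzzy pattern over history.

```python
# ACID-style question: do I have enough cash right now to withdraw $100
# without overdrawing? This needs a single, consistent, correct balance.
def can_withdraw(balance_cents: int, amount_cents: int) -> bool:
    return balance_cents >= amount_cents

print(can_withdraw(25_000, 10_000))  # True: $250 covers a $100 withdrawal

# Eventual-consistency-style question: roughly how much does this customer
# withdraw per week? A pattern over noisy history; exactness is not the point.
weekly_withdrawals = [98.50, 105.00, 101.25, 95.75]  # hypothetical history
approx_weekly = sum(weekly_withdrawals) / len(weekly_withdrawals)
print(f"about ${approx_weekly:.0f} a week")
```

The first function would be wrong if the balance it reads is stale by even one transaction; the second answer is perfectly serviceable even when the history is incomplete, which is exactly the trade-off Peter is describing.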
So there are some best practices around data governance. And gosh, what a great time to make a plug, because Eileen and I are getting on airplanes next week and we're going from Richmond to San Diego, where we're going to be hosting the data quality and information governance conference that we do out there once a year. And there will be a bunch of best practices presented there. So if it's not too late, hop on an airplane; I think DeVita told me over 300 people are going to be attending. It should be a really exciting event. I will be speaking on an aspect of governance that is relevant to your question, in a talk I'm doing with a customer of ours, Michael Matura, who has become a great friend over the years. It's called The Three-Legged Stool. One of the lessons that his organization has learned is that buying the technology alone was insufficient to achieve the quality that their organizational business practices depended upon. The other two legs of the stool are people and processes. So buying technology alone is insufficient because it doesn't give you the architectural support that you need to have. You also need to include the right people, with the knowledge, skills, and abilities, as some of you on this call clearly have, and, in addition to that, the processes around it as well. So as guiding practices, I can give you a couple of things that will be helpful. First of all, the governance group should be reporting to the business and not to IT. That's a little argument that I make in the book, but IT does not feel the pain of data quality errors; the business feels that pain. So to reduce the communication gap, we definitely want to have the data governance group reporting into the business. In fact, the chief data officer should be reporting in at the same level that the chief information technology officer reports in at as well.
Another component of that, though, is that the language of data governance is metadata. Metadata is what will control the various service levels that you're going to be asking for in your organization as well. Again, each of these is going to be important. Obviously, we won't be covering a whole lot of metadata today, but we certainly will get to those issues as they come up in the next session. So, again, the language of data governance is metadata. Please make sure that your data governance group is, in fact, speaking metadata, because if they're not, they're losing the opportunity to get some real value out of what's happening there, and, of course, you end up with additional translations, which are problematic all the way around. Did I get all the parts of that question, Eileen? I think so. Yep. So let's do the next one, because we still have plenty of them. Do you expect data quality as a service, that is, the automation of this process, to ever become possible? Well, it's certainly possible through service-level agreements now, and it's also possible to use data quality tools that are cloud-based, which is software as a service. So I think the short answer to your question is yes, but just like we've learned with outsourcing, the hard part about this is not the technology; the hard part is getting the process correct. Again, we've had organizations that have outsourced different aspects of their business, and they sent the work over to some very fine folks in a different part of the world who correctly implemented the things they were asked to do, but it still didn't meet the business needs because, unfortunately, one group was operating at a different level of capability maturity, CMM, than the other group.
And so, even though they speak the same language and share the same vocabulary, they still miscommunicate in many, many instances, and it's problematic all the way around. So, absolutely, I think it's possible, but it's a challenge. Next question: what is the best way to quantify costs associated with poor data quality? Oh, great, that's the topic of my book coming out this fall, called Monetizing Data Management. So I'll give you one example here; it's just one example, and there are many, many more. And actually, I think we do a monetizing talk later on this year, don't we, Eileen? Yes. She'll figure out when that's going to be. But again, I'll go back to the example we had earlier, where an organization said they had a data quality problem, understood how they were going to have to fix it, and thought they were going to spend a lot of time manually fixing it. Hopefully you saw from the example that in this particular organization we saved literally millions of dollars and centuries of person-work. That's a very good way to go about it. I'll give you another three-letter acronym to look into, something we found very helpful: it's called Activity-Based Costing. If it's unfamiliar to you, give us a shout; we can send you some white papers and things on that. I think we've got some things. Yeah, we did write up some things on this. And again, we'll have a whole book coming out on it in the fall. And I checked: the webinar is going to be in October, so it might be right before the book comes out. Right before the book? Good timing. Yep. Nice. She's expert at what she does. Let's see, the next question. Analysis, cleansing, enhancement, and monitoring are data quality engineering activities; which should you focus on if you were in a small group with an enterprise focus in a large organization that hadn't embraced quality improvement as a whole? So you're taking a bottom-up approach; you haven't gotten the executive mandate to do this.
And you're trying to figure out how to make a difference and gain executive awareness. I continue to make an argument, and I have yet to be proven wrong, and that is that I have not found an IT project that has gone bad that didn't have a data quality problem as its root cause. So I would look at your IT problems and challenges, and hopefully you're not going to tell me that your IT projects come in on time, within budget, with full functionality, because if that's the case, it knocks the legs out from under my argument. I believe that if you poke around, it will take just a little effort for you to find that the organization has a challenge. Let me give you an example from a bank that I worked with not too long ago. As we were working on a new system and rolling it out, we identified a data quality challenge that was a structural data quality challenge: they were using the wrong structure for a major component of their architecture. We pointed it out to them and said, hey, this is a problem; it's not going to work the way you think it's going to work, and we'd recommend that before you go live, you in fact fix this particular problem. They asked how much it was going to cost, we gave them an estimate, and they looked at it and said, no, we're going to go live; we don't think it's worth that to fix. Well, we were able to measure the overtime that their people had to put in over the next 18 weeks as they were fixing all of the problems that occurred from what we had flagged. And it was clearly 10 times more in this particular instance than the estimate to fix the problem correctly up front. So we could have postponed the go-live by a couple of weeks, spent some more money, and saved them 18 weeks of pain and, more importantly, 10 times the cost, by fixing the error early. So that's a very easy way to take a look at this and figure out exactly what's going on.
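The arithmetic behind that story is simple enough to sketch in an activity-based-costing style. Every figure below is hypothetical, chosen only to reproduce the roughly 10x ratio Peter quotes; only the 18 weeks comes from the story itself:

```python
# Hypothetical figures in the spirit of the bank story above: compare the
# estimated cost of fixing the structural problem before go-live with the
# measured cost of 18 weeks of overtime cleanup after going live anyway.

fix_early_estimate = 50_000      # assumed pre-go-live fix estimate, dollars
overtime_hours_per_week = 400    # assumed cleanup effort across the team
hourly_rate = 70                 # assumed loaded overtime rate, dollars/hour
cleanup_weeks = 18               # from the story: 18 weeks of overtime

fix_late_cost = overtime_hours_per_week * hourly_rate * cleanup_weeks
print(fix_late_cost)                       # total spent cleaning up late
print(fix_late_cost / fix_early_estimate)  # roughly 10x the early estimate
```

This is the kind of bottom-up tally that lets you walk into an executive's office with "investing X would have saved 10X" rather than an abstract appeal to quality.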
Again, if you look at your IT projects, I think you'll probably find something similar. One more quick example here; we're getting close to the end of the session, and I don't know if Shannon can let us extend or not, because we do have some questions remaining. One of the other big government systems that we're working with has three different ERPs being implemented, and each one is being implemented by a different cost and contracting group, with nobody coordinating their data. Consequently, we know that these three systems will not interoperate, even though they are all the same brand X system. These are the kinds of things that we can identify in advance; if nothing changes, they will not come to a happy outcome. Again, if you're working on this area from a bottom-up perspective, get some visibility at the C-level, find out what risk areas keep people up at night, and see if you can find some way of giving them insight into how to reduce those risks. So it looks like we're right at time, at least here on the east coast. Since you all submitted your questions through the chat, we do have the chat log, so what we'll do is write up answers to the unanswered questions from the chat log, and you'll have a complete transcript of the Q&A in about two business days, when we get you the follow-up materials through Dataversity. So, thank you, everyone, for participating in today's event. It was a great, active session and we hope you've enjoyed it. Thanks again to Dataversity and Shannon for hosting us. As I said, you will receive today's materials within the next two business days; it just takes us a little time to put it all together. Next month we're actually starting a three-part series on Data Systems Integration and Business Value, and the first part will focus on metadata.
Hopefully you'll be able to join us for that as well. As always, feel free to contact us if you have any questions. Thanks everyone, and have a great day. Eileen and Peter, another great presentation, as always. And just to let everyone know, we do have a new partnership with Morgan Kaufmann, so I'll be sure to get you the discount code in the follow-up email as well for Peter's new book. So, thank you for attending. I hope everyone has a fantastic day.