Hello and welcome. My name is Tenley Proudfoot and I am the Digital Production Assistant with DataVersity, stepping in for Shannon Kemp, our Chief Digital Manager. We would like to thank you for joining today's DataVersity webinar, Data Quality Strategies, the latest installment in a monthly series called DataEd Online with Dr. Peter Aiken, brought to you in partnership with Data Blueprint. Just a couple of notes to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. If you would like to chat with us or with each other, we certainly encourage you to do so; just click the chat icon in the bottom center of your screen to select that feature. We will be collecting questions throughout the webinar, which can be submitted using the three dots in the bottom middle of your screen, or via Twitter using the hashtag DataEd. And to answer the most commonly asked question: as always, we will be sending a follow-up email to all registrants within two business days containing links to the slides and the recording of the session, along with any additional information requested throughout the webinar. Now let me introduce you to our speaker for today, Dr. Peter Aiken. Peter is an internationally recognized data management thought leader. Many of you already know him or have seen him at conferences worldwide. He has more than 30 years of experience and has received many awards for his outstanding contributions to the profession. Peter is also the founding director of Data Blueprint. He has written dozens of articles as well as 11 books; the most recent is Your Data Strategy. Peter has experience with more than 500 data management practices in 20 countries and is consistently named a top data management expert. Some of the most important and largest organizations in the world have sought out his and Data Blueprint's expertise, and Peter has spent multi-year immersions with groups as diverse as the U.S. Department of Defense, Deutsche Bank, Nokia, Wells Fargo, the Commonwealth of Virginia, and Walmart. And with that, let me turn everything over to Peter to get today's webinar started. So hello and welcome. I'm at the Data Architecture Summit in Chicago, where the rest of our team is as well, and we're having a great time downstairs; I had to duck out of that to do this. We're going to talk specifically today about data quality, which is not on the program for the Data Architecture Summit, but is nevertheless an important topic. So today what we're going to look at is data quality in the context of our data management profession. I'll give a specific definition for data quality, and I'd like to add the word engineering to the end of it, because most people really don't understand that this is an engineering discipline and must be applied as an engineering discipline, as opposed to strictly an attempt to clean up data. We'll talk specifically about the data quality engineering cycle and a couple of contextual complications around the whole process, look at data quality causes and data quality dimensions, which are not widely understood, and then look at quality in the data quality life cycle, finishing up about an hour from now with some toolkits. Then we'll get, of course, to the question and answer period, which is really where we have a lot of fun with this. The idea is that you'll have a better sense, as you're getting started on your data quality journey, of what actually needs to happen.
And just to shortcut some of that discussion, one of the most important takeaways is that, yes, we need some tools to do this, but tools are not the place to start; it's really more of a people and process problem than a tool problem. Now, I start this particular session by saying that my spouse is a horse person and I'm what's called a horse husband. What that means is that she has a big t-shirt that says I love my husband, and then in fine print at the bottom it says, almost as much as I love my horse. You may be asking, why are we talking about this? Well, it turns out that part of the deal is we're building a barn together, because we're both horse lovers. And the overall approach to building a barn is interesting when you compare it to how we do large, complex IT systems. I took these pictures of the barn to prove one thing in particular: that we had passed a foundation inspection. Now, the bank that loaned me the money to build the barn actually gave me only enough money to build the foundation, not the entire barn. That's an interesting approach, because the bank understands that without a good quality foundation, I may build a good barn on top of a poor quality foundation, and if I do that, I will spend more in vet bills for my horses than I will paying the bank its loan back. My point in illustrating this is that there is no IT equivalent in our context, because most people don't really understand that data management is an awful lot like Maslow's hierarchy of needs. We start out with our food, clothing, and shelter needs. If they are unmet, then we can never be safe. If our food, clothing, and shelter needs are met, it doesn't mean that we will be safe, but we cannot be safe if those needs are unmet. Same thing with safety: it is a necessary but insufficient prerequisite to being part of something that is larger than ourselves, the concept of love and belonging, family, and things like this. And if you're not part of something bigger than yourself, you really have trouble getting to know yourself, having your own identity as part of that larger piece. Moving on up the chain, each of these levels, orange, yellow, green, and then blue, is a necessary but insufficient prerequisite to where we'd like to be, which is self-actualization. I say that, and spend a minute on it, because data is an awful lot like that. There's a huge technology focus, where people are trying to sell your organization all kinds of really good products that work very, very well. But buying those products really is just the tip of the data iceberg. Underneath, it requires foundational practices, one of which, of course, is data quality. These are capabilities. They are not technologies. They are people- and process-related issues. Consequently, it's very difficult to get people to not buy the stuff at the top, because that's what's being advertised, and instead concentrate on the things down underneath, which are part of a discipline called engineering and architecture that is almost entirely missing from most college and university curricula. The idea is that if we do these foundational practices better, then, just like Maslow's hierarchy of needs, they are necessary but insufficient prerequisites for success. And you can see that starting with just the tools at the top is never going to lead to success.
And we're always asked at Data Blueprint, can you do this faster for us? The answer is: yes, we can work faster, but if we do that, it will take longer, cost more, deliver less, and present greater risk to the organization than if you instead learn to crawl, walk, and run your way up to the top of that pyramid. These are the same five data quality practice areas that I had on the previous slide, now with some definitions, so we're actually in pretty good shape. We can say that it is important to manage your data cohesively; right now, your data is probably managed very well at the work group level, which is the defining characteristic of a work group. You want access to a set of talented people who know how to manage these assets professionally, the data governance professionals the field has developed over the years. Quality, of course, is the idea of maintaining data that is fit for purpose and building it in an effective and efficient manner. And then we have the platform and the architecture, which say you're doing it with the right tools and the right processes. Of course, you need some supporting processes around all this, and this is the part most people are unaware of, but it is foundational to this type of discipline. There's a scale out there that says you get one point for having a pulse, two points for having a repeatable process, three points for having a process that is documented, four points for measurements that occur within the process, and five points if you get together, look at those measurements, and decide what should be reorganized in order to more fully optimize for your particular organization. This is the CMMI scale. Many are familiar with it; in fact, even if you're not, I'm pretty sure your boss is. The capability maturity model has the best track record of providing process improvement, and of course we're applying it to data here, so if you just say this is CMM for data, your boss will understand it. And I'm going to give you an example right now where I rate four of these areas, data strategy, data governance, data platform, and data architecture, at a level three. That just says we're doing repeatable work with documentation around it, but that the data quality area is the weak link in the chain. What it means is that the entire foundation of these practices can only be as strong as the weakest link. An organization might put more money into data governance, which would be a reasonable thing to do, except that it won't produce any better results until they take the data quality piece and bring it up to the same level as the others. Unless you have threes all the way around, it's impossible to get a higher rating and, more importantly, higher performance. It's just the same as the foundation of the barn I showed you a minute ago: if I had built the barn foundation out of, you know, cracked seashells and things like that, it would not be nearly as strong or as durable, it would certainly not pass the county inspection, and therefore the bank would never give me the rest of the money.
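To make that weakest-link arithmetic concrete, here is a minimal sketch, with hypothetical scores, of how an overall maturity rating is bounded by the lowest-scoring practice area:

```python
# Minimal sketch (hypothetical scores): overall data management maturity
# behaves like the weakest link across the five practice areas.
practice_areas = {
    "data strategy": 3,
    "data governance": 3,
    "data quality": 1,              # the weak link in this example
    "data platform/architecture": 3,
    "supporting processes": 3,
}

# CMM-style logic: spending to push governance to a 4 changes nothing
# until data quality is raised to match the others.
overall = min(practice_areas.values())
weakest = [a for a, s in practice_areas.items() if s == overall]
print(f"Effective maturity: {overall} (limited by: {', '.join(weakest)})")
```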
When we look at data management here, the DAMA Guide to the Data Management Body of Knowledge gives us a good idea of what happens. It is a good description, actually good enough to criticize, and that's perfectly reasonable; we're looking for improvement. It reminds me of George Box's quote that all models are wrong, but some models are useful. We think this is a useful one, but it's missing two important concepts: optionality and dependencies. They just aren't shown, which means people look at this wheel and say, oh, I must do data warehousing in order to do this. No, what it actually says is that data warehousing is a part of what can be done. Now the model has been upgraded; we're at the DAMA DMBOK version two. You can still see, however, that data quality is right there at eleven o'clock. Where we'd want to start is with governance, and then look in at quality; those are two areas that would be important to take a look at. So, let's take a look, from the DMBOK, at an IPO diagram. IPO stands for Inputs, Process, and Outputs. These are very nice articulations; we've found them useful over the years. I'm not going to read you every line on this chart, but this essentially is the map for what we're going to cover today. There are some specific inputs, there are some activities, and it requires some systems thinking to do this properly. The outputs, of course, have to be accounted for so that we can ensure we get the right focus within our efforts. Let's look specifically at data quality and data governance in context. We have a strategy: data strategy, of course, is what the data assets do to support organizational strategy, and it should be expressed in business goals. Data governance asks how well that strategy is working, and the language of data governance should be metadata. When we look at quality, then, governance tells us that some aspects of data could benefit from improvement, and I'm willing to bet that's probably true for all of the organizations listening to this. The data quality function then provides evolutionary feedback about the current focus: are we hitting the business objectives we're looking for? So we ought to see data quality efforts led out of the data governance organization, although they can also arise from the business, and we'd like data governance to be part of the process of continuing to clean those things up. So that's our contextual piece. Now let's get to a definition. I'm going to give you a specific model; some of you may have heard this before. I ask people, what does the number 42 mean? The answer is that 42 is the meaning of life, the universe, and everything. Now that sounds insane until you realize that there are probably many people on this call, and when I speak worldwide there's always at least one person in the crowd, who has read the book The Hitchhiker's Guide to the Galaxy. The subplot of that book, it turns out, is that the white mice and the dolphins actually run the world, with us as the experiment, and they'd like to find out what the meaning of life is. So they set up a gigantic supercomputer; it runs for 300 centuries, comes back at the end, and says the meaning of life, the universe, and everything is 42. Now the rest of you are going: I'm sorry, I don't understand what's going on here, I thought we were talking about data quality. Well, let's talk about what data means. Data is a combination of a fact and a meaning, and if you learn nothing else from this webinar today, you've learned that the meaning of life is 42. It will also tell you, if you dig deep enough, that 42 was my age 17 years ago.
Well, okay, again, so what? These are facts and meanings, and that's useful stuff; we need to manage it, but we need to go a little further than that to get information. Information is data that is provided in response to a request from a user. The user may ask, is Peter old enough to buy adult beverages? If you answer, well, his age was 42 seventeen years ago, they can probably figure out that I'm allowed to buy adult beverages. That information is useful, but to really get at good information you need to find out how information is used at the strategic level, and that requires one additional layer of complexity. When we take the information that has been requested and then see how it's being used strategically, that is important. You'll notice that if there is a mismatch of a fact and a meaning, then there is no information, and therefore no intelligence can occur. Again, just as with Maslow's hierarchy of needs, the data foundation is necessary but insufficient for the information foundation, which in turn is necessary but insufficient as a condition for getting to intelligent use of data in your business. From a data quality perspective, we've agreed for a long time, and Martin Eppler came up with a very good articulation of this, that quality data is data that is fit for purpose. That's highly subjective, and it should be. Yes, we'd love to have perfect data everywhere in the world, but it's just not going to happen; it's entirely too much work. One of the really interesting stories about all of this is that the U.S. Department of Labor and the FDA did some work early on rating the vitamin content of various vegetables, and somebody, interestingly, made a data error in their calculations, so spinach looked like it had magical properties. It was actually a decimal point that had been misplaced, which made spinach look like it was orders of magnitude more potent than other vegetables, and that is literally where the Popeye genre came from. Popeye has his can of spinach, and don't get me wrong, spinach is good and you should eat your green leafy vegetables, but they are not magical. And yet even today we have this persistent myth that spinach has some sort of super properties. Data quality, then, is synonymous with information quality, because poor data quality means less accurate information further up the chain, which means we need a definition of data quality management. Again, these are the activities that allow us to take data and improve the pieces of it that we think are important to improve; specific roles, deployment, responsibilities, and so on all come together in this. It is an absolutely critical supporting process that needs to come in conjunction with a change management discipline, and it should be a continuous process, such that somebody is always there looking out for quality. The last definition on this page is for engineering, because if you're going to do things that run hundreds of times, millions of times, tens of millions, and I've got some companies that are running things billions of times, you need an engineering discipline.
You cannot expect one-off solutions; these engineering disciplines are absolutely critical, and they're generally not well understood within IT or the business. Let me give a specific example: improving data quality during a system migration exercise. The challenge was that this organization had 2 million NSNs, or SKUs, stock keeping units. They were maintained in a catalog, and over the life of the system the data had been stored in comment fields. That comment field was a big problem, because it means less structure than we'd like to have. Long story short, we developed what would today be called text analytics and helped this organization by showing them how they could apply data quality methods in a semi-automated fashion. Another important aspect is that this shows an example of diminishing returns: you should apply automation until you get less out of it than you put into it, which means you need to measure both. One thing also, and I won't get too far off track, but if somebody tells you that they can convert your unstructured data into structured data, hand them a glass of water or a lump of coal and ask them to turn it into wine or diamonds, I don't care which. They can't do it, and it can't be done. The definition of unstructured data is that the data is unstructured, and consequently it cannot be structured. You can take semi-structured data and make it more structured, and there's nothing wrong with that, but that doesn't sound as sexy as turning unstructured data into structured data. So the way we like to refer to it is converting your non-tabular data into tabular data. The reason I use this example in particular is because it saved the government more than five million dollars and literally a person-century of work. Let's take a quick look at how we determine the diminishing returns. In order to determine whether you get more out of it than you put into it, you have to hold one side of the equation fixed; in this case, the fixed side is the weekly cost. I'm going to make up an absolutely crazy number of ten thousand dollars a week to employ two of my data engineers full-time on this. I like to have teams on projects so they can work together, and believe me, it does not cost ten thousand dollars a week to run our data engineers; I'm using that for purposes of illustration only. These engineers did a great job. You'll also see that it's important to understand expectations management: we told the customer at the outset it would probably take a couple of weeks or a month to get the results rolling, and it turned out that at the end of the first week we had matched no items extracting from this large pile of non-tabular data. With some careful work, though, by the end of the fourth week we had solved 50 percent of the problem. That was actually pretty good. We'd also discovered that 11 percent of the data was rubbish and should not be moved, which meant our overall problem space had been reduced to 39 percent. Now, the first reasonable question is: should we stop there with our data improvements? The customer said it would be worth continuing to try to get more, understanding that every week cost them ten thousand dollars. And we went on to week 14, ten more weeks, so another hundred thousand dollars of input into this particular process.
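As a sketch of the bookkeeping behind that diminishing-returns judgment, using the approximate figures from the story (the checkpoint percentages combine items matched and items identified as rubbish, and are my rounding of the numbers quoted on the call):

```python
# Diminishing-returns bookkeeping: hold the weekly cost fixed and watch
# how much additional resolution each marginal dollar buys.
WEEKLY_COST = 10_000  # illustrative figure from the talk, two engineers

# (week, cumulative % of problem resolved: matched or flagged as rubbish)
checkpoints = [(1, 0), (4, 61), (14, 91)]

prev_week, prev_resolved = 0, 0
for week, resolved in checkpoints:
    spent = (week - prev_week) * WEEKLY_COST
    gained = resolved - prev_resolved
    rate = gained / spent * 100  # percentage points per $100 spent
    print(f"week {week:2d}: +{gained}% resolved for ${spent:,} "
          f"({rate:.3f} points per $100)")
    prev_week, prev_resolved = week, resolved
```

Each checkpoint buys noticeably less per dollar than the one before, which is exactly the signal the customer used to decide whether another week was worth funding.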
By then we were down to only nine percent of the data that hadn't been matched. For the last five weeks, the customer said, okay, in this case I'd like you to go after this one specific item, and that would be worth as much as another hundred thousand dollars to us. We put in five weeks, solved that problem, and were able to deliver the results for $50,000. As I mentioned before, the ignorable items kept accumulating: we were at about 12 percent by week four, but by the time we finished the project, 22 percent of the data was rubbish and did not deserve converting, which meant that 70 percent of the items were matched. And I don't know about you, but the original problem space was two million SKUs, and now only about 150,000 of them needed to be processed by hand. So that was the calculus that went into this. Let's talk specifically about the quantitative benefits. Two million NSNs at, let's say, five minutes to clean each one gives us a big number of minutes; keep going down the chart, number of people times number of hours, and you get 93 person-years, there's my person-century, required to do this at a cost of five and a half million dollars. Now compare that with the semi-automated version: the numbers are much, much lower, coming down to a total cost of seven person-years to clean the data, at $420,000. And the most important number on this chart is the number five, because if you think you can clean each of these items in five minutes, you're really, really deceiving yourself.
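For the record, here is the arithmetic behind that manual-cleanup estimate as a minimal sketch; the hours-per-person-year figure is my assumption, since it isn't stated on the call:

```python
# Reconstructing the manual-cleanup estimate from the chart's inputs.
items = 2_000_000
minutes_each = 5                      # the "most important number"
hours = items * minutes_each / 60     # ~166,667 hours of hand cleanup
person_years = hours / 1_800          # assuming ~1,800 work hours per year
print(f"{person_years:.0f} person-years")   # ~93, roughly a person-century

# Semi-automated alternative quoted on the call: ~7 person-years and
# $420,000, versus roughly $5.5 million for the fully manual approach.
```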
So let's talk about some specific data quality misperceptions: that you can fix all the data (you can't); that data quality is an IT problem (it isn't; IT is neither qualified for it nor, it's just a sad fact, incented to care what is in there; their attitude is, if they can connect you to the server, my job is done); that the problem must be in data sources or data entry; that the data warehouse will provide us a single version of the truth; that the new system will give us a single version of the truth; or that standardization will eliminate all of these problems. Well, these are nice misconceptions, but data quality problems are very much like the challenge of the blind men and the elephant. Everybody knows that story: each man touches a different part of the elephant and ends up thinking the elephant is a different thing based on his exposure to it, and data quality turns out to be exactly that way. Most organizations approach data quality problems by only seeing them from their immediate perspective; there's little cooperation across boundaries, and that leads to confusion, disputes, and very, very narrow thinking. The solution is that data quality engineering, as a discipline, can achieve a more complete picture and facilitate the cross-boundary communication channels we need. So we land back on: data quality means data that is fit for purpose, the only definition that makes sense. Let's take that and move a little further. I did some writing in the XML world, and I asked, why haven't more organizations taken a more proactive approach to this? The answer is quite simply that fixing data quality problems is not easy, and once you start stirring things up, you might make them worse; also, since you did such a good job spotting the problem and brought it to our attention, you get to fix it now as well. All right, let's go a little further. There are many articles I see all over the place: four ways to make your data sparkle! Just prioritize the task, involve the data owners, keep future data clean, and align your staff with your business. Well, I don't know anybody who has followed four easy steps and actually made it work. Instead, I prefer to follow what I call a structured data quality engineering approach, which allows the form of the problem to guide the form of the solution, provides guidance for decomposing the problem into smaller bite-size chunks (again, how do you eat the elephant? one bite at a time), features a variety of tools for simplifying the understanding of strategy and for evolving a design solution, provides criteria for evaluating it, and, mainly, facilitates development. The framework, the cycle that we use, is the very same one most of these approaches work from: it's called the Deming cycle, plan-do-study-act, or plan-do-check-act. The idea, of course, is to identify and define as we go through. Here's the planning piece: what's the cost and impact, and is it worthwhile? Here's the deployment piece: let's do some data profiling and see what's actually going on out there; let's get fact-based information, because some of this is based on fact and some of it is not (I'll show you an example in a little bit). Then monitor what's actually happening: I made a change, did it improve things? And then act to address any specific issues that come out of that. You can read much more on Deming, no problem; it's all very compatible, and we absolutely embrace that particular view. The question, of course, is: will all your data be perfect? No. You want to apply Pareto analysis and find the 20 percent of the data that covers 80 percent of your problems, because not all data is of equal importance; in fact, 80 percent of your data is redundant, obsolete, or trivial, so why would you spend any time on it, assuming you can actually identify the Pareto subset? Our cycle, then, is also related to the various data quality causes and dimensions.
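A minimal Pareto sketch of the triage just described, with a hypothetical issue log standing in for real problem reports:

```python
# Rank data issues by how many problem reports they account for, then
# keep the small set of causes covering ~80% of the problems.
from collections import Counter

issue_log = ["bad address", "bad address", "dup customer", "bad address",
             "bad sku", "dup customer", "bad address", "missing dob",
             "bad address", "dup customer"]      # hypothetical reports

counts = Counter(issue_log).most_common()
total = sum(n for _, n in counts)
running, pareto_subset = 0, []
for issue, n in counts:
    running += n
    pareto_subset.append(issue)
    if running / total >= 0.8:   # stop once 80% of problems are covered
        break
print(pareto_subset)             # the issues worth the engineering effort
```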
Causes and dimensions are two very distinct things, and data quality activities depend on both. The first are practice-oriented activities: activities where somebody affects something that is happening in an operational context; I'll show you a very specific example of that. There are also structure-oriented activities, however, which require somebody to set something up right for success. Again, if I buy an electric car, and there are no electric car recharging stations around the places I'm anticipating driving, that's a structure issue: it doesn't matter what I do from a practice perspective, if there are no charging stations I have a different set of problems. These are different, and they both come together to produce quality data. The practice-oriented problems stem from a failure to rigorously apply different types of techniques; for example, you might ask somebody to check things on input, as opposed to waiting until they're processed. Let me give you a specific example. This is a hospital I worked with, where the director actually asked me to come by and attend a meeting, and in this meeting he announced to the hundreds of staff physicians in the hospital: we're going to do a lot more knee surgery here, and this is great, because knee surgery is what you guys do; I've ordered a bunch of equipment, gotten the grant, and we're going to build a new wing on the hospital; it's going to be really exciting around here. I looked out in the audience and saw a couple of people laughing, so I waited until the director was gone and asked them, hey, what are you laughing about? And they said: oh, the hospital director doesn't know that knee surgery is the default hospital admission code. We aren't doing nearly the amount of knee surgery admissions the director thinks we are. Here's a case where the individual thought he was working with good data and instead was working with incorrect data, and luckily we were able to head that off and reverse it. Now for the other part, structure-oriented activities. This is where the data hasn't been arranged properly, and if the data is not arranged properly, it's going to be very difficult to make your systems work, no matter what you do from a strictly practice-oriented perspective. The example I'm going to give you here is from New York City. Quickly: New York City has two and a half million trees, and in the eleven-month period between 2009 and 2010, four people were injured or killed by falling tree limbs in Central Park alone. We have trees and wind; it's a bad combination. The arborists in New York City believed that pruning and maintaining the trees could make the trees healthier, more likely to withstand storms, and decrease property damage, which would result in overall savings to New York City taxpayers. But they had no data to back it up. So they looked, and what they found was that they actually had a problem of the data being structured wrong. They were able to go back and look at pruning the trees, but the data recording those pruned trees was kept block by block, while the cleanup of fallen trees was recorded at the address level, and of course trees don't come with unique identifiers, serial numbers, tree IDs, things like that. That required them to download, cleanse, merge, analyze, and do all sorts of intensive modeling.
But they did finally discover that pruning trees generally seemed to reduce, by a significant percentage, the number of times the department had to send a crew to do emergency cleanup. Now, the best result of most of these analyses is another question: New York can't prune every block every year, so they have to look around and determine whether they can profile at the block level to prioritize preventive maintenance. Again, I've given you the reference for the article on the slide. But if the data is structured incorrectly, you can't get the information out of it, and that is a data quality problem. So these two dimensions, practice-related and structure-related, practice on top and structure on the bottom, are decomposed further into some characteristics, and I'm going to show you what those characteristics look like: value, representation, model, and architecture. I know I'm running through these slides faster, but most people tell me they take this and watch it a second time, so I'll let you read those slides when you come back to them. Let's take a look at how these things actually come about. Remember: architecture quality and model quality are related to structure issues, while value quality and representation quality are related to practice-oriented issues. The idea here is that when we look from left to right across the bottom of this chart, the left-hand side of the chart is closer to the user and the right-hand side is closer to the architect, which means things that are closer to the user are more likely to be noticed, whereas things closer to the architect are less likely to be noticed or incorporated into discussions. Here is the full set of data quality attributes from a representation perspective; you can see they are listed, closer to the user or closer to the architect, and I'm not going to read them to you. But let's take another look, moving from right to left on your screen this time: one data architecture spawns multiple models, each model spawns multiple potential values, and each potential value could be represented in multiple ways. So attacking things at the foundational level, the structure level, is in many cases much more effective than trying to work from the left-hand side in toward the right-hand side, because there are so many more instances going from left to right than from right to left. We have to pay attention to this, and it's a little like being in a boat: if you look hard at the bottom right-hand corner you can see a little boat down there, and imagine telling people, from that boat I'd like you to impact Niagara Falls' water quality. Well, we could wait until the falls freeze over, but in general that's not going to happen, and we can't actually stop our business to do this. Not understanding these causes and dimensions leads organizations to look for overly simplistic solutions to their data quality problems. Let's look now at the data quality life cycle. Our data quality life cycle here is a very nice piece, but it also represents, generally, the immaturity of the profession, so I want to take a little time to walk through it. The original data quality life cycle model said that the data life cycle consists of three activities: acquiring data, storing it, and then using it. Now, this was a very reasonable piece, and notice the date on Tom Redman's article here: 1993. Tom, of course, has evolved since then.
If you haven't heard my friend and colleague Tom speak, he is a wonderful author with some terrific books out on the subject that I would absolutely recommend to you. But this does represent what we thought was real in 1993. Let's take a look at what it actually looks like. It turns out that if you're going to do a life cycle, you need to understand the difference between metadata and data. Metadata is data about data, and metadata is what is used to create the data structures that we then implement in our systems. I can't structure the metadata until I've first created it, and this relates to data architecture and data modeling activities. Most people don't think of data architecture and modeling as data quality activities, but it turns out they are of absolutely key importance, and let me give you a very specific example. One of the examples I use a lot is from when I was with the U.S. military in the late '80s and early '90s. We had 37 systems that paid people in the Defense Department. What made it difficult was that when the U.S. Department of Defense would ask those 37 systems, how many employees do you have, it would get back the question: what do you mean by an employee? Now that sounds like an impertinent question; it turns out not to be. It actually meant something, because about 30 percent of the DoD workforce in those days worked a second job within DoD, and people working that second job were counted differently by the various payroll systems, because no standard existed. The lack of a standard, of course, was a major problem: when people would ask how many employees do you have, and the answer was, well, it depends on what you mean by an employee, no management was occurring, because of this data quality problem; in fact, to be even more precise, this data quality and data governance problem. We should have used standard definitions of persons as we developed all of these systems, but of course hindsight is always perfect, and many of these systems were created by really wonderful professionals who were simply trying to pave the cow paths that existed in the organizations before them, doing their best with what they understood the problem to be, which was not solving all of DoD's payroll problems, but instead solving the payroll problem for, say, Randolph Air Force Base, to pick just one that we worked with over the years. So the architecture of this example is very interesting. A person can be multiple employees: that is the right business rule to implement for this system. However, most payroll systems assume that you, as an employee, have exactly one job. That structural difference causes us either to undercount or to overcount the total number of employees, depending on how we set it up, and a new system that we bring in to replace the old ones should clearly have the ability to associate a person with multiple jobs. If it doesn't, I risk undercounting or overcounting 30 percent of the force. We found examples where some workplaces would count such a person as one individual, others would count them as one and a half, and still others would count them as two individuals. Well, if I'm trying to ask how many people work for the Defense Department and I've implemented three different sets of business rules for counting people, I have a structural problem, and no amount of data quality work at the practice level will fix that particular problem.
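A minimal sketch of that structural point, with hypothetical records, showing why "how many employees?" has two answers when a person can hold multiple jobs:

```python
# Structural issue in miniature: count people, or count positions?
payroll_records = [                  # hypothetical payroll rows
    {"person_id": "P1", "job": "analyst"},
    {"person_id": "P1", "job": "instructor"},  # same person, second DoD job
    {"person_id": "P2", "job": "mechanic"},
]

positions = len(payroll_records)                         # 3: overcounts people
people = len({r["person_id"] for r in payroll_records})  # 2: distinct persons
print(positions, people)

# A system whose model assumes one job per employee cannot even represent
# P1 correctly; no value-level cleanup fixes that structural defect.
```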
We now consider it best practice, as you are looking to purchase new software, to look specifically at the incoming package and ask for a logical model of the data processed by the system, so that you can do what we did at the Defense Department: check whether the new system will support the business practice of a person having multiple jobs natively, out of the box, without any modification. Because if we modify it, we're going to have to re-modify it when the next version comes out, and re-modify it again, and if you've worked with plugins in Word and Excel and things like that, you know what sort of mess that can create. Long story short: metadata creation is part of the structuring activities, and this is absolutely important to understand in the life cycle of data. Only once I have metadata that has been created and structured should I look to populate those structures by creating data within specific storage locations; it could be an S3 bucket on Amazon or an old relational database that you're putting together. Once we have created the data, we store it. For those of you who are young, that thing in the middle there, which looks like a birthday cake with rings around it, used to be our representation of a spinning platter, which is how we stored information on spinning disks; of course, disks don't rotate anymore, and flash drives have huge, huge amounts of data capacity. The data, once it's stored, can then be utilized; there's not much point in storing it if we're not going to use it. And once we've utilized it, we may manipulate that data, which means we may add more things to it or change different aspects of it; all sorts of things can happen in that process, and that's just good, normal processing of data. Sometimes that data is stored back into the storage facilities we have; no problem there. In our utilization, however, we sometimes do formal assessments, and hopefully most of the time we at least do reasonableness checks on the data values that are coming up. This is one of the things I stress with the young people coming up through our various data science programs: they will look at data and say, okay, I've got the data, now what does it mean? Well, if you have a field labeled gender, one of the things it might be a really good idea to do is ask: how many different genders do I have? Now, in the old days we thought gender was pretty straightforward; you had one of two values, male or female. Nowadays, of course, we know the question has a few more options than that, and while you may not have lots of them, Facebook actually defines 63 different gender categories. Those are going to bring their own types of problems, but let's just go back to our overall analysis: are generally half the folks showing male characteristics and half showing female characteristics, and is that the distribution your population should in fact have? If you're at an all-boys school, you're probably not going to have many females; if, on the other hand, you're looking at a standard public school classroom, you probably should see about 50-50, although the ratio of females to males in this country is actually about 52 to 48 percent; ladies, keep going, you will eventually win out by sheer numbers.
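Here is a minimal sketch of that kind of reasonableness check on a value distribution; the sample column and the acceptance band are hypothetical:

```python
# Profile the value distribution of a field and compare it to what the
# population should plausibly show.
from collections import Counter

gender = ["F", "F", "M", "F", "F", "F", "X", "F", "M", "F"]  # sample column

counts = Counter(gender)
n = len(gender)
for value, c in counts.most_common():
    print(f"{value}: {c / n:.0%}")

# For a general-population sample, roughly 52/48 F/M is expected; a 70/20
# split (or one giant catch-all value) is a signal to ask why.
share_f = counts["F"] / n
if not 0.40 <= share_f <= 0.60:
    print("distribution looks off -- investigate before using this data")
```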
So the assessment process asks: what does the data look like? Am I getting numbers that look right? If I don't have about a 50-50 distribution, and I should, one might ask why. It turns out that on one of the projects we worked on, a really fun project, the work fell under Canadian law, because it was a Canadian company subject to the Canadian social security legislation. That legislation actually required nine gender codes to be kept track of; it would be against the law not to maintain them. And yet there had been a contractor who had written some data scripts that said: if M, then male in the target system; else, F. And of course, else F took everybody who was not male and put them in female, which, in today's environment, would not necessarily be the best way to represent that information, and might in fact actually violate the law under certain circumstances. So this assessment process is something that occurs, sometimes as part of utilization and sometimes not, and it may still result in additional data manipulation, somebody making corrections or refining the data, where we actually correct the value defects that occur. We're missing one other piece of this puzzle, as you can tell, and that is that we may discover, as we did at the Defense Department, that the metadata needs to be refined; only when the metadata is refined can I restructure the model and get back to where we need to be in terms of actually running the data quality life cycle. Now, this is a lot more complicated than the model I had on the previous page, and it is important to understand, because the previous model, which showed acquisition, storage, and use, is a nice simple story to tell; but the reality is that this is the space we need to be working in, and if we're not working in this space, it is going to be much more difficult to actually achieve the data quality results we need. We may refine the architecture, change the model; all of these are possibilities we can work on. It gets a little more complicated still if we add additional components, which I'm not going to do here in the interest of time. And somebody said, well, I don't care, you can't show me a cycle that looks like a square; you have to turn it into a circle. So here I did. It's the same diagram, where the upper left-hand corner is the starting point for new system development, which still happens occasionally in the world, but most of the time you start with an existing system. You can see it has the same information, showing that the data quality cycle runs, counter to what most people think, counterclockwise, with all of these activities involved. So I hope you see it's a little more complicated than most people make it out to be. They'd like to tell you a simple story and make it look easy, but this is precisely why tools, in general, are not the place to look when you're diagnosing and trying to attack your data quality engineering problems. So let's spend a little more time here talking about specific tool sets. There is a wonderful set of tools out there. The first is what we call data discovery tools, and data assessment can be done using two different approaches: bottom-up and top-down.
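Before we get into the tools, the gender-code defect just described is worth seeing in miniature. A sketch of the else-bucket mistake and a safer mapping; the codes shown are hypothetical placeholders for the nine statutory ones:

```python
def buggy(code):
    # The contractor's logic: everyone who isn't M silently becomes female.
    return "male" if code == "M" else "female"

LEGAL_CODES = {"M": "male", "F": "female", "X": "non-binary"}  # ...and so on

def safe_map(code: str) -> str:
    try:
        return LEGAL_CODES[code]
    except KeyError:
        # Surface unknowns instead of folding them into a catch-all bucket.
        raise ValueError(f"unmapped gender code: {code!r}")

print(buggy("X"))      # 'female' -- silently wrong, possibly unlawful
print(safe_map("F"))   # 'female' -- explicit and auditable
```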
Bottom-up means we're going to inspect and evaluate the data set itself, highlighting potential issues based on the results of automated processes. Once again, I was very fortunate to be working for the Defense Department when they decided to attack data quality in a major way, put some research money behind it, and funded specific research to answer the question: how much of data quality can be automated? It turns out the answer is quite a bit, for certain types of problems. This automation, which I'll describe a little further on, can be employed in bottom-up assessment, meaning we're looking at the data in the system and trying to figure out whether it is fit for purpose, or how we can make it more fit for purpose; we'll come back to that in a bit. Top-down, then, is when the business users are simply looking and saying our data is of insufficient quality, and we need to put in place an organization-wide initiative that lets people see how they interact with the data and which data elements are critical to supporting the basic business applications. Another way to think of this is proactive versus reactive. The proactive approach says: we know our data is not very good, and we're going to make an effort to do something about it. One of the things I tell groups I work with is that probably nobody at the Social Security office has a lot of time to sit down and look at my Social Security data, even though I'm approaching 60 years of age and getting closer to retirement; they just don't have the capacity. What they do instead is send me my own information: if I look at my information and it doesn't look right, Social Security is depending on me to wave my hand and say, hello, there may be an issue out there, could you look at it? That's one of the reasons the government reaches out to us periodically to do this kind of assessment. So that's a proactive approach; it's not terribly proactive, because Social Security doesn't have a lot of extra budget to sit down and have individual conversations with us about what's going on. Now let's talk about measures of data quality. The idea here is that in order to properly measure the way data quality impacts things, it's important to select a specific outcome that was not ideal. I used an example yesterday of the Tacoma Narrows Bridge, which lasted only from about July to November of that year and then fell into the Narrows. That's not a good outcome; there's certainly a critical business impact. What would you evaluate? You look at the specific dependent data elements that created the artifacts involved. Looking specifically at the bridge falling, I ask: what went wrong when we built that bridge, what did we do that was incorrect? The answer, in this case, was that they had not fully understood the impact of wind and harmonic vibration upon the structure. Notice also that the best tool for this is something we call a CRUD matrix, which says: this business process has these inputs and these outputs, and therefore we can see which data elements belong to which business processes. You may want to associate specific requirements for the data that happen to be known at the time; certainly, as you understand them, put down what you feel are notional requirements. Then specify the appropriate dimension of data quality, which I went over earlier, and then say what business rules we should use to determine conformance.
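A CRUD matrix can be as simple as a table of business processes against data elements; here is a minimal sketch with hypothetical processes and elements:

```python
# Processes x data elements, with C/R/U/D marking who creates, reads,
# updates, or deletes each element.
crud = {
    "admit patient":    {"admission_code": "C", "patient_name": "C"},
    "schedule surgery": {"admission_code": "R", "surgeon_id": "C"},
    "billing":          {"admission_code": "R", "patient_name": "R"},
}

# Given a suspect element, find every process that creates or updates it:
# the places where a defect like a bad default value can be introduced.
element = "admission_code"
writers = [p for p, ops in crud.items() if ops.get(element) in ("C", "U")]
print(f"{element} is written by: {writers}")
```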
Again, if the gender distribution of a population sample should be about 50 percent male and 50 percent female, then let's check what's happening when that's not the case. One of the fun things we do with sample data sets, for example, is hand them to groups and say, tell us about this, and they'll come back with a really interesting finding: all the customer names start with A. And then we say, well, did you consider that maybe you got a sample of a data set that was sorted alphabetically? So we're looking for measurements of conformance and determining acceptable thresholds within them. As we evaluate these various service levels, we can then come back and say which elements are crucial to the business, which elements are less crucial right now but might be crucial later on, and determine what we call service level agreements. It's very similar to air conditioning: you can pay a lot of money to make sure your air conditioning is running at every point in the summertime, and if it's not running, somebody will come out and fix it right away; that's a premium service. The next lower tier may be: everybody's busy, but we'll get to you on the weekend, after we've taken care of all our premium customers, so you'll have a couple of days that are hotter. These are different ways of approaching the problem, but the service level agreements also determine which part of your organization, or partner organization, is responsible for the data quality. That's another very important aspect of this: many organizations forget to put these in place, or do a not-very-good job of it, and when they do a not-very-good job, they end up with not-very-good results. Again, in measuring and managing all of these processes, you're trying to get to at least three levels of granularity. You may do this in-stream, where you're sampling things as they come by, or you may take them in batch and do totals from top to bottom; all of these are valid approaches, depending on how you've defined the problem. The measurements apply at three specific levels of granularity: the data set itself, the record level, and the data element value level, which would be an attribute within a particular instance or record. When we look at the overall data quality tool set we have here, there are generally four categories of activities: analysis of the data, cleansing of the data, enhancement of the data, and monitoring of the data. We'd like to have all four. The principal tools are profiling, parsing and standardization, transformation tools, identity resolution and matching, enhancement, and reporting, and we're going to spend a little time working through each of these. Profiling is a great place to start. I mentioned before that we funded some research at the Department of Defense; they came up with algorithms that are very similar in nature to the type you use when you're normalizing data. There is a way of looking at any set of data, applying these rules, and deriving a logical, third normal form set of data constructs. Around these algorithms sit statistical analyses and assessments of the values within the data set, which also explore the relationships that exist between the value collections. We're not going to spend a lot of time on it; I could do an entire one-hour seminar, and if you're interested in having more information on that, we'll be glad to include it in our series of webinars. Sometimes people just want to get down into the weeds and get that dirty, to learn how this works.
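To give a flavor, here is a toy bottom-up profiling pass with made-up rows; the dependency test is the kind of normalization-style inference these algorithms build on:

```python
# Toy profiling pass: distinct counts, null rates, and a simple
# functional-dependency test (does column a determine column b?).
rows = [
    {"sku": "A1", "plant": "Richmond", "region": "East"},
    {"sku": "A2", "plant": "Richmond", "region": "East"},
    {"sku": "A3", "plant": "Denver",   "region": None},
]

for col in rows[0]:
    values = [r[col] for r in rows]
    nulls = sum(v is None for v in values)
    print(f"{col}: {len(set(values))} distinct, {nulls}/{len(rows)} null")

def determines(a: str, b: str) -> bool:
    """True if each value of column a maps to exactly one value of b."""
    seen = {}
    for r in rows:
        if r[a] in seen and seen[r[a]] != r[b]:
            return False
        seen[r[a]] = r[b]
    return True

print("plant -> region:", determines("plant", "region"))  # a 3NF clue
```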
The quality work is guided by these prescriptive rules, and there are now service level agreements that require people to let you know when certain types of things are discovered out there. So this is a proactive approach; it can also be applied reactively. But if you're looking prospectively, remember that your data is 80 percent ROT: if we get rid of that 80 percent, we have 20 percent left over, and if we profile all of that 20 percent, we'll have a really darn good understanding of what's happening in there. We could, for example, notify the help desk that valid changes are about to cause an avalanche of skeptical user calls, because all of a sudden everybody has had their Facebook password changed; just think of an incident that was in the news in the last couple of weeks. There are companies out there, one of the ones we work with is a company called Global IDs, that do the most extensive scanning you've ever seen in your life. This is for industrial-strength organizations. One organization, a big bank we've worked with recently, had literally 400,000 data sets out there but didn't know what they had: hundreds and hundreds of different major databases and warehouses and things like that. This is industrial-strength stuff; it can be applied in much smaller contexts, but when you're looking at numbers like the ones I just gave you, it provides a lot of information. It goes through and does the classification I described on the previous two slides, and then it adds statistical analysis and says: okay, we can also look at data that's over here, and statistically this data has the same set and ratio of values as the others, so they may be more related than originally thought. And finally, in Global IDs' world, you can actually add a layer of semantics on top of this that helps out. All of these are tools to help you understand the values you have in your data farm. The second set of tools is data parsing and standardization. If you want to take a look at a tool that does this kind of parsing and standardization, look at a tool called Trifacta, which came out of Stanford University; the website is trifacta.org. You can look there and see that you can make very easy changes to dates and streets and standardize different things, so that every time it finds an instance of something that is not perfect, it goes in and changes it to something more correct. We did an awful lot of this parsing and standardization in the example I gave you a few minutes ago with the two million SKUs. Number three is transformation, and ETL is the easiest way to think of it: the idea that stuff comes from this organization in this fashion, but we need it somewhere else in a different fashion. This transformation, extract, transform, and load, is what you use to fill up data warehouses, to move data from place A to place B, to do all kinds of high-speed data transfers. There's an awful lot of good metadata and data quality rules buried in these pieces, and it's a really good place, one most people forget, to go looking for business rules when you're trying to understand your data.
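A small parsing-and-standardization sketch in the spirit of those tools, normalizing date strings and common street-suffix variants; the formats and suffix list are assumptions for illustration:

```python
import re
from datetime import datetime

def standardize_date(raw: str) -> str:
    """Try a few common layouts and emit ISO 8601."""
    for fmt in ("%m/%d/%Y", "%d-%b-%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return raw  # leave unparseable values for human review

SUFFIXES = {r"\bSt\b\.?": "Street", r"\bAve\b\.?": "Avenue", r"\bRd\b\.?": "Road"}

def standardize_street(raw: str) -> str:
    for pattern, full in SUFFIXES.items():
        raw = re.sub(pattern, full, raw)
    return raw

print(standardize_date("7/4/2018"))         # 2018-07-04
print(standardize_street("1600 Main St."))  # 1600 Main Street
```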
Another key piece of this is being able to identify a unique instance; that's what is meant by identity resolution, though instance resolution is arguably a better term, because the idea is not only that we're trying to identify people, but that we can tie things together with these tools: they de-dupe the records and cluster them into groups. There are two different approaches to this. There's a deterministic approach, which relies on rules and patterns. One of the rules, for example, in a data quality exercise where you're trying to determine a person's identity, is: what is the sex they were born with, what is their birth date, and what is the ZIP code they were born in? And if that is close to the ZIP code they are in now, the probability of this being the person you think it is goes up by a fair amount, because most people in the United States don't move more than one ZIP code away from where they were born. That's a really interesting rule, and it means you can find things that way. The other approach is probabilistic, using statistical techniques; it doesn't rely on fixed rules, but it can be refined so that we provide more precise inputs as we go. An example of the probabilistic piece: tracking the two Russian men accused of poisoning that Russian spy and his daughter in the UK a couple of months back. Once investigators decided who the targets were, they went through their photo databases of everybody who had had a picture taken around the little town in the UK and applied matching to tie things together, and they were able to pretty much piece together those individuals' day. I remember specifically that one of the things the two said when they got home to Russia was that the weather had driven them away, and yet in all the pictures of them walking around that UK town, there was no snow. So: deterministic and probabilistic ways of doing identity matching, and identity doesn't have to mean a person; a thing or an event could be what you're trying to identify as well. Again, very robust sorts of tools. Fifth is enhancement of your existing data. Many times you'll get a piece of data in, and you'll be trying to do something else with it, so you combine it with other things. The biggest U.S. national security threat right now is the combination of data you get when you combine the Ashley Madison data breach, the Target data breach, and the OPM data breach; that again is an entire lecture of its own, but you get a sense of what can happen, and the bad guys are pretty good at this process. Finally, of course, there's reporting: showing this to people. This is the Social Security office sending me a report on a regular basis that says, here's what we think your retirement is going to look like going forward. So these are six different ways we can talk about data quality tools, and the selection of a tool depends on which portion of the life cycle and which causes and dimensions of the data problems you're looking at. Let's go through a couple of quick takeaways here, and then we'll move to the question and answer session. The idea, generally, is that we want you to manage data as a core organizational asset.
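A minimal deterministic-matching sketch built on that rule; the field names and score weights are my assumptions for illustration:

```python
# Deterministic identity resolution using sex at birth + birth date +
# birth ZIP, with partial credit for a nearby current ZIP (most people
# stay close to where they were born).
def match_score(a: dict, b: dict) -> float:
    score = 0.0
    if a["sex_at_birth"] == b["sex_at_birth"]:
        score += 0.3
    if a["birth_date"] == b["birth_date"]:
        score += 0.4
    if a["birth_zip"] == b["birth_zip"]:
        score += 0.3
    elif a["birth_zip"][:3] == b["current_zip"][:3]:  # same ZIP prefix area
        score += 0.15
    return score

rec1 = {"sex_at_birth": "M", "birth_date": "1961-05-02",
        "birth_zip": "23220", "current_zip": "23233"}
rec2 = {"sex_at_birth": "M", "birth_date": "1961-05-02",
        "birth_zip": "23233", "current_zip": "23233"}

print(round(match_score(rec1, rec2), 2))  # 0.85 -> likely the same person
```

A probabilistic matcher would learn weights like these from data rather than fixing them by rule, which is the distinction drawn above.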
Next is enhancement of your existing data. Many times you'll get a piece of data in and, to do something else with it, you'll need to combine it with other things. Again, arguably the biggest U.S. national security threat right now is the combination of data you get when you put together the Ashley Madison data breach, the Target data breach, and the OPM data breach; that's another entire lecture, but it gives you a sense of what can happen, and the bad guys are pretty good at this process. Finally, of course, there's reporting: showing this to people. The Social Security office sends me a statement on a regular basis that says, here's what we think your retirement is going to look like going forward. So those are six different ways we can talk about data quality tools, and the selection of a tool depends on which portion of the life cycle and which causes and dimensions of the data problems you're looking at.

Let's go through a couple of quick takeaways, and then we'll move to the question and answer session. Generally, we want you to manage data as a core organizational asset. Yes, we still have to convince some people that that is in fact true, but we would like to do it as best we can. Identify golden records, the best source for each of those data elements, and then, more importantly, when you find the good one, get rid of all the other copies so that you don't have to keep them, or at least mark them as redundant so you know you won't be fixing two piles of the same data; fixing both is a great way to keep people doing busy work, but it's absolutely unhelpful in the long run. Leverage your data governance groups for control and performance of data quality engineering. Use international standards wherever possible. Try to make sure that downstream consumers actually understand how the data is used; if the data entry clerks at the hospital had understood that their data was being used to drive buying, training, and building decisions, they probably would have been a little more careful, rather than recording so many knee surgery diagnoses when that was simply not the case. Find out what the business rules are in order to conform to your quality expectations. Make sure that the business owners agree to, and abide by, the data quality service level agreements entered into in the process. Make sure that you correct your data as close to the source as possible; lots of organizations like to say, we're going to capture the data once and we're only going to enter it right the first time, and I say good luck with that. I've seen a lot of people make progress toward it, but it's not realistic to expect you'll get there in the short term. And most importantly, report levels of data quality to the appropriate stewards, process owners, and service level agreement owners.

So that's our summary of the overall process we've looked at over the last hour, and hopefully you're a little bit smarter about the process of looking at data quality. It is not an easy task, and it does require qualified personnel; we'd like to help with that, as that's one of the missions we have here at DataVersity. So I'm going to stop at this point, turn it back over to Tendly, and we will dive into our questions and answers. By the way, I've included in all of this a series of additional reference slides for you to use, so that you can go a little deeper into these topics if you want. But now it's your turn; let's see what questions you have.

Great, Peter. Looks like our first question is from Michael, and he's wondering: how would you approach rolling out data quality and data governance? Would it be simultaneously, sequentially, proportionately, like a seventy-thirty percent split of effort versus resources, and so on?

That's a great question, and unfortunately the answer is: it depends. But here's what I have seen work successfully. Many organizations aren't appreciative of data as their sole non-depreciable, non-depletable, durable strategic asset, and given that lack of understanding, it actually is a really good idea to look at a combined effort: we're going to do data governance, and one of the things data governance will do is increase the quality of our data. That is very easy for people to see. I mean, can you imagine sitting down with an executive and going through all of the details I just gave you? They're not going to want to hear all that. They want to know why they invested in the wrong thing, why they pushed this button and that didn't happen the way it was supposed to.
There are all sorts of things and reasons that can go on, but when you tie governance to something very tangible in the organization, it becomes much easier for the organization to accept the very new occurrence of whatever it is we're doing, data governance in this case, and it's very clear to see that there are results. For example, we're working with a group right now that has two different catalogs they're trying to reconcile. Customers order a product from one catalog, then order from the other catalog thinking they're getting a different product, but actually it's the same product, because both catalogs show the same product with two different order numbers. In this case the quality result would be that you stop, or drastically reduce, the number of confused orders, and we're going to do that through a process that involves both governing the data and improving its quality. I don't know specifically about your situation, Michael, but if you have specific questions, feel free to reach out and we'll be glad to talk in more detail. I hope that gives you a little bit of context: it matters what your problem is, and that will guide the form of your solution. I think I said that earlier in the hour. Thanks for the question.

Great. Our next question is from David, and it's a two-part question. First, are there any preferred open source tools for measuring data quality? And second, are there any preferred data quality methodologies?

Good questions, both. Let me see if I can do the first one; I may get you to come back and repeat the second one, Tendly, if I get off track. The first one really talks specifically about data quality methods, right? That was the first one? Tools for measuring data quality. Tools, okay, sorry, tools first; I don't have the luxury of writing things down where I am here. So yes, there are a bunch of tools out there, and in fact you'll find probably more data quality tools than any other type of tool in the data space. That's both good and bad. Again, it leads managers to think that they can solve a data quality problem by buying a tool, and it's really a people, process, and technology solution that we need; a tool by itself will not help. We have a saying: a fool with a tool is still a fool. I don't mean to say that people are ignorant about how to use them, but most of these data quality tools do not address the full breadth of quality problems that I showed during the presentation.

A very simple tool is a spreadsheet. I'll go back to Tom Redman here; he has a couple of examples that I like, and I use them and hopefully attribute them to him in a way that's favorable. One of Tom's favorite things to do is a Friday afternoon session where you go in and say, let's just look at the data around our top 10 customers. So you pull data on what you think are your top 10 customers. Usually this results in a little bit of confusion, because somebody says, how can you say this company is not one of the top 10 customers? It's the boss's son; of course they should be a top 10 customer. And somebody else says, well, they may be the boss's son, and we may want to treat them as special because of the family relationship, but golly, they don't do very much business. So how do we measure our best customers? That leads to all sorts of questions and gets into another interesting aspect of data quality.
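A minimal sketch of that Friday-afternoon exercise, assuming pandas; the orders table is made up, and the point is just that two reasonable definitions of "best customer" can produce two different top lists, which is exactly the argument the exercise is designed to start.

```python
import pandas as pd

# Hypothetical order history; in the real exercise you'd pull this from your systems.
orders = pd.DataFrame({
    "customer": ["Acme", "Beta", "Acme", "BossCo", "Beta"],
    "revenue":  [900.0, 400.0, 650.0, 120.0, 700.0],
    "margin":   [90.0, 160.0, 60.0, 5.0, 210.0],
})

by_revenue = orders.groupby("customer")["revenue"].sum().nlargest(10)
by_margin  = orders.groupby("customer")["margin"].sum().nlargest(10)

print(by_revenue.index.tolist())  # ['Acme', 'Beta', 'BossCo']
print(by_margin.index.tolist())   # ['Beta', 'Acme', 'BossCo']: a different 'top' list
```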
How do we actually measure what a good customer is? For example, am I a good customer for MasterCard because I happen to use a MasterCard for a lot of my purchases? Maybe I pay them a hundred bucks in annual fee for the charge card because it also gives me status for boarding, but I never pay them any interest. Does my hundred bucks cover the cost of floating all that money they lend me as I rack up my bills, until I pay it all off at the end of the month? I don't know. So the spreadsheet can be a great place to start: you can just put data in the spreadsheet, take a look at it, and see what's going on. Very low tech, very approachable, and guaranteed, somebody in your organization somewhere has a spreadsheet program they can use to do this.

At the other end is something like Global IDs, a very comprehensive set of products. Again, I'm not making an ad for Global IDs; I've just got permission to use their slides, so they're easy to talk about, and there are several products that fall into that family. Do you really need something that comes in, scans, and finds all your hidden databases? If you're a medium-sized company, probably not; if you're a big bank that's growing through acquisitions, yeah, that might be a good approach to look at.

So is there a standard tool? Really, no. We have some organizations that do well with data quality tools, and they will have a different kind of conversation with you, because they're going to ask questions about what your needs are and try to match you with the right tool for the right need, whereas a tool vendor with a more simplistic approach is simply going to say, buy my tool and it will fix your data quality problems. The answer to that is probably true, but as you've seen, there are lots of different types of data quality problems, so the question is whether that tool will fix the kind of problem that is the pervasive problem you're dealing with at the moment.

Another approach is to look at the possibility of renting the tools. You may have a very vexing data structure problem that can be fixed, but you don't really need to buy a perpetual license for the tool forever and ever; you only need it for the time you're using it. There are ways of renting these data quality tools, mainly because the vendors are interested in income and would rather have you use the tool for a couple of months than not at all, so they're looking at new licensing models and things like that. There are companies like Informatica, again a very fine company, with lots and lots of offerings in the data quality space; they certainly have a range and variety of profiling and other types of tools that we've used very successfully over the years on a number of projects.

So the standard-tool answer is really no. If you go to the vendor expos, such as the one at the conference I'm at today, there are probably a dozen vendors doing data quality tools, and at larger events there will be even hundreds of them doing this kind of work. The question is what you need, and that's what your business people and your data people can best determine, and then select the appropriate tool for the appropriate task. So that was the tools piece; I hope I gave an answer and that it wasn't disappointing. It's a very big space, and there are a lot of tools you can choose from.
Oh, I'm sorry, one more tool to throw in here in addition to the spreadsheet: your next step up from there is going to be SQL. In fact, it turns out that when I'm teaching this to various groups, somebody in the group will very quickly figure out that just by running a set of SQL queries against the database, you can accomplish many of the things that data profiling tools accomplish. While that's interesting, it doesn't give you the retained corporate memory that those tools build up, but you certainly get immediate results in a very quick fashion. So again, I would do spreadsheets, then SQL-type activities, because you can do an awful lot with those, and from there move up into the more specialized categories. That's the tools piece.

The methods piece is different. If you look at the word methodology, it actually means the study of methods, so I don't like the word. But in terms of a standard way of approaching data quality diagnostics, I've looked for one as well, and it turns out there's no good book that's been written on it. It's really kind of disappointing, and I keep saying to myself, oh well, I guess I'll have to put that information into my next book, because there does not seem to be a good source on the topic. Again, what's out there is more like the thing I showed you earlier in the presentation: four steps to make your data sparkle; just do these four things and all your data will be fine. It's kind of like telling people, don't eat sugar and you'll never have any cavities. It's a nice thought, but when you understand that sugar is in your ketchup and many other things in our society, eliminating it from your diet is an extremely difficult thing to do. So it's not as easy as four little steps.

What I would say from a methods perspective is to look at two aspects. One: does it help you fix the data from a practice-oriented perspective? That's really what we're talking about here. Or two: does it apply data quality to data structure challenges? That's going to require a different set of tools, so that's your first bifurcation point. The structure tools you're probably only going to use one time, to fix the data quality challenge in that area. The practice-oriented tools you may use continuously; you may want something that's always watching all the traffic on the network to find out when things come in that say, hey, up is down, when you really don't want up to be down. I hope that answered your question. It's a bit of a difficult one, and the goal is just to be very wary of any approach claiming you can buy one thing and it'll fix all your data quality problems, because I hope this presentation has shown you that's simply an impossible task.
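Going back to the SQL point for a moment: a sketch of the kind of queries people discover, written here with Python's standard library sqlite3 so it runs anywhere; the patient table and the specific checks are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (id INTEGER, diagnosis TEXT, zip TEXT)")
conn.executemany("INSERT INTO patient VALUES (?, ?, ?)",
                 [(1, "knee surgery", "23220"),
                  (2, "knee surgery", None),
                  (3, "flu", "2322O")])  # note the letter O in the zip

# Null rates, distinct counts, top values, and pattern violations:
# the classic outputs of a profiling pass, as plain queries.
for sql in [
    "SELECT COUNT(*) - COUNT(zip) AS zip_nulls, "
    "COUNT(DISTINCT diagnosis) AS diagnosis_values FROM patient",
    "SELECT diagnosis, COUNT(*) AS freq FROM patient "
    "GROUP BY diagnosis ORDER BY freq DESC",
    "SELECT zip FROM patient WHERE zip GLOB '*[^0-9]*'",  # zips with non-digits
]:
    print(conn.execute(sql).fetchall())
```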
Our next question is more around the actual data: do you have any suggestions for correcting data at the source when the source is not under your organization's control?

The best mechanism for that is what we call the service level agreement, the SLA. Your organization and another organization are exchanging data for a purpose, and if that purpose isn't specified in the terms and conditions of a service level agreement, then you have the Wild West. I'll give you a very specific example. A magazine company we worked with had a model based on subscriptions; it's an old model, as you can tell. Let's just pretend it was Rolling Stone magazine. It wasn't, but Rolling Stone would make a nice example if it were actually them: people subscribing to the magazine get bombarded with coupons and offers, and Girl Scouts are giving it away, and all sorts of things. This magazine decided to outsource its subscription acquisition process, an important part of the business; it's the fuel that essentially made the magazine run. They outsourced it to another organization, but they were a relatively immature organization and didn't really have a good service level agreement in place. The agreement was that the vendor would send them X hundred subscriptions per month, and it didn't actually specify that the subscriptions had to be valid subscriptions. I know that sounds crazy, but yes, the vendor was sending lots and lots of rubbish over the line, and the magazine company was having a problem with its subscription base falling and falling. We asked to see the service level agreements, and that's of course when we found they didn't have one. So we said, well, we can correct this pretty easily: let's put in place a service level agreement that says you can only send subscriptions from people who have been verified; say, we're going to bounce them off of Equifax or TransUnion or something like that to make sure they're real people. I'm not suggesting, by the way, that that particular check is the one right way of doing this; I'm simply saying that by verifying these accounts we would get a much better quality of input. It turned out, when we measured, that they were actually sending over two records that were not useful for every record that was useful, and I don't know about you, but if I'm paying for something and only getting about 30 percent of what I pay for, I'm going to go back and renegotiate with that vendor. Again, great question; thank you for it.
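As an aside, a minimal sketch of enforcing that kind of SLA at the point of receipt; the verify() check is a hypothetical stand-in, since in the story it would be a bounce against a verification service such as Equifax or TransUnion.

```python
def verify(subscription: dict) -> bool:
    # Placeholder rule: a real implementation would call a verification service.
    return bool(subscription.get("name")) and "@" in subscription.get("email", "")

incoming = [  # invented vendor feed
    {"name": "A. Reader", "email": "a@example.com"},
    {"name": "", "email": "junk"},
    {"name": "B. Reader", "email": "not-an-email"},
]

valid = [s for s in incoming if verify(s)]
print(f"{len(valid)}/{len(incoming)} records usable "
      f"({len(valid)/len(incoming):.0%}); grounds to renegotiate if low")
```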
Okay, and in your experience, where do most organizations go wrong in terms of data quality?

Just by not understanding it fully. I'm working with several organizations now that have said, we're going to become a data-centric organization, we're going to have a hundred percent of our data all clean, and we're going to do this by buying this one tool. That's an oversimplification of both the process and the results, and then they expect things to actually get better. As you've seen from this presentation, data quality has a lot of nuances. For example, one of my favorite examples of data quality: I bought a microwave oven from General Electric and, being a male, I broke the tray on it right away, the glass tray that spins. Of course, if you're going to buy a microwave oven and you're reheating stuff, as most of us males do, you want that thing to spin, because heating food with the spinner makes it heat more evenly. Consequently, when I broke the tray, I thought, I'd better fix that. I went to look online and found that the cost of the replacement tray was actually more than the cost of buying a new oven entirely. A very frustrating exercise; in the end I went to the Goodwill store and was able to find exactly the model I needed to replace it. Is that a data quality problem? Well, perhaps. It's unlikely that the tray is actually worth more than the oven was; on the other hand, could it be a surreptitious attempt by General Electric's appliances division to force me to buy more ovens, because the stock price is influenced more by the number of ovens they sell than by the number of oven parts they sell? There are various aspects to get into there. The idea is: what is your organization doing, and what are you trying to do? You have to have good goals and objectives, otherwise you can end up spending way too much time and effort on things that are not really productive for the organization. I'm not sure if I wandered on that one, Tendly; did I do a good job with it?

I think it came across pretty well. Our audience is great with their feedback.

If I didn't do it right, please ping me and we'll try to do better.

Oh yes, we definitely welcome any clarification around that. So let's move on to our next one, a chat that came in from Alford asking: how do you select the proper data quality dimensions within banking, since using too many scorecards may dilute the purpose?

Great question, and it's not just scorecards; we're seeing dashboards as well. People say, we want some data quality dashboards. Well, if you're looking at a nice pretty display and it's all green, do you really need a thousand green buttons on it, or can you take a more simplified approach and say, things are trending well? It's the difference between precision and aggregation. Banking in particular is going to have interesting numbers. One of my favorite data quality exercises was with a midwest credit union where, believe it or not, two vice presidents would sit down and argue in front of the executive about whether sales were up or down. Now, if you've got two of your lieutenants, and one says sales are up and the other says sales are down, one thing you know as an executive is that they can't both be correct. To correct this problem we needed more clarification around the definition of a sale, the various responsibilities, the products, the impact of product mix, et cetera. And in the banking industry, we've worked with six of the ten largest banks in the world, and there are very specific measures, at least in the domestic U.S., where the government wants you to do very precise things; conforming to those is a really good place to start, because if you're not compliant with the government requirements, your stockholders are going to be less impressed. Sometimes it's just a case of trying to stay out of the papers, but more often it's the case that many organizations don't really understand that the proper approach to quality is to make it an integrated part of everybody's job, not to have just one group responsible for it, implying that everybody else can be lax in their approach to quality.

While we see dashboards as useful, it's kind of like people thinking they're going to solve their analytical problems by going out and buying really fine products like Tableau or QlikView or Amazon QuickSight and various pieces like that. All of these are fine, but you need more than just the tool; you need the people and the processes in place to make this stuff work all the way around. In the banking industry there are some specific user groups; we're associated with one called FIBO, a terrific organization that talks about maturity in banking practices and has some specific work around that. If you have trouble locating them on the web, if Google isn't working for you for some reason, shout and I'll point you in the right direction. They do participate in the Enterprise Data World conference, which is the one we're headed up to in Boston in the spring, mid-March, so that would be a good place to go learn more about these. I hope that helps.
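On the aggregation-over-precision point, one way to picture it: roll many per-dimension scores into a single trend line rather than a thousand green buttons. The dimension names here follow the usual completeness/accuracy/consistency vocabulary, and the numbers and threshold are invented for the sketch.

```python
# Hypothetical per-dimension data quality scores by month.
monthly_scores = {
    "2018-08": {"completeness": 0.97, "accuracy": 0.91, "consistency": 0.88},
    "2018-09": {"completeness": 0.96, "accuracy": 0.93, "consistency": 0.90},
    "2018-10": {"completeness": 0.98, "accuracy": 0.94, "consistency": 0.93},
}

for month, dims in sorted(monthly_scores.items()):
    overall = sum(dims.values()) / len(dims)  # simple unweighted roll-up
    status = "trending well" if overall > 0.92 else "watch"
    print(f"{month}: {overall:.1%}  ({status})")
```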
Looks like our last question for today comes in from Mike, and it states: it seems today there are too many quick fixes and approaches that take data sources and use them in analytics without first understanding the rules about the data, or the metadata. Can you weigh in on what your feelings are around that?

My experience is a hundred percent agreement. We see an awful lot of this, and again, the example I gave you with the knee surgery is a perfect case of garbage in, garbage out: an executive making what he believed were good data-driven decisions without realizing that the quality of his data was unknown. What we'd like to do is make sure the quality of the data is known; not that it's perfect, but that it's known. For example, if you had told that executive that the admissions data was generally about 30 percent accurate, that individual would have made much less of a commitment to knee surgery and would instead have said, okay, let's find out more about the data. It is not critical that all of your data be perfect. What is more important is to find your critical data elements and determine what level of quality they currently have, because if we don't have a good idea of what the data quality is, it is unknown. I can't tell you how many organizations I've walked into where, if you asked the question, the honest answer would be that the quality of all the data in the organization is unknown. Of course, nobody wants to hear that, but it describes a lot of the companies I've worked with. Let's just pretend I was working with an airline and the airline said, yeah, all of our data is of unknown quality. By the way, I've never worked with an airline that says that, but if you heard that kind of thing, you would really question whether you should be getting on that airplane. So yes, there is an extreme overreliance on the technology component of this, and that's a great place for everybody to stop today: let's not take a tool-specific focus. It's going to take a balanced approach of people, process, and technology, implemented in a way that makes sense for the organization, to make the data fit for purpose, so that your organization can then move on with what it's really trying to do, which is the business.

Where's our next thing? There we go, there's our next thing. Our next webinar is in November; you're going to do Data Architecture versus Data Modeling. Tendly, I look forward to joining you on November 12th.

That would be lovely; I can't wait. So right now it looks like we're through our question portion, and we'll be giving a few minutes back to everyone. Just a reminder that we will be posting the webinar recording and the slides to dataversity.net within two business days, as well as sending out a follow-up email with the links and any other information requested throughout the webinar. So thank you again, Peter, for a great presentation, and thank you all for attending today's webinar. I hope you have a great day.

Very good. Thank you so much, Tendly. Talk to everybody soon, and take care. Bye-bye.