Hello and welcome. My name is Shannon Kemp and I'm the Chief Digital Officer for DATAVERSITY. We want to thank you for joining the latest in the monthly webinar series, Data Architecture Strategies with Donna Burbank. Today, Donna will be joined by guest speaker Nigel Turner to discuss data quality best practices, sponsored today by Collibra. A couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. We will be collecting questions via the Q&A panel, or if you'd like to tweet, we encourage you to share highlights or questions on Twitter using the hashtag #DAStrategies. And if you'd like to chat with us or with each other, we certainly encourage you to do so. Just note that the chat defaults to sending to just the panelists, but you may absolutely change that to network with everyone. To open the chat and the Q&A panel, you'll find those icons in the bottom middle of your screen. And as always, we will send a follow-up email within two business days containing links to the slides and recording of the session and any additional information requested throughout the webinar. Now, let me turn it over to Eric for a brief word from our sponsor, Collibra. Eric, hello and welcome. Thank you. Thank you so much for that welcome, Shannon. Let me just go ahead and share my screen. So, in keeping with the focus on data quality, and representing Collibra as hopefully not only a sponsor of this webinar but also happy to join the Q&A afterwards, let me take a second. As a data quality principal for Collibra, I couldn't help but take a moment to show you a video on our data quality offering, which hopefully aligns with some of the topics and discussions you'll see today. So without further ado, thank you so much. I appreciate being able to share that with you. Like I said, hopefully some of those features and opportunities align with the discussion today.
Looking forward to that discussion. Shannon, back to you. Thank you so much for kicking us off, and thank you today to Collibra for sponsoring and helping to make these webinars happen. If you have questions for Eric or about Collibra, you may submit them in the Q&A panel. As mentioned, he'll be joining us in the Q&A portion of the webinar at the end. And now let me introduce the speaker of the monthly series, Donna Burbank. Donna is a recognized industry expert in information management with over 20 years of experience helping organizations enrich their business opportunities through their data and information. She is currently the Managing Director of Global Data Strategy Limited, where she assists organizations around the globe in driving value from their data. Joining Donna this month is special guest Nigel Turner. Nigel has worked in information management and related areas for over 25 years. His experience has embraced data governance, information strategy, data quality, master data management and business intelligence. He is currently the Principal Consultant for the EMEA region at Global Data Strategy Limited. And with that, let me give the floor to Donna and Nigel to begin their presentation. Hello and welcome. Thank you so much; it's always a pleasure to do these. And for those of you who are new to this series, I'll let it be known that this is a series. So, Nigel, if you could put it in full screen mode, that would be great. This is a series that we run monthly. DATAVERSITY is great about keeping all of the previous sessions on demand, I think in perpetuity. So if you missed any of the earlier sessions this year, you can go back to both the DATAVERSITY website as well as our Global Data Strategy site, where we post links to all of that. As Shannon mentioned, this month we're doing something a little bit different. We've done this before; I think this is the third year.
We've invited Nigel Turner, a special guest who has always been well received and is an expert on data quality. He's our Principal Consultant for Global Data Strategy over in the EMEA region. I saw a few folks in the chat typing "Welsh", so yes, he's from Cardiff, Wales. We wanted to delve in today on data quality, so for good or bad I'm going to be speaking a whole lot less than normal and letting Nigel really take the floor and share some of his expertise. As you may have seen in the bio, one of Nigel's great pieces of experience is that he did a lot of work at British Telecom, or BT, on data quality, with some really great success stories. So he's going to share some of his experiences from that and tie it to the idea of the data management lifecycle, because as we talk about in all of these sessions, none of the DAMA DMBOK data management practices really live in a vacuum. They're all interrelated, which is why this is a series. So we'll take a step back, look at that data management lifecycle, and really delve in with, as we always try to promise on this webinar, some practical tips on data quality and how it all ties in. Next slide. Again, as I mentioned, data quality is part of a wider strategy, and we will touch on some of these other areas as part of that data lifecycle. If you've joined our webinars, you've probably seen this framework; we always like to share it because if we work with a client and the client wants to build a warehouse, we can't even think about warehousing before we understand: is the data governed, is the data of high quality, is the right architecture in place for a warehouse, and so on. They do all tie together.
And while they all tie together, we like to focus on each one each month and then show the touch points to the different areas. So today's focus is on data quality, but you will certainly hear things about data governance, and obviously the data lifecycle, architecture planning, metadata management, and so on. There's a question we always see: will the deck and the slides be available? They will; there will be a link, and these sessions are all on demand, so you'll be able to see all this later. If we go to the next slide. The other thing we always like to do in this series is bring in data and show some survey results, and there'll be a few more slides like this. If this type of thing is of interest to you, a couple of notes. This survey is run each year; Global Data Strategy and DATAVERSITY partner together to do a trends in data management survey. I think the one for this year will be coming out in the next month or so, but this is from last year, and we did a whole session in January on these trends. So this is one of the slides from that; if you want to hear more, definitely catch that January session, because we go into a lot more detail there. What I like about this slide is that it ties nicely into the webinar: data quality has been, and I'd venture to say will continue to be, a priority for any data professional. You'll see from the last survey that it was number six in terms of the top priorities for folks doing anything with data. Business intelligence and data warehousing are a lot of the drivers for doing things with data, and you need data quality for that. When we asked people what is a future priority, data quality is still up there in the top, but you're seeing it move up, even up to number four. Part of that is that the one right above it is self-service reporting and analytics, when folks start to get their hands on the data and try to do more themselves.
Data quality then comes to top of mind. It's always been a frustration of mine, or a source of humor at this point: if you're on the tech side and you tell the business that the data quality is bad and we need to do something about it, it often falls on deaf ears, until those folks find it themselves in a report. Then they come to you and say, did you know the quality was bad? Tell me more. So the more people self-serve on data prep or data reporting, the more you'll see that. Data strategy, obviously, is about trying to understand where you go with data, and governance as well; I'm not surprised by any of those, as that's a big part of our practice, and it will definitely tie into what Nigel speaks about. So I thought I'd set the stage with that. No surprise: data quality continues to be important, it continues to tie in with these other disciplines, and it's only growing in importance. So with that, I'm going to pass it over to my colleague Nigel. Nigel, do you want to share some of your experiences? Thanks. Thanks, Donna, and thanks, Shannon, and good morning, good afternoon, good evening, depending on where and when you're listening to this webinar. I'm here in sunny Cardiff in Wales, as somebody's already noticed. The survey that Donna showed is, I think, good news for data quality, in that people are recognizing, very slowly but inexorably, that data quality is becoming a must-have for any organization that wants to be data-driven and digital. But there's another survey, done by James Kobielus fairly recently, only published a couple of months ago.
And it shows that despite the growing interest and focus on data quality, there are still a lot of barriers and challenges to achieving it. If you look at the table on the left-hand side, there are 17 of the most common, or rather the most demanding, challenges that organizations have to face. What I did was look at them in terms of people issues, process issues and technology issues. I think one of the things it shows is that roughly six of those are primarily people issues, for example getting the support of the senior executives in an organization for improving data quality. Five of them are process related; so, for example, introducing stewardship and curation within a governance framework is, I would say, both a people and a process thing. There are also technology issues as well, like, for example, scaling data quality to the entire enterprise. I think all that goes to prove is that nothing has changed in that respect: if you're trying to tackle data quality, the problems are holistic. They're about people, they're about processes, and they're about technology, and the only way you're going to solve them is holistically. The solutions or improvements that you put in place must embrace people, process and technology, and you've got to overcome some of those business and technical challenges that you see on the left-hand side. I think the other thing this demonstrates is that these challenges embrace all stages of a data lifecycle, from data creation through to data usage and publication. Putting policies in place is something you would expect to see before you even create a new data source, for example, and then self-service obviously comes towards the later end of the lifecycle, where the data has been ingested, processed and stored, and is then published out to users, who then start to do reports and analytics on that data.
So you can look at it being holistic in two ways. First, it's holistic in the people, process and technology sense, but it's also holistic in the sense that it embraces the whole of the lifecycle of a piece of data. Seeing that survey gave us the idea that maybe something I learned many, many years ago, when I started in data quality, is as relevant today as it ever was. I'm sure many of the more mature people in this audience will recognize this 1-10-100 principle. Other people may not be so familiar with it, so I thought it was worth touching on it and trying to show how it links to this concept of a data management lifecycle. Originally this type of thinking and the principle came from something called total quality management; here's a bit of a potted history of how data quality management came to be a discipline in its own right. TQM, as it was called, was really developed by people like Juran and Deming back in the 1950s and 1960s. One of the key principles of total quality management is that if you identify issues and problems in a manufacturing process, it is hugely cheaper to solve and correct those problems before the product itself is manufactured, and certainly before it hits the market. To take an example: you design a new washing machine, you put the spec together, and somebody spots a problem with the spec. You correct the spec, and the cost of doing that is fairly small. Let's say you don't spot the problem until you start manufacturing it in your factory. As the washing machine is being manufactured, people start to think, ah, the rubber ring that seals the door seems to be rotting very quickly when we test it. That means we didn't specify the rubber ring correctly, so we need to go back, resupply and reproduce the rubber rings, and insert them into the washing machines.
Now, as you can imagine, that's a more costly and difficult thing to do than if you'd spotted it at the beginning. Where it really gets costly is if you don't spot these problems until you start selling the washing machines to your customers, and then the customers start to complain that the rubber ring is rotting. If it's under warranty, you're then going to pay for it to be replaced under the warranty; and if it's not under warranty, your customers have to pay for it themselves. So those are the principles that apply to a product. And some of the early thinkers in data quality management thought, well, data in that sense can be seen as a product. So the idea of data as a product, which is quite a trendy term now, is not a new term at all; it's something that came from the early days of data quality management. People like Larry English and Tom Redman soon recognized that the same principle applies equally well to data as it does to a manufactured product. If you have quality problems with data and you can rectify those problems before the data is even created, by anticipating what the issues might be, you prevent problems. If you load that data into a data warehouse and then spot problems with it, you can correct it, but it will cost you a lot more and be more difficult to do. And if you've already passed that data out to consumers and users and they spot the problem, then it's going to cost you even more to fix, because the data has been distributed much more widely. Moreover, if people are working with poor quality data and making decisions on the basis of that data, not only is the data going to cost more to fix, but it's actually doing damage to the organization as well, because people are using bad data to make bad decisions. So the increase is exponential.
When I first got into data quality, I thought this seemed a little far-fetched, but then, as Donna mentioned, I spent a lot of my career working in a large telco in the UK called British Telecommunications, or BT. I was tasked with setting up an enterprise-wide data quality program, which, as you can imagine, was a bit of a daunting task in an organization as big as BT, which at the time had over 100,000 employees and operated in 50-plus countries. And then a bit of luck came one day. The lady you see on the top left of the diagram is actually my mum. My mum, for various reasons, needed to order a new broadband line from BT, who I worked for at the time. I encouraged her to do that and gave her the number to ring; she rang the BT call centre and gave them her details, and they said, no problem, we'll get it all sorted for you and put broadband in your house. Unfortunately, what happened was that the call centre agent couldn't hear my mum properly, or was careless with the data that my mum provided, and input a wrong physical address for where my mum lived. That was bad enough in itself, but what tends to happen from there is that if the address is wrong as the data is entered, that data then flows downstream to the various platforms and systems that BT used to actually fulfil that broadband order. First of all, at the top, there's the telephone exchange, and they might connect the wrong address at the telephone exchange. Then they have to connect to the cabinet by the road, to link the house to the cabinet, and if the address is wrong, they might link the wrong address there as well. And then it got even worse, because when my mum needed some equipment to be installed in the house, they would send a guy out in a BT van to do that. And guess what? If the address was wrong, that van driver, that engineer, couldn't find the address. And then, of course, other things start going wrong as well.
That meant, at the bottom there, that my mother wasn't included in the online directory enquiries service that BT provided, nor in the paper phone books that were produced in those days. It also meant she couldn't get billed properly, and BT couldn't market to her, because they were sending marketing material to the wrong address as well. The key thing is that all of those were customer touch points. So not only did the failure occur in the first place, in the broadband provision, but it then rippled through to everything else. And maybe worst of all, the database that contained those addresses was fed to the UK emergency services database (in the UK that's 999; it's 911 in America, 112 in Europe). That meant that if my mum, heaven forbid, had a fire and dialled the emergency services, there was a very good chance the fire engine would have gone to the wrong address as well. And that could have cost her her life. So we began to realise very quickly, when we started looking at DQ, that what may seem to be a fairly trivial data entry error can actually have catastrophic effects throughout the whole organisation and outside it. The conclusion from that, and it didn't take a genius to figure this out, I must admit, which is why I managed to figure it out, was that the way to stop these problems was clearly to get the data entry right. And we soon learned as well, because we ran some workshops on how we could fix this, that if you get that data entry correct, any cost incurred in improving the quality of data at entry, whether that means changing the business processes, providing things like drop-down menus, or having master data so that the address data is held in one place and provided to the call centre operative, is totally insignificant when you compare it to all the failure costs that happen downstream.
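The prevention idea described here, validating an address against a single master source at the point of entry rather than accepting free text, can be sketched roughly as follows. This is a minimal illustration, not BT's actual system; the master data, postcodes and function names are invented for the example:

```python
# Minimal sketch of validating an address at the point of entry
# against master address data, instead of accepting free text.
# The master list and field values below are invented examples.

MASTER_ADDRESSES = {
    "CF10 1AA": ["1 High Street, Cardiff", "2 High Street, Cardiff"],
    "CF10 1AB": ["10 Castle Road, Cardiff"],
}

def addresses_for_postcode(postcode: str) -> list:
    """Return the valid addresses for a postcode, so the call-centre
    screen can offer a drop-down menu instead of a free-text box."""
    return MASTER_ADDRESSES.get(postcode.strip().upper(), [])

def validate_entry(postcode: str, address: str) -> bool:
    """Accept the order only if the address exists in the master data."""
    return address in addresses_for_postcode(postcode)

# A mistyped address is rejected at entry (the "1" cost), rather than
# flowing downstream to the exchange, billing, marketing and the
# emergency services database (the "10" and "100" costs).
print(validate_entry("CF10 1AA", "1 High Street, Cardiff"))  # True
print(validate_entry("CF10 1AA", "1 High Stret, Cardiff"))   # False
```

The design point is simply that the check happens once, at creation time, before any downstream system ever sees the record.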
And of course, you can imagine that eventually that address had to be corrected for my mum, and I was involved personally in that. It meant the address had to be changed in all the systems, for all those functions that depended on it. So that was basically how I started life looking at data quality within BT. It was a very personal thing for me, but it also gave us a great starting point to begin to demonstrate that data quality was worth doing, because there were huge costs of failure if you didn't get it right. It became a very good case example to say: this is why we need to take action. And to give you an example of why it was so powerful (I hope you can indulge me on this, because it brought back a lot of memories preparing it), that address improvement initiative became the first major project that we tackled in what eventually became a 10-year program. During those 10 years in BT we ran over 75 data quality projects. Some of them were tactical data cleanses, because obviously a lot of this bad data was already held in systems, but we were also looking at how we could prevent problems, the "one" in the 1-10-100. So one of the key enterprise-wide projects that we did, which included addresses, was that we were one of the first companies, I think, to actually build a master data management application for customers. Obviously that included addresses as well. We were able, in other words, to stop addresses being incomplete or incorrect at source. Having run all these projects, we got a lot of benefits, for some of the reasons you see there, and we were able to validate those benefits; they eventually exceeded one billion dollars over the 10-year period.
That was brought to the attention of various people like Gartner, Tom Redman, and even Larry English, and they got interested in what we did and commented on it. I suppose the lesson in this, and I know it seems like ancient history, is that a lot of the problems that BT faced then, quite a few years ago, are still faced today by the companies that Donna and I and some of our colleagues talk to, and I'm sure you'd say the same thing as well. So clearly, it got us thinking in all our future projects about that 1-10-100 principle, and it became our mantra. When we identified a data quality issue, the first thing we would do was go back to the roots and look at the root causes of the problem, which were almost always to do with data entry or data ingestion, concentrate the fixes there, and then concentrate on preventing future problems. That was before we even thought about doing data cleansing of the bad data that had already got into the systems. So this gave us the idea of a data management lifecycle. Think of data, as I mentioned earlier, as being like a product. Pretty much most things have a natural lifecycle: the universe has a natural lifecycle, and living things such as plants and human beings also have a lifecycle. When I look at the human lifecycle, given my hair is the same colour as that lady's, I'm a bit worried about the next step of that lifecycle for me. In the same way, products, as I've already mentioned, have a lifespan, and data is no different, because data has to be created or generated from somewhere. It has a useful lifespan. And I think we all know that data tends to reduce in value as its currency declines: the older the data, usually, though not always, the less valuable it is.
And then it reaches a point where it can actually be discarded or disposed of, but again, not in all cases. There are good examples of very old policing data, for instance, being dug up by data scientists from maybe 50 years ago and used to do better prediction of crime using various AI and machine learning applications. So you can't say that old data is useless; a historian certainly wouldn't say that, and a data scientist might not say that either. So what is this idea of a data management lifecycle? It's well described by the UK government, who, the good news is, while they get lots of things wrong, are taking data quality seriously. They've built a national framework for data management within the UK government, and I quite like their description of what they mean by a data management lifecycle. It's a way of describing the stages data goes through, from designing and collecting it, to disseminating it to the users of that data, ultimately through to archiving it or deleting and destroying it. And I thought it's quite a useful perspective from which to look at data quality, because any organization, with all the data it has, will have data at various stages of this lifecycle. The important thing is that you have to think about data quality in terms of all the stages of that lifecycle, and very often organizations are not very good at doing that. Another implication is that how you manage data will vary according to which phase or stage of the lifecycle it's in. So the next question is: what are the stages of the lifecycle? Well, you'd be surprised, or probably not surprised, to know that there is no common, single, agreed picture of what a data lifecycle looks like.
What I did was go to the UK government framework, and they've done some really good work there, which you can access if you're interested through the UK gov website, but also to good old DMBOK, the DAMA Data Management Body of Knowledge. Both of them have lifecycles, but they're slightly different, so I took the best of both and tried to produce this lifecycle of my own. As you can see, I suggest that the data lifecycle typically has six stages. The first stage is one where you plan: in other words, you've identified that you need to generate or create some new data, so you need to plan for how you're going to do that. In the second stage, you actually begin to collect and ingest, or acquire and ingest, that data. Then you need to think about how you prepare, store and maintain it. Then you need to think about how you're going to process it and how people are going to use it. Then you share and publish that data. And the reason I've drawn this as a circular thing is that I think data is slightly different from other things, in the sense that data has a reuse value. You might decide at the end of stage five, yes, the data we are using is maybe beginning to become a bit obsolescent, so we have a choice. We can either archive or destroy it, or we may say, I tell you what, if we refresh and enhance that data in some way, then we may be able to do more useful things with it. So then you might go back into the plan stage and make a case to say, let's enhance and refresh the data, and the data management lifecycle might start again. In that sense, it's almost a bit like a phoenix: it might be on its last legs, and then you set a fire under it, and it can be very useful again. So those are what I see as the main stages of the data lifecycle.
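The circular, six-stage shape of this lifecycle can be sketched as a tiny data structure. This is only an illustration of the ordering and the "phoenix" loop back to planning; the stage names paraphrase the slide, and the code is not part of any framework:

```python
# Sketch of the six lifecycle stages and the circular "refresh" path.
# Stage names paraphrase the talk; this is illustrative only.
from enum import Enum
from typing import Optional

class Stage(Enum):
    PLAN = 1                     # plan for new data, business case
    COLLECT_INGEST = 2           # collect / acquire and ingest
    PREPARE_STORE_MAINTAIN = 3   # prepare, store and maintain
    PROCESS_USE = 4              # process and use
    SHARE_PUBLISH = 5            # share and publish
    ARCHIVE_OR_REFRESH = 6       # archive/destroy, or refresh and reuse

def next_stage(stage: Stage, refresh: bool = False) -> Optional[Stage]:
    """Advance through the lifecycle. At the final stage the data is
    either retired (None) or, if refreshed and enhanced, re-enters
    the cycle at the planning stage."""
    if stage is Stage.ARCHIVE_OR_REFRESH:
        return Stage.PLAN if refresh else None
    return Stage(stage.value + 1)
```

The `refresh` flag is what makes this a cycle rather than a straight line: without it the data simply reaches end of life.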
One of the things we were also thinking about when we looked at this is: what, then, is the difference between a data management lifecycle and what's known as data lineage? There are a lot of similarities between them, but I think they are two separate concepts in data and in the data quality field. The data management lifecycle and data lineage have some things in common, obviously: the idea of a sequence of processes and activities, which implies a certain chronology. You can't share and publish data until you've collected or acquired it, so collection and acquisition clearly come before sharing and publishing; there's clearly a time dimension. And both describe the way data can change and transform as it passes through that sequence. What you collect at the front end of the data management lifecycle might be transformed, adapted and enhanced, so that by stage four it might look quite different. Data lineage in that sense is similar, because as the data moves from its origin to its point of use, it will change as well, as it passes through various platforms, various ETL routines, and all the rest of it. But there are differences. I think a data lifecycle is much more about what the business does with data, because it's really about the usefulness of that data and the way that data is managed, right from its creation through to the point where the data may be disposed of or, as I said on the phoenix slide, refreshed. Data lineage, I think, is much more an IT responsibility, because you're describing the physical flow of data as it moves from where it originates through to the point where it's actually being used. So those things do relate to each other, but they're not synonyms. They are, I think, different concepts that come at the idea of data sequences from slightly different angles.
So the question is: what has this got to do with 1-10-100? Well, you've probably figured out by now that the two actually align quite well. What I've done here is simply put that circular data lifecycle into a linear format and then map the 1-10-100 principle onto it. Within the data lifecycle, the further down the lifecycle you get, the more it will cost you to fix any data quality issues. In other words, during the planning stage, if you can anticipate, predict and prevent data quality problems, it'll cost you a dollar to fix. If you spot problems as you are storing the data within a data warehouse or a data lake, you can fix them there, but that will involve cost, and that cost will be higher, because the reengineering is a lot more expensive than preventing the error in the first place. And then at share and publish, as Donna said earlier, it's usually the users of the data who spot the problems and report them back to the people responsible for managing that data, who then have to fix them once the data is out there in the public domain, where people are sharing and publishing it. That's going to cost an awful lot more to fix. Plus, to go back to what I said earlier, it's not just the cost of fixing the data: at the point where it's published and people are using it, people are making bad decisions, for example, because the data isn't fit for purpose. So you can see there exactly the same pattern. What that implies for data quality management is that the earlier in the data lifecycle you can start to address and fix problems, the less expensive it will be to fix them, exactly like the BT example.
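As a back-of-the-envelope illustration of the arithmetic behind this, the relative unit costs below follow the 1-10-100 principle from the talk, while the error volume is an invented number purely for the example:

```python
# Back-of-the-envelope 1-10-100 arithmetic. The 1/10/100 ratios come
# from the principle; the error volume is invented for illustration.
COST_PER_ERROR = {
    "prevent at planning/entry": 1,    # the "1"
    "correct once stored":       10,   # the "10"
    "fix after share/publish":   100,  # the "100"
}

errors = 1_000  # hypothetical number of bad records

for stage, unit_cost in COST_PER_ERROR.items():
    print(f"{stage}: ${errors * unit_cost:,}")
# prevent at planning/entry: $1,000
# correct once stored: $10,000
# fix after share/publish: $100,000
```

The exact multipliers matter less than the shape: the same defect gets an order of magnitude more expensive at each later stage of the lifecycle.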
If that problem had been fixed at the planning stage, when the data entry processes were being prepared for the contact centre, and people had started to think about how to ensure that we get addresses right, then a lot of the problems might have been prevented. In reality, what happened in BT was that by the time we fixed the problems, we were almost at the use-and-process and share-and-publish stages, which meant it was a lot more complicated and a lot more costly. So fixing data quality problems as early as possible makes perfect sense. What I'm going to do now is run through the stages quickly; I'm not going to go through them in detail, you'll be pleased to know, because of time constraints, but I thought it was worth covering each stage in a little more detail just to explain what its purpose is. I think the most important point of all is that the planning stage is, I believe, the most neglected stage of data quality management, but actually the most important, for the reasons I've just given. I think it's also a key reason why so many data quality problems persist. You can see the purpose of the stage: if you're going to create new data or acquire new data, that has a cost, and the first thing you would ask is, what's the business case for doing that? Who in the business needs this data, what are they going to use it for, and how is that data going to prove a positive return on investment if you invest in it? The purpose of this stage is also to create a plan: OK, we do need this data, the business case is proven, so what data do we need to collect, how are we going to collect it, and what's our plan for ingesting that data, storing it, and making it available? At this point, a lot of thinking needs to go on. The more thinking you do here, and the more preventative your approach is, the fewer problems you get later on in the data lifecycle.
At the bottom left, and I'm not going to go through all of these, you can see what some of the key activities would normally be at this stage of the life cycle. One of the things, for example, you would expect to do is check that that data doesn't already exist elsewhere. One of the key problems that many organizations have is uncontrolled and unknown data duplication. So somebody thinks, wouldn't it be a great idea if we collected some data on our customers that we don't currently collect? And they put this into place without a proper business case, without a proper plan, and they discover that another part of the business has been collecting that data for the last 10 years. They just didn't know about it. So it's very important to do things like that. You also then need to produce a requirements spec. And even at this point, before you collect any data, you need to think about the business rules and the data quality rules that you want to apply to that data. If you're collecting address data, what makes a good address? Can you enter an address and leave fields blank, for example? You need to decide and define that before you can really decide where you get the data from. And then of course there are lots of different ways you can get this data, because a lot of data these days, certainly in the UK, is what's called open data. The UK government departments provide an awful lot of data free, which you can download and ingest into your own systems. But you might also want to buy data in from people like Dun & Bradstreet, for example, or other organizations like that, and that needs to be part of your business case, because that costs money. And then you design your creation or ingestion processes. Importantly as well, it's very important to conduct an impact analysis: if you introduce this data into your organization, could it impact data processing that's already going on elsewhere?
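To make the "define your rules before you collect any data" idea concrete, here's a minimal Python sketch of pre-defined quality rules for address data. The field names and the UK-style postcode pattern are assumptions for illustration, not something prescribed in the webinar:

```python
import re

# Rules agreed up front, before any data is collected.
# Field names and the postcode pattern are illustrative assumptions.
REQUIRED_FIELDS = ["street", "city", "postcode"]
POSTCODE_PATTERN = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$", re.I)  # UK-style

def validate_address(record: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the record passes."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not record.get(field, "").strip():
            errors.append(f"missing required field: {field}")
    postcode = record.get("postcode", "")
    if postcode and not POSTCODE_PATTERN.match(postcode.strip()):
        errors.append(f"invalid postcode format: {postcode!r}")
    return errors
```

With rules like these written down at the planning stage, every later stage (entry screens, ETL, reporting) can enforce the same definition of "a good address".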
So as you can see, there are quite a lot of things to think about. And on the top right, you'll see some of the challenges that organizations hit when they do this. Very often they fail to do this very well, and one of the reasons is that nobody owns the data at this point. This is where governance comes in, and throughout the whole of the data management life cycle, governance is absolutely key. Because even before the point where you make a decision about whether to collect this data and process it or not: who owns this and who's accountable for it? That's where data governance is critical, and I'll come back to that later. And if you don't do the checks, you can be creating data duplication. How do you know the data is going to be fit for purpose? You might be told that this data is just what you need, but when you actually see it and process it, it's going to be a crock of, well, something unpleasant anyway, that isn't what you expect. And maybe this data doesn't have good metadata with it, so you don't really understand it in terms of its format or its content. A key thing as well is that if you don't involve all the people who might be using this data further downstream, what is the impact on them? It affects the whole company. You can see in the example I gave with my mum that whoever defined those data entry processes, I don't think they even thought about the impact on downstream data processing. So if you work in a call center and you don't get the address completely right, that's not a big issue in the call center, but it's a blooming big issue when you then send engineers out who can't find addresses, or you send a fire engine to the wrong address. So you can see these things are the challenges. So how do you overcome them?
And again, I'm just going to highlight a couple of things because of time. Appoint those data owners and data stewards right at the beginning. They should be the people that drive the business case, and they should also be the people that determine how, why and when this data should be acquired, created or ingested. It's their job as well to identify who the key data stakeholders are going to be, because we on our own cannot establish if this is fit for purpose; if other people are going to use it, they should have an input into what this data needs to look like. One of the things we would always say, and you saw it earlier on, is about the importance of data profiling. You need to profile whatever data sources you think you're going to acquire or get your data from, to make sure they are fit for purpose. And then, even at this point before you collect any data, consider what data validation methods and rules you're going to apply to that data. Are you able to use reference data? Are you able to create drop-down lists to make sure that people cannot enter data incorrectly? So, as I said, probably the most important stage of any data management life cycle, and I would bet you $100 now that this is the bit that many organizations tend to ignore, or they don't do well enough, which is why so many problems then occur downstream. Moving on quickly to some of the other areas, and I will be quite quick on this. Once you've decided this is new data that you want to collect, and I'm using the example of new data here because it's the simplest one, then you need to actually start to do some detailed design as to how you're going to collect this data, and then actually build the processes in order to make that happen.
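As a rough illustration of the data profiling Nigel describes, the sketch below computes a few basic profile statistics (completeness, distinct values, most common values) for one field of a candidate source. A real profiling tool does far more; the list-of-dicts record format here is an assumption for illustration:

```python
from collections import Counter

def profile(records: list[dict], field: str) -> dict:
    """Basic profile of one field: how complete, how varied, what dominates."""
    values = [r.get(field) for r in records]
    filled = [v for v in values if v not in (None, "")]
    return {
        "total": len(values),
        "completeness": len(filled) / len(values) if values else 0.0,
        "distinct": len(set(filled)),
        "top_values": Counter(filled).most_common(3),
    }
```

Even numbers this simple, run against a sample of a prospective source, can tell you whether the "pretty good" test extract is representative of the operational feed before you commit to buying or ingesting it.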
And the sort of things you'd expect to be doing here are: what are the data entry types? Are you going to get users to input it, a bit like the contact center in BT? Can you generate this data from devices or other applications or IoT, the Internet of Things, for example? You need to start to think about where this data is going to be stored and processed. Is it going to be an existing warehouse or data lake, or are you going to build something completely new? And then of course start to design your source-to-target mapping as well. And then in terms of data quality challenges, again, not dissimilar really. When you start to collect and ingest this data, very often you'll get errors in manual data entry. If people enter data, they will make mistakes, whatever you do. All you can try and do is minimize those mistakes. And also, of course, sometimes the data may be incomplete or inaccurate, which is why data profiling, which I mentioned earlier, is a really important thing to do. Because you may get a test data source from a provider, and it may look pretty good, and you may decide that's good enough, but then when you actually get the operational data coming in, it's riddled with problems, maybe incomplete or inaccurate. So you've got to keep on top of that. And in terms of good data quality things to do at this point, and this is why Donna said earlier on it's not just about data quality: we would always say this is the point where you create some data models, so you can actually understand the relationships between the entities that you're gathering information about, and the attributes as well. So doing things like a conceptual data model, or a logical data model at least, will then help you to get some definitions together and help you to develop some rules.
And you should start, as soon as you begin to collect this data, to populate your business glossary or your data catalog if you have one. Then, when you start to process the data, you can design the detailed DQ rules and enforce those on data entry. All these things are really important, I think, and you must monitor your sources regularly. It's no good just doing it at the beginning and assuming it's okay. You've got to set up something systematic and regular to make sure that the quality you expect or want is maintained. This is also a good point to start to create a data quality dashboard, for example, and think about how you're going to measure the quality of the data as it comes in. One thing that I think is pretty obvious already is that as you go through different phases of the life cycle, your measures will change, because your measures on collection and acquisition of data are going to be different from your measures when you publish or share data. So you need to think about how your data quality measures need to vary according to the stage of the lifecycle that the data is in. And then from there you go to preparing to store and maintain the data. This is where, excuse me, this is where you have to define your storage policies. Are you going to encrypt the data, because it might be personal? You need to think about the security of the data and the privacy policies, for example. And critically as well here, design your ingestion processes again in more detail, and then do that integration of data into existing data stores, or create new data stores. Lots of data quality challenges here. One of the things we find in many organizations is really the third challenge you see there: where you start actually doing the data transformation from source to target is very often the first time people think about data quality.
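One way to picture measures that vary by lifecycle stage is a simple mapping from stage to metrics feeding a dashboard. This is a hypothetical sketch: the stage names follow the lifecycle discussed here, but the metric names and the threshold are invented for illustration:

```python
# Stage-specific DQ measures feeding a dashboard (illustrative, not prescriptive).
STAGE_MEASURES = {
    "collect": ["completeness of mandatory fields", "valid format on entry"],
    "store":   ["duplicate rate after ingestion", "referential integrity"],
    "use":     ["accuracy vs. reference data", "timeliness of refresh"],
    "publish": ["report-level consistency", "consumer-reported error rate"],
}

def dashboard_row(stage: str, metric: str, score: float,
                  threshold: float = 0.95) -> dict:
    """One dashboard cell: a metric, its score, and a pass/fail flag."""
    return {"stage": stage, "metric": metric,
            "score": score, "ok": score >= threshold}
```

The point of keeping the stage explicit in each row is exactly Nigel's: a 95% completeness target that is fine at collection may be the wrong measure entirely once the data is being published.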
So: we've got to ingest this data into the warehouse, we've got to build some ETL routines and processes, let's think about data quality. Based on what I've said already, it's too late, because at this point you're at the 10, or even the 100, of the 1-10-100 rule. And therefore if you make changes now, they're going to be a lot more expensive than if you'd anticipated them in the first place. And some of the things you can see there: one thing we would always strongly recommend is to introduce something called a DQ rules engine, and we'll talk about that in a minute. The other key thing again is back to stewardship of data. This is the point where your data stewards should be really actively ensuring that the data you're preparing, storing and maintaining is ultimately going to be suitable for all the data consumers that need to access that data, not just for one or two, but for all the people that are going to become the end users of it. And that keeps data quality focused as well. And I know it's something that Donna holds dear to her heart: if you're going to invest in a data catalog, you need to think about how you automatically update that data catalog from your data processing activities, because if you're going to rely on people to update it manually, guess what happens, it soon gets out of date. So that's the prepare stage. Then you go on to the use and process stage, where you start to think about using the data, and the activities there are fairly regular things. I'm not going to talk much about this because I'm sure you're all very familiar with it. As the data is being used and processed, you need to actively maintain the data, make sure people are sticking to the policies you've created for that data, you're scheduling your data ingestion activities, and all the rest of it.
And it's at this stage, very often, when you start to test whether the data is fit for purpose. If you're not involving your ultimate data consumers at this point, then they can't really test whether this is going to be fit for purpose or not. And the other thing is, even when they do get to test it, and I've seen this happen as well, they report that there are going to be some problems if you publish this data, but then nobody's responsible for taking action. Which is again why data stewardship is so important, because it should be the appropriate data steward or stewards that are responsible for tackling those data quality issues and making sure they're fixed before the data gets used properly. Your steward needs to consider things like MDM. And I mentioned again the DQ rules engine; I've mentioned that a couple of times. If you're not familiar with what a DQ rules engine is, it's basically an application that sits in the middle of an enterprise, and what it does is contain your key data quality rules. The advantage of having a central rules engine, going back to the BT example, for instance, is that you can apply those rules on data input, and you can do that in real time. So that means that if somebody's typing in addresses, for example, the rules engine will say: that's not a valid address, you can't input that; use this drop-down list to put the zip code in, and you'll be provided with the correct address. You can then apply the rules as well to your source data once the data is in there. You can do it when you're developing your ETL. You can do it in batch validation through your warehouse and through your data marts, and also of course in your reporting layer. And the advantage of doing that, of course, is that it helps the data lineage that I talked about earlier.
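The central DQ rules engine idea can be sketched as a small registry of named rules that every touchpoint (data entry, ETL, batch validation, reporting) calls into. This is an illustrative toy, not any real product; the rule names and fields are invented:

```python
# Toy sketch of a central DQ rules engine: rules are registered once,
# and the same check() is invoked from data entry, ETL, batch validation,
# and the reporting layer. Rule names and fields are illustrative.
class RulesEngine:
    def __init__(self):
        self._rules = {}  # rule name -> predicate over a record

    def register(self, name, predicate):
        self._rules[name] = predicate

    def check(self, record: dict) -> list[str]:
        """Return the names of all rules the record violates."""
        return [name for name, rule in self._rules.items() if not rule(record)]

engine = RulesEngine()
engine.register("postcode_present", lambda r: bool(r.get("postcode")))
engine.register("age_in_range", lambda r: 0 <= r.get("age", -1) <= 130)
```

Because the rules live in one place, changing a rule changes the behavior at every touchpoint at once, which is the consistency and lineage benefit described above.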
The other thing, of course, is that if you do decide to change a data quality rule, that change should automatically be applied to all those applications, from data entry through to reporting. In other words, right across the data management life cycle. So a rules engine, if you're not familiar with them, is something worth thinking about. And this again is from that James Cobius survey, which confirms the one thing we are noticing: people are now using a lot more tools. And although the rules engine isn't mentioned there specifically, it's sort of implied in things like data integration and data engineering. It's implied in data cleansing, augmentation, and master data and reference data management, for example, which are all things that might access the rules engine. And in the headline findings you can see more tools being used. Just to reinforce what Donna said earlier as well: a big driver now for better data quality management isn't so much operational efficiency, though that's still important. It's really about analytics and intelligence. It's better decision making, making better use of the data to ensure the organization is making data-driven decisions and not relying on guesswork, and certainly not relying on poor quality data. The other thing I think this survey showed is that only 16% of the people surveyed, and there were over a thousand people surveyed, currently have an enterprise-wide data catalog. 45% of them intend to implement one within the next 12 months. So if you look at hot topics in data quality, data catalogs are definitely something to think about, because they are improving and they are becoming much more useful. Anyway, back to the slides. Oops, sorry. Then you've got the share and publish stage.
So this is the point where you make the data available to the users for their BI or analytics or data visualization activities, whatever it is. And there you can either, as you know, provide pre-canned reports to people, or, as Donna said earlier, enable self-service reporting if you choose to do that. The point where the users start to use that data is the point to start to think about how you generate and maintain the metadata that those users generate as well, and make sure that that is correctly analyzed and processed. And then you've got your continuing data quality challenges. One of the things I think is important at this stage, and it's something that a recent client we've worked with is very keen on: they do a load of pre-canned reports for people using Power BI, and one of the key things they've done is to publish a report catalog. So basically all the pre-canned reports are listed somewhere where people can look at them, so that people don't feel they need to create new reports; they may be able to use data from one or two existing reports to get what they need. Now that's not a data quality activity as such, but it is a good aid for data quality. You also want to give the users DQ dashboards so they can see how good the data is that they're relying on. And also, of course, if they spot errors, make sure the workflows and mechanisms to report them are in place. And again, I mentioned automated data catalog update processes, because as the metadata is generated, you want to be including some of that metadata in your data catalog, so you need to automate that process as well. Then you finally get to the final stage, and as I said, not all data necessarily goes down this route to destruction or archiving. There are a few key stage activities there. And why would you destroy or archive data? Well, as I said, it's usually of less use to an organization the older it is.
And so one reason for archiving or destroying data may be, well, it's just time-expired. It's too old now, it's no use to us anymore. What our customers did 24 years ago may not be of any relevance at all today, because things have moved on so much. But of course it may be down to legal or regulatory constraints as well. In the UK, for example, there are very strict rules about how long you must keep tax data for. And it could be a specific project: if you're putting data into a sandbox for analytics purposes, then at the end of that project the data is no longer required, and it's then archived or destroyed. A key thing, I think, in terms of the data quality focus at this stage, and it's a stage that people again tend to neglect, is that you need very good data retention policies. And you also need to ensure that archived data is actively stewarded just as much as it is in life, simply because at some future point that data might have value again. So it's important, and it may not be the same steward that stewards it when it's in life, but it's also important to steward it when it's archived. So those are the stages, basically. Just to summarize all this, what are some of the implications? I'm conscious of time. What I hope has come across is that looking at data quality from a life cycle perspective means the most critical stages of a data life cycle are at the beginning, stage one and stage two. And we are very keen on the concept of data quality by design. So in the early stages of your data life cycle, make sure you are designing that data to be of high quality. Don't wait until the problems hit you and then try to fix them.
And if you do that, hopefully it'll be fit for purpose throughout its life cycle, and you can take preventative actions. If you delay it to later, as I think I've hammered home by now, it's going to cost you a lot more to fix and it's going to give your organization a lot more problems. And I also said, just to reiterate, think about the different KPIs and measures you will need to manage data quality through each of those stages, because they won't be the same. So make sure your measures are also reflective, if you like, of the data management life cycle. And then of course I mentioned data governance quite a few times. Data governance, for us anyway, is absolutely a must-have if you want to get end-to-end data quality in place, because otherwise you're making lots of short-term fixes, but somebody needs to be owning the accountability for the data right throughout its life cycle, from the plan right the way through to the point where the data is archived or destroyed. And you may not have the same stewards throughout the whole process, but handover of accountability is absolutely critical as you go through the data life cycle. So to summarize, a couple of key points I hope we've got across in this webinar. I think we all know, nothing's changed: if you're going to fix data quality problems you need to be holistic. You need to think about people, process and technology to fix them. Technology can help you to fix problems, but buying technology on its own without doing the other stuff isn't going to work. The other thing is: think about data quality management from the very earliest stages of the data life cycle. If you only start thinking about it when you've already started to collect data and store it, it's too late. It'll cost you a lot to fix it and put it right.
And the other thing as well: obviously in an organization much of your data is at the more mature end of the data life cycle. So even when you hit problems with the data there, always think about the life cycle. If you get a problem with data in a data warehouse, don't just look at the data warehouse and see if you can fix it there. Go back to the root causes, look at where that data is coming from and how it is created or ingested, fix the problems there, and then those benefits will ripple through to the data warehouse and beyond. So always think about the early stages of the life cycle, even if the data is mature. I think that's really important. And with that, as I said, without data governance I think there's no chance of ever making this life cycle work. Okay, if you found this webinar on data quality of any use, you'd be pleased, or not pleased, to know that Donna and I have done two previous ones. The first looked at our methodology for tackling data quality problems. And I'd point you to the one we did last time, because I mentioned DQ rules and business rules quite a bit in this one: we did a webinar last year dedicated to looking at how you create, maintain and enforce business rules. So, at that point, Donna, I'll hand back to you. All right, thank you so much, and I hope you folks enjoyed something a little different this month, hearing from Nigel and his experience on data quality. And if you'd like to join us next month, we'll be talking about data modeling, which is always a very popular topic, and more specifically data modeling that's related to business and business users: your conceptual, logical, etc.
And if we go to the next slide, just a friendly reminder that we do this for a living; if you need any help, we're happy to help you with services. And without further ado, Shannon, let's open it up for some Q&A. Nigel, thank you so much, always a pleasure to have you with us. And if you have questions for Donna or Nigel, or if you have questions for Eric, feel free to submit them in the Q&A panel. And just to answer the most commonly asked question, a reminder that I will send a follow-up email by end of day Monday for this webinar with links to the slides and links to the recording, along with anything else requested throughout. So this came in early: can you add more detail about the difference between the top-down and bottom-up methods? I don't know, to be honest; I mean, I didn't mention top-down or bottom-up in those terms. I assume that relates to data quality? Yeah, I'll jump in. I do think it's definitely both. I would say, and it came up later in some of the chat, that top-down is really understanding what those business rules are, looking at it from a business perspective, and tying this into a data governance framework. The other question that came up in the chat was about data quality thresholds, what does good look like; you definitely need to look at that from a business perspective.
And also, as Nigel mentioned, prioritize what data is important; not everything needs to have high quality. Someone mentioned social media, and for that kind of analysis maybe the quality bar is lower than for somebody's medical record, right? So those are the top-down business drivers that we always start with. But then you want to do the bottom-up as well and do that data profiling, and that's where some of those tools come in, whether it's purely a data quality profiling tool or a master data tool. And you really need to balance both of them. I've seen both: the business thinks the data quality is terrible and it really isn't that bad, it just seemed bad because they ran into an issue; or vice versa, they think it's fine until they look at the numbers. And then also, not all data is equal, as I mentioned. We had one group that did some data profiling, and people were very upset that a number was empty 90% of the time, and they were going to do a remediation plan, and someone jumped in from the business and said, who uses the fax number anymore? Right, so that was a good balance of the bottom-up: you look at the data profiling, but then balance it with your steering committee. So definitely I would say both; you can't do either one in a vacuum. Do both top-down and bottom-up and then meet in the middle. Thanks, Donna. Sorry, my brain's in a fog; I've been speaking at a great rate for 48 minutes. If I understand, you mean in terms of data quality as well. Something I'll add to that is that when you're looking at data quality from an enterprise perspective, then yes, you need to have a top-down view of what your priorities are, as Donna said, and what's important, but you deliver the benefits bottom-up. So if you look at the example of what we did in BT: we developed an overall data quality strategy and a plan for the whole organization.
But then we put the focus on real projects at ground level that would contribute to that strategy. So we might have a strategy to improve the quality of our customer data, which we needed after my mum's example, and we looked at all the bottom-up projects that needed to be carried out in order to deliver that objective. So yes, I totally agree with Donna: it needs to be both top-down and bottom-up. Eric, anything you want to add? No, I mean, those were both good answers. I know there are a lot more questions, so, yeah, nothing to add. I'm going to try and get through as many as possible here, but I know we've got more questions than time. For statistical information with a time element, for example time series, would the data lifecycle benefit from a jump from step five back to step two or step three for older data? As you mentioned, old policy information may have a new life after an extended period of time. Yeah, I think I can answer that, Donna. I think that is very true. The policing example is a really good one, and I don't remember the details of it now, but I know the LAPD had records that went back 50 years or so, archived in dusty places, or some of them stored electronically. And they were basically a load of records of where previous crimes had occurred. As far as the LAPD were concerned, operationally that data was really of little use to them, but luckily they didn't destroy it. And then somebody came along from UCLA and said, can we have a look at some of this old data, because we've got this hypothesis that if we could do some data science and analytics on all these old cases, if we could record where crime occurred, when it occurred and what sort of crime it was, then we might be able to use that data to do some predictive analytics.
In other words, if we know that around about this time, on this street, on this corner, on this day, crimes tended to happen, then we can predict that maybe a crime might happen there next week. And what the LAPD did, with a little bit of cynicism, was first say, yeah, go and have a look, give it a go. So they tested it out, and actually it did improve crime resolution and crime prevention quite significantly. And amazingly, the policemen would sit around the corner saying, somebody's going to nick a bicycle here. Well, not always, but very often people did come along and nick bicycles. So there's a whole set of case studies on this if you're interested in it, and it's well worth a read, because it shows that data you maybe think is of no use anymore suddenly becomes extremely useful because you've found a new purpose for it. Fantastic. Anything else you want to add? We've got just a few seconds left, not a lot of time. No, I think you summed that up well. All right, well, thanks to all three of you for these great presentations. Thanks to Calibra for sponsoring today's webinar and helping to make these webinars happen; really appreciate it. And thanks to everybody again. I will send a follow-up email by end of day Monday with links to the slides and the recording from today's presentation. Thanks, everybody. I hope you all have a great day. Thanks, y'all. Thank you. Bye.