Hello, and welcome. My name is Shannon Kempe, and I'm the Chief Digital Manager for DATAVERSITY. We'd like to thank you for joining today's DATAVERSITY webinar, Data Quality, Data Engineering, and Data Science. Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. If you'd like to chat with us or with each other, we certainly encourage you to do so; just click the chat icon in the upper right-hand corner for that feature. For questions, we'll be collecting them via the Q&A section in the bottom right-hand corner of your screen, or via Twitter using hashtag #DATAVERSITY. To answer the most commonly asked questions: as always, we will send a follow-up email to all registrants within two business days, containing links to the slides, the recording, and anything else requested throughout the webinar. Now let me introduce today's speakers: Tom Redman, a.k.a. the Data Doc, at Data Quality Solutions, and Prashanth Southekal, the Managing Principal at DBP Institute. Tom helps companies, including many of the Fortune 100, improve data quality. Those that follow his innovative approaches enjoy the many benefits of far better data, including far lower cost. Tom's Data Driven: Profiting from Your Most Important Business Asset, published by Harvard Business Press in 2008, is the guiding light for companies seeking to build their future on data. Tom started his career at Bell Labs, where he led the Data Quality Lab. He has a Ph.D. in statistics and two patents. Prashanth is the Managing Principal of DBP Institute, a data monetization firm which helps businesses use data for insights, compliance, and customer service. He brings over 20 years of data and information management experience, consulting for and working at companies such as SAP AG, Shell, Apple, P&G, and General Electric in North America, Asia, and Europe.
He has presented scholarly works in journals and conferences, and Dr. Southekal has published three books, including the recent Data for Business Performance. So with that, let me turn it over to our speakers to get today's webinar started. Hello and welcome. Thank you very much, Shannon. That's a very kind introduction, and I'm very much looking forward to this webinar. The... I say that, and now I... Oh, here we go. We're at a really, really interesting time in the data space. I mean, let's face it: for most of us, being in the data business has been a tough career, but it's finally good to be a data person, right? There's a lot of good things going on in all areas. At the same time, progress in many respects seems a lot slower than we think it should be. We need to speed up, and one of the ways we need to speed up is to really challenge each other very, very hard, in particular on the interplay of what works in practice and the frameworks, and to be very, very broad: understanding where businesses are going, how we can help them go there faster, where we need to vector them, where we need to build more data in. And so I'm pleased to have been working with Prashanth for the last couple of years. We really got into this habit of challenging each other very, very hard. Some days I get off the phone with him and I think, oh my goodness, I've been too brutal, and I think every now and then he may feel the same way. So what we thought we'd do in this webinar today is look at this locus of data quality, data engineering, and data science. At the union of those things, we'll have an open-ended discussion and we'll try to push each other very hard. As we prepared, we kind of said, okay, Prashanth, let me ask you the five questions I would most like to ask you, and you ask me the five you'd most like to ask me. And so that was sort of the structure of this presentation.
And as we go along, we'll feel free to challenge each other right from the beginning. One quick shout-out, and that is... I'm sorry, the slides just went away. One quick shout-out: if you haven't seen this thing called the Leader's Data Manifesto, I just put the slide up for it. I think most people in the data community have seen this by now, but please, if you haven't done so, go to www.dataleaders.org and take a look. This manifesto is the result of a collaboration by the people whose names you see in the lower left of the slide, John Ladley, Danette McGilvray, Kelle O'Neal, Nina Evans, James Price, and myself, to really get at this question of why we are moving so slowly and what we need to do to speed up. It's less than a thousand words. It's the best summary I know of how anybody, no matter who you are, can contribute to things speeding up. So with that, let's charge off. Prashanth, I believe you have the first question. Yeah. Hi, Tom. Thanks very much for the good introduction, and same with you, Shannon; it was nice of you to say a few lines about me. And thank you, everybody, for attending this webinar. While we have something to share, we are also looking at your questions and thoughts so that we all learn from each other. So, Tom, the first question for you: you trained as a statistician, worked at Bell Labs, have worked for JPMorgan and other big companies, and you have worked on data quality for a very long time. Let me ask you a question that tries to merge these two things. How do the quality needs for data analytics differ from the more usual needs we see in day-to-day operations? Yes. So that's a good question. I put up the slide. I want to answer that question by first establishing a baseline. And the baseline is looking at, you know, data quality day in and day out. Unfortunately, the story is turning out to be far worse than, you know, we imagined it was even a couple of years ago.
I'd like people to concentrate on the first two bullet items. The best estimate is that 45% of newly created data records have a critical error. There's a particular way we measure this. I've done this study; I've done it with some people in Ireland. We teach a class there, and we've gotten a large group of people, many databases, many companies and so forth, to peer in on the stuff they're using today, the most important stuff, the newest stuff, and we're finding, you know, an incredible range, right, from a few percent up into the high 90s. But on average, 45% of the newly created data records have at least one critical error. And the cost is enormous. I mean, you can't do work with errors in the data. You have to fix it up before you can go on. And that's where these things called hidden data factories come in. IBM published a figure not long ago that said the cost to the U.S. economy of this was $3.1 trillion per year. That's, you know, on the order of 18% of the economy. The 20% of revenue figure that I've cited here is from an Experian study; I believe it was conducted last year. And there's some good work going on in Australia, led by Martin Spratt. You know, this number, about 20% of revenue as the cost of bad data, there's a lot of support for that now. And these two numbers go hand in hand. So the baseline is pretty darn bad. Now, in data science, we don't yet have as good a picture. First of all, I mean, the impact may be different. In data quality, in day-in and day-out operations, okay, you have one piece of data that's wrong, so it means you send the package to the wrong place or you don't order enough of the blue, size medium, V-neck sweaters, or that kind of thing. The errors are contained. In data science, it may be very different. First of all, you could have pairs of errors that, in fact, cancel each other out, and there's no cost. And unfortunately, we don't really know which situations those are.
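The simple measurement Tom describes, taking a sample of recently created records and flagging any that contain at least one critical error, can be sketched as follows. This is an illustrative sketch only; the record fields and the checks are invented, not part of the study he cites.

```python
# Sketch of a record-level error-rate measurement: flag each record
# that fails at least one critical check, then report the fraction.

def fraction_with_critical_error(records, critical_checks):
    """records: list of dicts; critical_checks: predicates returning
    True when a record passes that check."""
    flawed = 0
    for record in records:
        if any(not check(record) for check in critical_checks):
            flawed += 1
    return flawed / len(records)

# Two illustrative checks on hypothetical order records.
checks = [
    lambda r: bool(r.get("customer_id")),   # must name a customer
    lambda r: r.get("quantity", 0) > 0,     # quantity must be positive
]

sample = [
    {"customer_id": "C1", "quantity": 3},
    {"customer_id": "",   "quantity": 5},   # missing customer: critical error
    {"customer_id": "C2", "quantity": 0},   # bad quantity: critical error
    {"customer_id": "C3", "quantity": 1},
]

print(fraction_with_critical_error(sample, checks))  # 0.5
```

On a real database the checks would come from the people who use the data, since they know which fields are critical.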
Some of the time, the errors won't have that kind of impact. But at other times, the situation can be so much worse. In other words, bad data can lead to a bad decision or a bad prediction that impacts thousands of people. I think the financial crisis was very much rooted in bad data. And today, one of the things we're really excited about, and rightly so, is this notion of artificial intelligence. And so again, I mean, we really don't fully understand the impact, but at least potentially, bad data can lead to bad algorithms, and then those algorithms just cascade, and the damage is potentially unlimited. So those are the most important things. It's also true that we have other problems in data science. We've got different data sources with slightly different definitions that we need to align to do good data science. And we also have the problem of convincing decision makers that the data can be trusted. Nothing's worse than doing a great job on data quality, having trusted data, coming up with some great insight, and then the decision maker says, no, I don't trust it, we're not going to do it. So, I mean, I think people in the data community, if you're at all interested in data, you really have to refocus on data quality. The potential damage is just enormous, and for everything we hold good and dear in terms of going forward around data science, things could just get so much worse. So Prashanth, let me probe a little bit. I mean, I now want to ask you a question. And look, it's no secret you wrote this book, Data for Business Performance. And I have two questions. I mean, the first is, broadly, in your analysis, are you finding things that are consistent with what I said about data quality? And then maybe, you know, can you spend no more than 30 seconds or a minute telling us a little bit more about the book? Yeah, absolutely, Tom. So what I wrote in the book is very much in line with what you have said.
In this regard, there are three things which make this book different from other books or from what I have presented here. The first thing is the book is holistic. When people talk about data, it is mostly considered synonymous with BI or analytics. But I have gone a step further. I said data can serve three functions. One is, of course, decision-making through insights or analytics. It can also be used for compliance, for your laws and regulations, for your internal security policies and all those things. And finally, data is also used for operations, like customer service, whether it's internal customers or external customers. So when I say data for business performance, I look at those three main dimensions. And that's one reason why I believe it is holistic. The second point is this book is written mostly... I just want to get you to emphasize this, and this is something where we get lost. People use data every day, right? And sometimes, you know, we're thinking about the future and lose sight of, yeah, well, we've actually got to order the right number of sweaters. We've actually got to deliver packages to people, right? We've actually got to decide where we're going to drill the well. We've got to decide where we're going to put this feature into a software product. I mean, all those require enormous amounts of data. So excuse the interruption, but I'm correct there, right? Yeah, yeah, absolutely. So all those points which I've raised are more from a practical standpoint. And this is where this book is more for a practitioner. The statistics, you know, about the poor quality of data and all those things, that's true, that's very important. But this book talks more about prescriptions. You know, it's great to know that there is a cost of poor data quality.
But what can you do to manage it, to use that number, to transform your data into a business asset, those kinds of things. And finally, we have a lot of books on SAP, on Tableau and R, and so many other technologies. This book, while it looks at all these technologies as examples, is generally a technology-agnostic book. So whether you are working day-to-day in SAP BI or BusinessObjects, or whether you are building data visualization reports using Tableau, or whether you're writing R scripts, this book is applicable. So, yeah. One of the things that I never understood, until we started talking about it, is that a lot of people make a big distinction between master data and reference data, you know, this data and that data, structured data, unstructured data. I know you explain this in the book. I mean, what's the big deal? Why is this so important? Hmm. Yeah, excellent question. People mostly try to classify data as structured and unstructured. I don't dispute it. It is true you can classify data as structured and unstructured. But I also classify data from the business point of view into three-plus-one types of data. The first is reference data, which is primarily about the business categories, like which country you operate in, which currency you do your business in, which plants you have your operations in. Those are the kinds of things which categorize your business, and I call them reference data. Then master data, as most of you would know, is primarily about the business entities, such as your customers, your vendors, your G/L accounts; even concepts such as contracts and warranties can pretty much be classified as master data.
And finally, the transaction data, which is data about business events or business transactions, pretty much the time-series data: purchase orders, sales cycles, even payroll runs, invoices. All those things which are a function of time I call transaction data. And the common thing that binds these three types of data together is the metadata, or the data dictionary, which defines: this is the primary key, this is the field type, all those kinds of things. So what I thought was, from a business point of view, classifying data as reference, master, and transaction data is more appropriate if somebody wants to get the value out of data. So while I'm not disputing the structured and unstructured classification, I've explained this concept in detail. So why does somebody who doesn't have data in their title care? You still there, Prashanth? Yes, I'm here. So if you look at the example of business data, it's not that the reference data or the transaction data or even the master data are standalone entities sitting in one table in the database. They all come together. Let's take the screenshot I took from an SAP purchase order. If you look at a purchase order, for example, it has got reference data, such as the currency and the order type, which are pretty much built in when it comes to the core. It has got master data elements, such as the vendor master and the items. And finally, it has got all the transaction data with the numbers: even the purchase order number, which is generated when you click submit or commit the transaction, the prices, the quantities. Everything which is a function of time is transaction data. So this is the practical significance of the three types of data. Behind all this, there is metadata, of course.
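Prashanth's point, that one purchase order mixes all three types of data, can be sketched in code. This is an illustrative toy model, not SAP's actual data model; all field names and values are invented.

```python
# A purchase order as a mix of reference, master, and transaction data.
from dataclasses import dataclass
from datetime import date

# Reference data: business categories, shared across the enterprise.
REFERENCE = {"currency": "USD", "order_type": "standard"}

# Master data: business entities such as vendors and items.
MASTER = {"vendor": {"id": "V-100", "name": "Acme Bearings"},
          "item":   {"id": "M-7",   "description": "ball bearing 6204"}}

@dataclass
class PurchaseOrder:
    # Transaction data: a function of time -- numbers, quantities, prices.
    po_number: str
    order_date: date
    quantity: int
    unit_price: float
    # Reference and master data elements pulled into the transaction:
    currency: str = REFERENCE["currency"]
    vendor_id: str = MASTER["vendor"]["id"]

po = PurchaseOrder("PO-4711", date(2017, 5, 1), 200, 1.25)
print(po.quantity * po.unit_price, po.currency)  # 250.0 USD
```

The metadata Prashanth mentions corresponds here to the type annotations and key names that every record shares.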
So classifying this way helps us govern the data better and manage it better, and even some of the questions about data governance, line of business versus enterprise, can be answered better with this classification. I want to build on this because I think this is a really important point, and it has practical significance that all of us really need to understand. I mean, I like to use a lot of analogies that don't have anything to do with data. And the analogy that came to mind when you first showed me this slide was a pencil, right? That simple lead pencil that everybody uses every day, and, I don't know, a hundred of them cost a couple of dollars or something like that. But the point has been made that the pencil, simple as it is, is so complex that nobody understands everything needed to make it. There's things about the graphite. There's things about the wood. There's the paint. There's putting a little metal clasp on it. There's an eraser. And so on. Okay, and here on this slide is about as simple as it gets in terms of business data. Yet it's sort of like that pencil. The transactional data came from one place. The reference data came from a different place. The master data came from somebody else setting up the master record with the vendor. Underneath all this is the metadata, right? So it seems to me that, you know, to make a pencil, a lot of things have to go right. And what you're really saying is that to use data effectively, a lot of things have to go right. And that's why somebody who doesn't have data in their title needs to know about this. Do you agree with that? I agree, absolutely true. Okay. We mentioned unstructured versus structured. I don't want to spend too long on this slide, but I hear these stats that, you know, only 30% of the structured data is being used, and less than 1% of the unstructured data is being used.
And, you know, it seems that people are spending a lot of time worrying about data lakes and on and on and on. And so, you know, what's the big deal here? Why shouldn't we be focusing solely on the most important stuff? And really, almost all of that is structured. Yeah, so coming to the definitions first, let's sort this out. Structured data is when, during the time of data origination, there is a predefined structure to capture it in. I can't hear you now, Prashanth. What makes data structured is that, say, a customer is captured in a customer field. However, you can even capture that same customer as unstructured data, by putting it in free text as something like "Procter and Gamble, Brussels," for example, and then it would be classified as unstructured data. But the key point is that unstructured data is about, according to me of course, the taxonomy. It is the taxonomy which really matters when you want to derive value out of unstructured data. So, if you go to the next slide. If you look at the description of the item which you see, there is pretty much a taxonomy to classify this ball bearing. But there could be an aberration or a deviation from the taxonomy that has been defined. And that is where there could be a potential loss of opportunity if somebody wants to derive value out of unstructured data. True, as Tom just said, 80% of the data that is created is unstructured, right? We now have the technology to capture this unstructured data. But the point here is that if you don't have the right taxonomy defined to harness this unstructured data, then pretty much what you are capturing might end up as a waste of time and effort. The bottom line is: taxonomy and unstructured data go hand in hand. Okay, thanks for that. Okay, yeah. So, Tom, let me ask you a question here.
You have emphasized a lot putting data to work, making data practical. Tell us more about where data science fits in the big picture. Okay, yeah, this is something I spend a lot of my time thinking about. And, you know, personally, I think we're at the early stages of a data revolution that is as fundamental as people coming off farms and moving into factories. I think this business of putting data to work is really, really important, and we need to get better at practically everything we do. I mean, we need to get better at the day-in and day-out work, and I've already complained about the impact of quality on that. You know, we need to get better at management and planning; frankly, I don't know how some people run their companies with the quality of reporting they have now. We need to get better at predictions so we can plan. But I think the heart of the data revolution, if you work for a for-profit company, is about putting data to work, largely to make money, and doing it in new and exciting ways. And for me, the data science portion of it is figuring out what those new and important ways are. Everybody knows that if you use more data, you can reduce uncertainty and make better decisions. And so, you know, let's just take a typical decision. It's hard to measure uncertainty, but let's say today the uncertainty is 57%. Wouldn't it be great if tomorrow we could make it so the uncertainty is 56%, and two weeks from now 55%, to just continually get better at this, right? Wouldn't it be better if we could continually think of ways to informationalize products? And I guess, you know, I think of Uber as an informationalizer, and I want everyone to just think about what Uber did: they took the process of getting a cab, right? And they combined two pieces of data that hadn't been combined before.
One was "I'm looking for a ride," and the other was "I'm looking for a fare." Just two pieces of data, and they've upset the entire business, right? And there's plenty of other ways to put data to work. I don't wanna go through the list here, but I do wanna mention one other concept, and that is this thing from Strategy 101, right? You know, if you took Strategy 101 in business school or whatever, I think lesson number one would be: have something that other people don't. And in the data space, that means proprietary data, right? Stuff we have that nobody else has. And so, you know, sorting out how we're gonna get that stuff, how we're gonna protect it, how we're gonna leverage it, that to me is data science too, right? So data science for me is very, very broad and very much directed at the new-revenue kinds of things or at the big cost-reduction kinds of things. Now, Prashanth, back at you. You've talked about, you know, doing data engineering before data science. Oh, excuse me, I failed to talk about this slide, which is my slide. I mean, this whole business of putting data to work, I guess I think of it as an end-to-end process, and you can see the four steps there. But look, it's not enough to have good data. You gotta do the discovery. You gotta figure out how you're gonna deliver it in terms of a product or service or report, and you gotta figure out how you're gonna do so at a profit. And data science, per se, is the big piece of the discovery that goes with any way, as I just mentioned, to put data to work. Now, excuse me for getting ahead of myself, but Prashanth, I really want to get to this question for you: you emphasize data engineering before data science. Okay, can you tell me about that? Why is that? What's so important about this? So, Tom, taking off from where you left, the D4 process which you mentioned in the previous slide.
If you look at this slide, which I call the data life cycle: right from the time the data is originated, which might be through machines or through a person's mind; through the time it is captured in the system; it's validated and processed; and finally it's distributed (say, for instance, the same business process is managed in two different systems, in two different geographies; how do you integrate them together?); and finally aggregated by all the BI stuff, cubes and everything. All of that happens before the discovery, the interpretation, and the consumption, which is pretty much your dollars aspect from the previous slide. So if you look at the top two bars, data security and data storage, those I pretty much call an IT function, whereas the chevrons in the middle are pretty much a business function, though IT might be playing a big role in them. So the first thing is, why is data engineering so important? Before you get your dollars through interpretation and consumption, there is a bunch of activities which you need to do, right from the time the data is originated till the time the data is aggregated, so that the decision makers, or rather the business, can interpret the data and make appropriate decisions. So basically eight out of ten steps in the data life cycle pertain to data engineering. So yeah, that's one reason why I believe the data engineering aspect is fundamental, one of the first steps on the way to data science or analytics. So if I go back to this slide for a minute: this thing I call data, it's almost like it's D1. Well, obviously D1 comes before D2, which is the data science. And what you're saying is, let's break D1 down. And these things we need to do in D1, now there's D1a, D1b, and so on, are all these eight things, right?
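The engineering stages Prashanth walks through can be sketched as a tiny pipeline. The stage names follow his slide, but every implementation below is a toy stand-in invented for illustration; no real system is assumed.

```python
# The data life cycle as a chain of transformations: each engineering
# stage runs before anyone interprets or consumes the data.

def originate():   return [" widget a ", "widget b", " widget a "]
def capture(d):    return list(d)                      # land it in a system
def validate(d):   return [x.strip() for x in d]       # basic cleansing
def process(d):    return [x.upper() for x in d]       # apply business rules
def integrate(d):  return sorted(set(d))               # merge across systems
def aggregate(d):  return {"distinct_items": len(d)}   # BI-style summary

data = originate()
for stage in (capture, validate, process, integrate, aggregate):
    data = stage(data)
print(data)  # {'distinct_items': 2}
```

The point of the chain structure is Tom's observation that each stage places an upper bound on what the later stages can deliver: a bad `validate` step here would corrupt everything downstream.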
And how well you do these things places a strict upper bound on how well you can do the later steps in that end-to-end process. Absolutely. Yeah. So now this leads to the question: is data engineering the same as cleansing? In my view, it is not. Data engineering is more than data cleansing. Based on what I said in the previous slide, I just captured those blue chevrons and put them here, right from origination to aggregation, and this whole set of steps I call data engineering. Data cleansing plus data validation is one part, one component, within data engineering. So what exactly happens here? What is data validation, and what exactly is data cleansing? This is primarily about data formatting. For unstructured data, it could be making sure that it's in the right taxonomy. It could be removing duplicates, manually or through a routine. It could be checking the cardinality of the data set. These are all the steps which you need to do as part of the data cleansing activity, or data validation, which is one of the fundamental steps of data engineering. So data engineering is not data cleansing; it's much more than that, and data cleansing is an important pillar within data engineering. I mean, I guess you'd probably agree that you'd like to build as much quality into origination as you can, right? And be doing as little validation and correction as you possibly can, right? I mean, you know, there's organizations that spend 80% of their time doing validation, and that's because they do such a bad job at origination, right? And so the point is, with data engineering, you want to be broad, right? You still got to go through these steps, but you want to do them as smartly as you possibly can. Correct, yeah, absolutely. But data origination is an important step. One of the... Prashanth, I can't hear you right now. Shannon, can you hear Prashanth?
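The cleansing operations Prashanth lists (formatting, de-duplication, a cardinality or valid-value check) might look like this in a minimal sketch. The record fields and the set of valid country codes are invented for illustration.

```python
# A toy cleansing/validation routine covering the three operations
# mentioned: formatting, de-duplication, and a valid-value check.

def cleanse(records):
    # 1. Formatting: normalize whitespace and case in the name field.
    formatted = [{**r, "name": r["name"].strip().title()} for r in records]

    # 2. De-duplication: keep the first record per (name, country) key.
    seen, unique = set(), []
    for r in formatted:
        key = (r["name"], r["country"])
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # 3. Valid-value check: every country code must be a known value.
    valid_countries = {"US", "DE", "IN"}
    return [r for r in unique if r["country"] in valid_countries]

raw = [
    {"name": " acme corp ", "country": "US"},
    {"name": "Acme Corp",   "country": "US"},   # duplicate after formatting
    {"name": "Globex",      "country": "XX"},   # fails the value check
]
print(cleanse(raw))  # [{'name': 'Acme Corp', 'country': 'US'}]
```

As Tom notes right after, the goal is to need as little of this as possible by building quality into origination.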
I cannot. Prashanth, I don't know what happened; there is a phone number you can get in on. Tom, if you want to continue on the stream of thought there, I will bring Prashanth back and see what I can do. Yes, our apologies to attendees for this. Let me go to the next slide. The next two things that we wanted to talk about were sort of around this business of organizing in the right way to do good data science. And so I'm going to split this into two slides. The first is this question that comes up all the time, and that is: where do the data scientists sit? And there's arguments about, you know, putting them off on their own, or there's arguments about putting them close in the business. And most of those arguments kind of assume a one-size-fits-all answer. And I've not found that useful at all. I think, you know, this is another one of these form-fits-function things; it depends on what you're trying to do. And so the sense of this slide is: well, imagine a continuum, and on one end of the continuum you're looking to do, you know, basic process improvements. I mean, it may be simple BI stuff, right? You know, day-in and day-out stuff on the line. And on the other end of the continuum, you're looking to make fundamental new discoveries, whatever your industry's equivalent of the discovery of the Higgs boson is; you know, you're trying to do that. And of course, this is a continuum. So, you know, I put this thing called new sophisticated algorithms more towards the fundamental-discovery end. You know, maybe you do credit reporting and you're looking for new algorithms for doing that. I suppose you could also be parameterizing an old algorithm and so forth. But it's very, very clear that for the stuff in the line, the basic process improvement stuff, you ought to be doing that data science in the line, and you ought to be getting as many people involved in it as you can, right?
So, whatever your group is, it ought to sit ideally with, or at least as close as needed to, the people who are doing the day-in and day-out work. Now, you know, to the degree that you're really seeking a fundamental new discovery, you really need a separate organization to do that. You really need to do that in a lab. And there's lots and lots of reasons for that. I mean, it just turns out that the harder the problem, the more forward-thinking it is, the more it requires diverse data sources and so forth, the more the need to set up an organization specifically dedicated to that discovery. And then of course, things like new sophisticated algorithms, well, you know, not in the line per se, there's too much discovery to be done, but not as detached from the line as a laboratory may be. The other concept here is this data lab and the data factory. I find that, you know, the good old-fashioned Edison labs are a tremendous example of what you need to do. If you're really looking for new discovery, new products, new services and so forth, right, you do set that up as a lab, with different kinds of management, a different mindset, different people, different kinds of goals and so forth. But the lab is not there to turn a discovery into something that you deliver at scale, under great control, and at great profit, right? I think people in the lab are less worried about security and how you'll provide a help desk and those kinds of things. And so the analogy is: you do discovery in the lab, you have production in the factory, and you go to great lengths to make sure the two are connected.
At the Edison works that I mentioned, you know, the lab and the factory were on the same campus, in some respects in the spirit of one building for the lab, one building for the factory, but on the same campus, so people are walking back and forth continually. So Prashanth, I see you're back online, is that correct? Yeah, I'm back; I don't know how I got disconnected, yeah. So look, I mean, I know you've done a lot of work in these factory kinds of settings, right? Obviously not literally factories, but data factories. Can you break that down? Tell us a little bit more about what's needed there. Okay, so, you know, there could be many prescriptive things that people might say about how to derive value out of data, but the four things which are very close to me, from what I have seen in my consulting assignments, are these. Number one, manage your core business processes in your system of record: core business processes such as procurement, HR, asset management, those kinds of things which don't vary significantly, which are not really your competitive advantage. So manage them in your system of record, which in most companies would be your ERP. Number two is manage reference and master data using standards. Why standards? Because the reference and master data are typically shared across... Tom, can you hear me? Shannon and Tom, can you hear me? Yes, you're fine. Okay, good. Yeah, so why standards? Because the data is shared, and sharing of the data becomes much easier when it complies with a standard. There are numerous standards, whether you talk about oil and gas, retail, automotive; so many standards are available. So that's the one which is going to give the most value for your buck: it helps reduce cost, it helps improve functionality, and it helps you improve interoperability and portability of data.
So try to go with standards for reference and master data. Transaction data is pretty much line-of-business context, so there's not much you can do about it, whereas reference and master data are shared across the enterprise; these are enterprise data elements, so use standards. On the same note, use standards for data integration as well, whether that's ESB or SOA. And nowadays we're even talking about newer technologies, such as federated data and data virtualization, which can help a great deal with data integration if you have the right standards. And finally, which most of you would agree with, data governance is not an IT function. A few days back I was talking to one of my friends who happens to be in the business; he's a geophysicist. And he told me, hey, I read your article on LinkedIn, you've written something about data. I said, great, how did you like it? Oh, I don't read the stuff that's directed towards IT. So even today there are many people who still believe that data is an IT function, but things are changing. Gartner recently came out with a study saying that in 2016, close to 50% of Chief Data Officers were reporting to CEOs. So things are changing, and data governance is positioned as a business function in more and more companies. Yeah, it's all about control in the factory. So Prashant, we're getting near the end of our time to talk, and time for people to have questions. Look, we're talking to the data community. Of the things we've talked about here, what are the three most important things you'd like somebody with data in their title to really take away? So, Tom, can you go to the next slide, please? I'm on 17. Yeah, 17. Okay. So first of all, what I believe is that there are many definitions of analytics.
If you just Google the definition of analytics, the one from SAP is completely different from Microsoft's, which is completely different from what SAS is talking about. So I define analytics as using data to answer business questions. Those questions are contextual: they come from the business goals, and the KPIs are what make sure that the quality of the data, and the results you get out of the data, are making sense. That's my first point. The second important point is: there is no data management endeavor if there is no customer. So find out who your customer is, and why, and what matters to that customer. Moving to the cloud just because server utilization went from 7% to 6.3% is not a strong business case. And finally, in my view, data quality is not a project. You don't do a six-month project to improve data quality. The cultural aspect is a significant part of it, and it's a journey: as long as the business exists, the journey continues. So Tom, what's your advice to us? Yeah, thank you. I think, Prashant, you and I are headed in the same direction, but I want to convey a greater sense of urgency to people in the data space. You mentioned on your last slide that we're finally getting to where CDOs aren't all reporting into CIOs, right? But this question of where data should report: I think I've been having that discussion over and over again for 25 years. It took four or five years for people to sort it out and compare experiences and so forth. But for at least 15 years, anybody who's thought about it very much has concluded that data belongs in the business space. And to me, the fact that we're still having that discussion is indicative of the first bullet item here, which is that the data space is advancing too slowly, and it's time for practitioners to push harder.
And secondly, data practitioners must take on the tough organizational issues. My view of the reason so many data groups report into IT, and historically have, is that we allowed it to happen. Maybe we didn't have the political capital to change it at the time, but we didn't develop the political capital to make it different either. So those are my points one and three. And the last thing I want to say is: everything runs through quality. Furthermore, it's the easiest place to start. It doesn't require advanced technologies, and there's very little lead time in it. You just step up as a better customer and a better data creator, and find and eliminate the root causes of error. And in the organizations that have done this even tolerably well, the money they save in doing so, and, even more important, the trust in the data: those benefits are simply stunning. So, Shannon, we're ready for questions from the audience. And it just occurred to me, Shannon: look, Prashant and I tried to do this webinar in a different way. Not today, but if people would let us know, did this sort of back and forth, this pushing each other a bit, work, or did it seem too contrived? Should we be doing more of this, or less? Please help us with that. Well, thank you guys, both of you. Speaking for the attendees, I think it went well in terms of content, despite some of the technical issues with the audio. Technology always feels great when it works; thanks to all of you for your patience with that.
Just to answer the most commonly asked questions, a reminder: I'll send a follow-up email by end of day Monday with links to the slides, the recording, and anything else requested throughout the webinar today. So let's dive into the attendee questions; we've got quite a few coming in. Toward the beginning we got a comment that said: I would challenge the comment about unstructured data not having much value. And the follow-up to that: some unstructured data may be about the sentiments of people, captured in a comment field. Can you please explain how taxonomy would help to understand and derive insights from such data? So, I'll take the first half of that, which is the value question. First of all, I personally believe there's enormous potential in unstructured data. The simple fact of the matter is that very little of it is actually used to create that value. And if something's valuable but you're only using 1% of it, then a really good next step might be to use another percent, to try to use 2% of it, and then three, and so forth. And by the way, too many organizations are focused on making more and more of it instead of saying, okay, you believe this stuff is valuable, so get that 1% up to 2% rather than just trying to make more. And I'm pretty sure that advice will hold up. Now, Prashant, can you take the question about taxonomy? Okay, so the main thing is: if you're looking at unstructured data and trying to derive the value out of it, the taxonomy is the main thing that's going to define it.
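The taxonomy idea Prashant describes can be sketched in a few lines: map free-text descriptions onto a common noun-and-modifier scheme so that variant spellings of the same item classify identically, which is what makes search and de-duplication possible. This is a toy Python illustration; the synonym table and taxonomy entries are invented for the example, not taken from any real standard:

```python
# Sketch: classifying free-text product descriptions against a simple
# noun-modifier taxonomy. Synonyms and taxonomy entries are illustrative.

SYNONYMS = {"vlv": "valve", "gte": "gate", "ss": "stainless steel"}

def normalize(description):
    """Lowercase a raw description and expand known abbreviations."""
    return " ".join(SYNONYMS.get(t, t) for t in description.lower().split())

def classify(description, taxonomy):
    """Return the first (noun, modifier) pair whose terms both appear."""
    text = normalize(description)
    for noun, modifier in taxonomy:
        if noun in text and modifier in text:
            return (noun, modifier)
    return None

TAXONOMY = [("valve", "gate"), ("valve", "ball"), ("pump", "centrifugal")]

# Two very different raw strings land in the same taxonomy class,
# which is how duplicates get found.
print(classify("GTE VLV 2in SS", TAXONOMY))     # -> ('valve', 'gate')
print(classify("Gate Valve, 2 inch", TAXONOMY))  # -> ('valve', 'gate')
```

In practice the synonym expansion and attribute lists come from the industry standard itself rather than a hand-built dictionary, but the mechanism is the same.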
For example, in one of the projects I worked on for a leading oil and gas company, we decided to standardize all their product descriptions using noun-and-modifier standards: a noun, a modifier, and a set of attributes defined by standards such as PIDX, which applied since this customer happened to be in oil and gas. It took some effort to bring all the data, which didn't have any taxonomy, into this PIDX taxonomy structure, but at the end of the day it was worth their time and effort. And one of the things about data quality is searching: however good the data is, if you're not able to search for it, if it's not available, it's as good as not there. A taxonomy can help a lot in improving your search, in better training, and finally in using the data to reduce duplicates as well. So Prashant, I'm really glad that question got asked. This is something for us to think about a little more, but I think there's a combined answer that says: okay, I just challenged people to go from 1% to 2%, and you said a taxonomy will help you do that. We've got to put some meat on that answer, but I think it could really be helpful to people. So let's try to do that in the next three or four weeks. Okay. What's next, Shannon? Thank you for the answer. Yeah, we've got lots of great questions coming in here. Next question: it's clear that data is an essential element in business strategy. In your opinion, how many companies have fielded long-term data governance programs, and why aren't more of them paying attention to this aspect of data management? Long-term? So look, I haven't done a careful study, but I think most data governance efforts are failing, and they're failing for lots and lots of reasons. Governance is about control, and a lot of things in the data space are out of control, but control's just not all that popular.
I mean, people get that we want to create new revenue, right? They get that we want to make better decisions. They get that we need to innovate. But why do we need to be under better control? On its own, that's hard to understand. An overall data strategy puts them both together. It links up how we're going to go about making more money, which ways of putting data to work are best for us and will improve our competitive position, and then the governance programs can be a whole lot skinnier, focused on things going on in the factory, and much more linked to something everybody gets: again, making money. And by the way, when I say make money, I don't mean to be quite that narrow about it; there are lots of non-profits and government agencies where the equivalent is advancing the interest of the agency or the department. So what do you think, Prashant? So, Tom, one of the main reasons I think data governance programs are failing, or being challenged, is primarily lack of focus. By lack of focus, coming back to my earlier slides, I mean identifying which elements we need to govern, and according to me that should be the master data and the reference data. There are many projects where I have worked where the data governance team goes to the business and tells them how they have to create purchase orders, how to post invoices, those kinds of things. That is not the job of the data governance team; the business knows their job better than anybody else. Your job is to help the business succeed with data by providing them quality data and ensuring that the data shared across the enterprise is neat and clean.
And that comes by focusing on reference and master data, not on all the data elements in the enterprise. Yeah, so said differently: we've got to do better, we've got to focus more to create greater value, and we've got to spend less time doing stuff we shouldn't be doing. It's a pretty strong indictment of governance the way it's done in most organizations. Yeah, and another point about data governance: I personally think the data security and data privacy aspects are generally taken as the last step in data governance. That could be one reason why all the effort sometimes gets nullified, due to security and privacy reasons. As we were talking about this: if you do data quality right, you figure out where the errors are coming from and you eliminate those root causes. It's not the way typical organizations are doing it, but it is so much easier than correcting everything. Correct one root cause and you can save yourself literally tens of thousands of corrections downstream. And this is kind of a theme: in a lot of what we're doing, we're not quite working on the most important stuff, or in the most important way. And so the efforts get too big and oppressive, and no wonder people don't like working on them. Well, that leads us right into our next question here. What about the data retention, destruction, and/or archiving portion of the life cycle? There is usually no need to store all data forever, which could pose a risk to your organization. Is this encompassed in another stage or step? Okay, so if you look at the data life cycle, the top two bars, one on security and one on storage, are pretty much an IT function.
Data storage I would further subclassify into active data, and passive or archived data, which, depending on your company's retention policies, whether that's five years or seven years, can be archived and stored. The archived data might not have a direct impact on running the business operation, but it certainly has implications for compliance, and sometimes even for business operations. That's why I said in the earlier slides that data for business performance is a function of three things: analytics or insights, compliance, and operations. So yes, archiving and backups are definitely part of the data life cycle, classified under data storage. Shannon, I think we've got time for one more. I think so, yeah. Indeed, so back to taxonomy: why is taxonomy creation not part of your data engineering component? Where does it belong, and must it be there before starting a data engineering effort? The answer is yes. The first thing is that the taxonomy would be pretty much part of data origination and data capture: in what shape and form you want to originate and capture the data is part of the taxonomy, number one. And in today's presentation, when Tom asked me the question about how data cleansing is different from data engineering, I covered it as part of validation. Validation involves formatting, or adherence to the taxonomy. So, long story short, taxonomy is part of the validation aspect, the data cleansing element. I love it. Well, guys, we are coming up right at the top of the hour here. Thank you so much for this great presentation and information. Again, just a reminder to all the attendees: I will be sending out a follow-up email by end of day Monday with links to the slides, the recording, and anything else requested throughout.
And I'll, of course, send an email with links for both Prashant's and Tom's books so that you can get more information as needed. Thanks, guys. Thanks so much. And thanks to all our attendees for everything that you do, for participating and asking so many great questions. We just love the engagement. I hope everyone has a great day. Thanks, guys. Okay, thank you everyone. Thanks for attending. Thank you.