Hello and welcome. My name is Shannon Kemp and I'm the Chief Digital Officer for DATAVERSITY. We'd like to thank you for joining today's DATAVERSITY webinar, Data Preparation Fundamentals, sponsored by Toad by Quest. It is the latest installment in a monthly series called Data-Ed Online with Dr. Peter Aiken. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. We will be collecting questions through the Q&A panel, or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using the hashtag #DataEd. If you'd like to chat with us or with each other, we certainly encourage you to do so. To open either the Q&A or the chat panel, you'll find the icons in the bottom middle of your screen. Note that the Zoom chat defaults to sending only to the panelists, but you can absolutely change that to network with everyone. And to answer the most commonly asked questions: as always, we will send a follow-up email to all registrants within two business days containing links to the slides. Yes, we are recording, and we will likewise send a link to the recording of this session as well as any additional information requested throughout. Now let me turn it over to Gary for a brief word from our sponsor, Toad by Quest. Gary, hello and welcome.

Hello and thank you, Shannon. Can you all hear me okay? Sounds good. All right. We certainly appreciate the invitation to be your sponsor today; we sincerely find it a pleasure. My name is Gary. I'm a technical solutions consultant specializing in the information management portion of our business at Quest Software. I'll take a couple of minutes to talk about what Quest Software is; it actually dovetails very nicely with what Peter is going to be talking about in about five minutes. Basically, we are in the business of making your IT crew and operations as resilient as they can be, whether you're managing people, process, or technology, and whether that technology involves managing data, hardware operations, or just locking everything down. We have a number of things you'll want to take a look at, whether you're an IT pro, you're managing the IT part of your organization, or you're part of the leadership team responsible for your constituents, advisors, or owners. We can help you, and we do that through three business units here at Quest that let you manage the resilience of your IT organization, because let's face it, if that goes down, your company or organization probably goes down with it. Business unit number one: platform management, focusing specifically on the Microsoft portion of that. Business unit number two: securing your IT through identity and access management. Business unit number three, which is the most germane and relevant to Peter's talk today, is the unit that empowers data, data lifecycle management, and governance: our information and systems management business unit. I've listed some products for each of the business units; I'm not going to talk much about products today, but it may be useful to take a closer look at this third business unit, which is most germane to Peter's topic.
So let's take a closer look at the data empowerment and governance portion of our business, which is information and systems management. It almost doesn't matter what your initiatives are in IT or in the business: cyber resilience, migration to the cloud, modernizing your applications and application platforms and stacks, getting a handle on your sensitive data, NoSQL initiatives to move data to data lakes or data streams, DevOps, DataOps — it doesn't matter, we've got something for you on the information and systems management side, and we break that down into three areas. Area number one of the data empowerment platform that Quest provides is data governance. You're going to hear Peter talk about some of these things today: data profiling, cataloging of your data, metadata management, data preparation, data debt, impact analysis, dependencies. All of that is part of data governance. Most of you will recognize the erwin name; Quest acquired erwin about a year and a half ago to flesh out this portion of our information and systems management business unit, and that's been a huge acquisition and a nice contribution to our other solutions. The second area within our data empowerment platform has to do with data operations: whether you're trying to balance the load on your hardware, optimize hits to the CPU or the database servers with SQL tuning and optimization, define or look into DevOps or even DataOps, or get a handle on your sensitive data, and much more — we have a lot of solutions that can help you with the operations side of the house. These are some of the named products that help on that side, and you will recognize some very ubiquitous names like Toad, Foglight, Spotlight, and SharePlex. Those products have been around for a long, long time — very mature products, with literally millions of users and thousands and thousands of customers around the globe using them to help with their data operations. The third area in our data empowerment platform has to do with data protection. You'll see some big names here, like KACE, QoreStor, and NetVault. That one is germane too, because you don't want to leave data out in the open, right, especially in today's hacking environment. So we can help with backup and recovery, software and data compliance, systems monitoring and diagnostics, or optimizing the cost of going to the cloud. Those are the three main areas of the information and systems management business unit at Quest Software that can help you empower the things you need to do with the data that really forms the lifeblood of your organization. That's all the time I wanted to take for a quick overview of our company. You can certainly visit quest.com for a portal that can take you to a lot of different places; I've put some resources here that might be useful, and most of these are germane to what Peter will be speaking about today.
So this deck will be made available, but having said that, I wanted to leave a lot of time for Peter because he is our main attraction today. So, Shannon, back to you to get started with Peter's presentation.

Gary, thank you so much for kicking us off, and thanks to Toad by Quest for sponsoring today's webinar and helping to make these webinars happen. If you have any questions for Gary, or would like to join us in the Q&A portion of the webinar at the end, feel free to put your questions in the Q&A panel. And now let me introduce the speaker for the webinar series, Dr. Peter Aiken. Peter is an internationally recognized data management thought leader. Many of you already know him or have seen him at conferences worldwide. He has more than 30 years of experience and has received many awards for his outstanding contributions to the profession. He has written dozens of articles and 12 books. Peter has experience with more than 500 data management practices in 20 countries and is consistently named among the top data management experts. Some of the most important and largest organizations in the world have sought out his expertise, and Peter has spent multi-year immersions with groups as diverse as the US Department of Defense, Deutsche Bank, Nokia, Wells Fargo, the Commonwealth of Virginia, and Walmart. And with that, let me turn it over to Peter to get his presentation started. Peter, hello and welcome.

Thank you, Shannon, and thank you, Gary, for a great little talk getting us started. Shannon, is my music still playing? I had a feeling I might have left that on. — I would have let you know; we're all good. — Welcome, everybody. That's all right, our feedback crowd will let us know when we make mistakes and that sort of thing; we welcome absolutely all of that. So welcome. Today is the first time we've tried this topic in this fashion, and it's really in response to your requests, so do keep in touch with Shannon and let her know what you'd like to hear. Let's talk about data preparation fundamentals. It's an interesting topic in that not many people really get this, and I'll walk you through a couple of bits here. Let's start with a very well run survey by Randy Bean and Tom Davenport. You can see the slide; the URL is very straightforward, newvantage.com, and there you can get years' worth of the studies they've done. What they've determined over the years is that organizations are not driving innovation with data, largely they're not competing on data and analytics, they're not managing data as a business asset, they're not creating data-driven organizations, and they're not creating data cultures. They're interesting stats, and they are actually current as of last year, so wonderful pieces. Here's another piece: in the upper corner you'll notice that, when asked, organizations responded that the problems they have around data are largely people and process based. They are not technology challenges. So technology needs to play an important role, but it can't play an all-consuming role in what you're looking into. We'll look at how this actually works out from an external perspective. To start off, say that we're going to have to do some amount of data analysis, and some amount of preparation in order to get ready for that analysis. What do I mean by preparation? It's any such action — typically it's called munging; you can look munging up on Wikipedia — but it's a non-destructive act of improving the data in some area.
This is probably not optimal. In fact, maybe an optimal solution would look like this: you could spend 80% of your expensive data analysis resources — your data scientists, etc. — 80% of their time doing analysis and 20% of their time doing preparation, and if you had that ratio, you'd do everything you could to take that 20% down toward zero. Of course it doesn't work that way, and worse still if you're just finding this out: everybody else knows that data munging — data preparation — occupies 80% of the time devoted to data projects. And when I say everybody knows, it's because if you talk to anybody in the business they will tell you this 100% of the time, and there's good scientific data backing it up as well. But while this is quite common knowledge inside the data community, outside the data community — when you're planning these things, when you're trying to recruit people, or you're trying to manage a project of this type — it is very much problematic. And what these Pareto data realities add up to is that 80% of your data is redundant, trivial, or obsolete, 80% of your data is of unknown quality, and 80% of your data is what I call standards free, which means 80% of your highly paid analytic capability spends its time working under these conditions. This is a problem, because IT has always considered data a business problem, whereas the business looks around, sees somebody with the title of chief information officer, and says, who else would be taking care of my data? Data has fallen into an enormous chasm between business and IT, and it's up to us as data professionals to put it back together and reestablish the trust and cooperation. But we have to do that in spite of mountains and mountains of data debt that slows progress, decreases quality, increases costs, and presents greater risks. Data debt is the idea of just getting to a normal state, getting back to zero. Generally it involves undoing existing stuff with a new set of skills that you may or may not have organizationally, and this leads to a real challenge around the fact that large numbers of businesses report making bad data decisions, or using data to get to bad decisions. In fact, I call it the bad data decisions spiral: business decision makers and technical decision makers are not data knowledgeable, therefore they make bad data decisions, leading to poor treatment of organizational data assets and poor data quality, leading to poor organizational outcomes — lather, rinse, and repeat. How does one break out of this? Morgan Freeman — "this is wrong" — is absolutely correct. The point is, if you don't recognize that this is a problem, you'll never stop making these kinds of decisions. So what I hope I've done is motivate you a little bit to see that data preparation is an unknown but very important component of what we're looking at. Where we're going to go next is data preparation considerations. Data problems are different, and we'll spend a fair amount of time in an area I call reverse engineering, which is generally the idea of understanding your data sets, or introducing yourself to a data set. So let's go into the considerations piece. Data is not broadly or widely understood. Think of the blind persons and the elephant: somebody thinks it's a fan, a snake, a tree, a rope, a wall; and in the data world people come in through various paths as well, thinking that's all there is to data, and most do not develop a holistic approach to taking a look at it.
Approaches in the past, in terms of definition, have not been useful either. We have said that data management is everything that occurs between when data is sourced and when data is used. Well, that's useful but not very precise, even if it is correct. So first of all, let's look at it as sources, data management, and uses — but let's take the word uses and replace it with reuse, because when we plan for reuse, we are able to invest more in our data assets and obtain value from them. That's a big fundamental change right there for organizations. Here's another refinement of the same topic: the idea is, of course, that there is some preparation, and here are some lists describing the amendment and exploitation components on the other side. Let's put this into the reuse context and understand that this is where the 80/20 ratio comes from, and that only governance around these topics can actually help us prevent what has become standard in most organizations. Now, in addition to not fully understanding the environment and the amount of effort required in each area, many people look at technology as a one-legged stool: it's the be-all, it'll solve all our problems. But of course if you try to do anything with a one-legged stool, it doesn't work very well; three legs is the minimum operative requirement. From a data perspective it's also got to be a combination of people, process, and technology, and even more so than in most disciplines, these three are interdependent to a very high degree in this environment. Identifying winning combinations of people, process, and technology is what leads you to approach data management and preparation not as a series of technologies, but as an architectural problem — and to see that data technology is part of the overall technology architecture that the organization undoubtedly manages. As part of the enterprise data architecture, you need to address at least three questions: what technologies are acceptable, which purposes are we applying to which circumstances, and, in distributed environments, how do I move data from one place to another? These are requirements that must be understood before you attempt to go out and survey the market and look at various types of solutions. Another good reference around that is the ITIL standard, which is available online and should be examined; it is a reference model for technology management, and managing data management technology is a part of technology management, so we should inherit all of the good practices and all the discipline from that well-established field. They're not real heavy on data in there, but nevertheless it is technology management and should be adhered to precisely as such. The idea is: what is this technology doing, and how is it providing value for the organization? I'll show you an example at the end of diminishing returns, but understanding these requirements will help confine our solutions to questions like: what is the problem this is supposed to solve, what sets this technology apart from all the others, are there specific requirements, and does the technology involve data security in any way, shape, or form? All of these are important criteria.
So, the problem is, there's no standard audience for this, and consequently people do not understand the need for high-performance automation, particularly within the data field. Data tends to be a binary thing, in that it works or it doesn't work — there are some in-betweens as well — and it really involves a combination of the organization's data literacy, its data supply (which is typically uneven, especially at first), and its use of standards, which, as I mentioned before, are lightly applied in most organizations. What we have to do to make this data sandwich is to firm up these ideas, to apply standards and literacy and supply chain analysis where it makes sense, to produce a well-oiled machine, and this can't happen without investments in engineering and architecture. And even though it's sort of a sad commentary on the world, I actually traveled to a tea farm in India a couple of years ago, and on the cash register was this wonderful phrase: quality, engineering, architecture — products do not happen accidentally. Of course we can add the word data to it as well, to make sure that everybody understands the investment needed in these architectures. From a preparation perspective, it's again about understanding what needs to be done to prepare the data that is needed for later analysis. The reason this is a challenge for most organizations is illustrated best by the story of Galloping Gertie, the former Tacoma Narrows Bridge. I've given away the result already: it opened on July 1 of 1940 and collapsed on November 7 of 1940, one of the famous failures of bridge building, and they learned a lot of lessons from it. That's a car you're looking at — it's not very easy to see on your screens — and there was a dog in it; the fellow went back to get his dog out, because the dog was probably not going to endure the way the bridge was rotating. Those are high winds causing harmonic vibrations, and of course, if you take the top of a coke can and wiggle it back and forth, it will eventually break off, as the bridge does right before our eyes. This is a very dramatic failure. Data failures cost organizations 20 to 40% of their IT budget, but instead of being obvious and overt like this, they are insidious. Tom Redman, a good friend, coined a wonderful term to describe them: hidden data factories. For example, if department A delivers work to department B, and department B keeps asking for corrections from department A but doesn't get them, B makes the corrections itself — that's a hidden data factory that shouldn't have to exist. The workflow I just described also includes B sending imperfect work on to C, and C sending that imperfect work on to customers: three additional hidden data factories out there, where knowledge workers are looking around for stuff again and spending 20% of their time redoing the work. This is part of the reason why these hidden data factories take over many parts of our organizations and we don't see them. There's another good reason we don't see them as well: this bad data manifests itself as different types of organizational challenges. Think about it — in every case data is going to be filtered through an IT process.
There's a problem of some sort in each case, and only when you connect the dots do you find out that these have common root causes. That root cause analysis is part of data governance; data governance people have to be forensic detectives in that sense. Only when you put together a team that understands how to work these issues is it a good approach. Imagine each of the individual data challenges I show on this slide being addressed individually by a local work group buying a specific tool to fix it. It may or may not work, but it's not going to be the most effective path. So let's keep moving to the real meat of this presentation. What we have, first of all, is the definition of what we teach most young people in university: we teach them that systems are built by describing them as what, how, and then as-built. In the process of forward engineering — or just building stuff, also known as big-D design — the what components are typically written down in some sort of written specification, the design components are generally model based, and the implementation is, obviously, the implementation. These same steps occur whether you're developing in the cloud or on-prem; the idea is that there still has to be thought, and what we do most poorly is think about the requirements aspects of this. It's just a habit we've learned over the years. This is, of course, all we teach young people. But the concept of reverse engineering — which I have to say I had a hand in defining some parts of, in the work we did at the Defense Department — is a structured technique aimed at recovering rigorous knowledge of the existing system in order to leverage enhancement efforts. Every system does some stuff well and some stuff poorly; if we don't know which is which, we don't know what to leave behind and what to bring along. How do we avoid making the same mistakes twice in our redevelopment, our evolution efforts? So the moves from as-is to design, and from design to requirements, are two specific components of reverse engineering that have important aspects that are just not usually addressed — most people discover them. And this is where the tool suites that Gary was describing earlier can become helpful, right at the start of this; there are lots of other places as well. Again, the whole environment here is shifting from building new systems to building instead on existing systems, or migrating existing systems somewhere else. We can't call the process reengineering unless we first understand the existing system's strengths and weaknesses. We may need to bring them all the way back to the requirements, the what, but oftentimes we'll get away with just going back to the how. If we don't use this information to inform the design of the new system, though, we are not truly doing reengineering as defined by the official standards bodies; only when we have this new information incorporated into the design do we redevelop the system and come out with a new set of requirements. These are not techniques that are known throughout the industry. They're starting to be taught sometimes in college and university, but most people discover them as opposed to being taught that they exist. The world would be more efficient if everybody understood that there are well-defined approaches and techniques for this process. Now let's take on another component here as well, which is the word hype.
Whether you're a CIO or a CDO, you're going to feel some pressure to purchase things because they're new and cool, and vendors are very, very good at selling — typically it involves the golf course and nothing in the way of the actual requirements of the organization — so it's important for everybody from this point on to understand what we call the hype cycle. Now I'm introducing Lady Augusta Ada King, whom we consider the founder of the data industry, which means we're only about 250 years old instead of thousands of years old as the accounting industry is. Among the many things she wrote: when considering any new subject, there is frequently a tendency, first, to overrate what we find to be already interesting and remarkable, and secondly, by a natural reaction, to undervalue the true state of the case. Well done, Lady Augusta Ada King. She essentially gave something to Gartner, who refuse to give her credit for it, but they put it out in slightly different, more technical language: there's some sort of technology trigger that leads to the peak of inflated expectations, then almost immediately to the trough of disillusionment, and then up the slope of enlightenment when we finally figure out what we're doing — much time after the original technology was introduced. If you're not familiar with the hype cycle, become so, because Gartner and others provide very detailed examples of it. This one is as of July of 2021, last summer, about a year ago: data fabrics were right about at the top of the roller coaster, getting ready for the ride down; blockchain, on the other hand, looks like it has fallen about as far as it's going to go, maybe a little further, and is starting to come back; data integration tools look like they're getting good at the process; virtualization is very mature; and we're going to come back and talk about cloud in just a bit — notice it there at the top of the peak of inflated expectations as well. There's a lot of information you can get from this chart: the teal-colored dots are two to five years away from the plateau, and the darker ones are moving even slower, so you get some sense of velocity as well. Now, the question of what data management is is something we have to pay attention to, and luckily it has been formalized through the effort of a great many volunteers at DAMA International, where I'm privileged to be the president at the moment. Without this guidance, which started in 2009, you were trying to put this together on your own, essentially, from some good articles and such; it's much easier to look at it from a body of knowledge perspective. Notice we don't specifically make tools, but we do call out a series of tools and reference them from the guide, and where these things occur. We're going to talk about just a couple of them here, because we only have an hour today: we'll do cloud, we'll do CASE tools, a little bit on ETL, some data quality, and then we'll spend the majority of our time on data profiling. Another Gartner assertion here is that three quarters of all databases will be in the cloud starting next year. Whether Gartner is correct or not is another matter — some people take the fact that Gartner makes these predictions as a way to influence the market, no problem. I hope we'll be able to find out next year; in fact, tune into this very webinar and we'll try to update the information.
Let's talk about cloud as a technology for just a little bit. Most people first get very excited about it, and in fact are disappointed if you're not considering cloud. Let's be very clear: cloud has tremendous advantages, but money is not one of them. If you think you're going to save money by going to the cloud, you will be sorely disappointed; you end up managing cloud in a way that gives you linear expansions of scale that don't permit any economies of scale. The way most companies approach cloud adoption is something we call forklifting: they take everything they've got and move it into the cloud. But there are problems with that. Mainly, there's no basis for decisions about whether to include or exclude something — it's all in there. It doesn't include architectural and engineering concepts, particularly the ideas around sharing; there's no recognition that these concepts are even missing from the process. And remember, 80% of your data shouldn't be in there anyway. A better way to think about cloud adoption is instead to treat moving to the cloud as a real transformation opportunity, and I say that data in the cloud should have three attributes that data outside of the cloud does not have: data inside the cloud should be cleaner, it should be more shareable, and, by virtue of the first two, it should be less in volume than data outside the cloud. In fact, Gartner is so convinced of these concepts that they say there are really only going to be three choices in cloud, and they'll not be based on anything that Google, Azure, or Amazon does — they'll be based on the type of data you're attempting to get to. Clouds will become ubiquitous in the near future, and for anything you want in terms of data, cloud will be the obvious selection. That's not to say you can't do it the other way, but if you're just doing it for the sake of being difficult, it doesn't seem to make a lot of sense. Another tool set that, unfortunately, is not taught is CASE systems. They're mentioned in some textbooks at the undergraduate level, but if you're not familiar with them, they're computer-aided software engineering (or systems engineering, depending on how you define it) tools that can help you with the process. In fact, there are some wonderful ones. I've been working with Visible Systems Corporation for many years, and they have the idea of incorporating process, and particularly motivation or planning statements, into the CASE tool, such that you can trace the requirements all the way through to design artifacts. It's a wonderful set of technologies I've been using for years; it was introduced to me by my mentor, Clive Finkelstein. Again, these can keep track of whether components are in various parts of the model, whether you're going to print out the entire model or just parts of it — a very nice way of structuring things, even down to including XML-based support around the concepts. And the thing about the whole set of CASE tools is that there are a lot of different ones, so you shouldn't just go buy one willy-nilly, although, again, there are some very good ones out there on the market, and two main contenders in the area as well. Here's a map that shows the various flavors of CASE tools, so the question is: where are you having trouble in your environment, and therefore how can we select the right CASE tool to apply to that?
So the model of these CASE tools is changing, and you'll see this: the idea that most had to be centered around one piece is giving way to adding metadata and other types of technologies on an ad hoc basis, so that we can use them more in a utility fashion, and you'll see a variety of different interchange mechanisms come about as well. The CASE tool market got so silly at one point that IBM developed a model of everything, and luckily they did release it to the public, so you can get some old back issues of the IBM Systems Journal to look at it. But that is of course overkill and completely unjustifiable from an investment perspective; it's very, very difficult to argue you should have perfect information on everything, although it would certainly make maintenance a lot easier. With most of these techniques, it's actually quite easy to do your own version at first. And I say that because if you're going to look at dropping seven figures on an integrated glossary of some sort — which is the most popular technology right at the moment — and you haven't tried anything yourself, you're probably not able to have an articulate conversation with the vendor. So it really is in your interest to build your own at first. It's fairly easy in most instances to build your own version, experiment with it, find out how well it solves the problem you thought it would solve, and then decide whether making a significant investment is worthwhile. Take the idea of a repository: you can use a spreadsheet. It's not ideal or optimal, but it has been done before, and if you take the spreadsheet and push it out through an HTML page — so that everybody can get to the contents of the spreadsheet, updated weekly, through a web browser with a simple search function — that's some terrific functionality. In order to manage metadata you have to manage metadata repository functionality, which means whatever you're doing for data management applies to metadata management as well. Moving quickly here, let's talk about the E-spaces for a little bit. There are three flavors. There's extract, transform, and load (ETL), which is delivering data that's been preformatted to a new database, or a data lake of sorts, or something along those lines. There's also enterprise application integration (EAI), which is making the apps work better together as a family. And there's enterprise information integration (EII), which is an information layer, perhaps over top of your application layer. These are some of the things you can use, at industrial strength in particular, to try to help you with data preparation. It's likely you're going to become aware of quality issues, and it'll often be the case that somebody's giving you data with known quality problems, so there are four categories of data quality activities: analysis, cleansing, enhancement, and monitoring. Think of it as plan-do-check-act — it follows the same process — and there's a fair set of tools in here as well. Again, very briefly: data comes in, it's not optimal, and when you clean it and then group it, you can get to the arrangement you'd like to get to.
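As a concrete illustration of the homegrown glossary idea above — a spreadsheet pushed out as a web page with a simple search — here is a minimal sketch. The file name and column headings (term, definition, steward) are hypothetical, and a real deployment would just sit behind whatever web server the team already uses.

```python
# Minimal sketch: turn a CSV export of a glossary spreadsheet into a single
# static HTML page with a client-side filter box. Column names are assumed.
import csv
import html

rows = []
with open("glossary.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        rows.append(row)

table_rows = "\n".join(
    "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(
        html.escape(r.get("term", "")),
        html.escape(r.get("definition", "")),
        html.escape(r.get("steward", "")),
    )
    for r in rows
)

page = f"""<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>Data glossary</title></head>
<body>
<input id="q" placeholder="filter terms..." onkeyup="flt()">
<table id="t" border="1">
<tr><th>Term</th><th>Definition</th><th>Steward</th></tr>
{table_rows}
</table>
<script>
function flt() {{
  var q = document.getElementById('q').value.toLowerCase();
  var trs = document.querySelectorAll('#t tr');
  for (var i = 1; i < trs.length; i++) {{
    trs[i].style.display =
      trs[i].textContent.toLowerCase().indexOf(q) >= 0 ? '' : 'none';
  }}
}}
</script>
</body></html>"""

with open("glossary.html", "w", encoding="utf-8") as out:
    out.write(page)   # republish this file weekly so everyone sees the updates
```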
Here is a preparation set of tools that can help you with things like phone numbers — I've got a link here to a very nice video by David Loshin that's certainly worth spending eight minutes on, about taking a look at telephone numbers; you'll never look at them the same way again — as well as identity matching and resolution tools. These are absolutely critical, and even more so as you head further into the cloud and into data lake territory, where that dimension becomes even more crucial. Enhancement tools let you add things like date and timestamps or geographic information, or merge the data with other things — again using, of course, our Social Security number... no, he didn't say that. Using Social Security numbers that way is illegal, remember, but that's of course what everybody tries to do. Then there's enhancing the data that comes out and reporting on it, just so you can get a general sense of what's going on — you'll see some examples of that when we get to the more advanced section. There's also using portals as a tool. I love the concept of data branding, and I've worked through it with a number of organizations; in fact, I'm reading a book about Nokia right now and I see the results of something we did there back in the day that actually made it all the way through at Nokia, so that was kind of an interesting full circle. The idea of branded data is data that is of good quality, or at least of known quality, as opposed to data of unknown quality; there are all sorts of things you can put in place around that. And finally — here's where we're going to spend a fair amount of time — is profiling. Now, I say profiling tools, but what I'm going to show you can be duplicated by a good SQL programmer, so do not think of this at all as something you have to go buy. In fact, what I'm going to show you, you cannot buy, because the vendor is out of business, but various vendors have it in various components. So, when you're coming into a new environment, whether you're a brand-new data scientist or a new data leader taking over in another organization, you don't know what's in it, and you poke around a little bit and get a sort of data inventory. This is just one example from one client we were working with over the years: you group perhaps-alike chunks of data that go together — okay, great. Now let's actually look at some numbers. Oh, here we go. Wow, what have we got? 13,000 attributes, 6,600 of them unique — okay, that's a bunch. Here's another system where we've got almost 17,000 elements, and another one with 9,000. I can tell you, by the way, the first one was PeopleSoft, the second was a homegrown system, and the third was an Oracle Financials component — just as examples. Well, when you're faced with an environment like this, and that's about all the information you have across all these types of data, there are very few actual measures you can use to compare across all types of data, and so it's a very interesting challenge that needs a lot more research. But where we've done the best was with work from a PhD named Dina Bitton, who used some research money from the Defense Department to develop this set of algorithms called data profiling — sometimes called data discovery or data analysis. The question put to her was: can you speed up software technology analysis by a factor of 10? The answer turned out to be yes.
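Since phone numbers come up as the classic cleansing example, here is a small, hedged illustration (not any vendor's tool) of the kind of standardization step involved: collapse assorted formats to one canonical form so that duplicates compare equal. The chosen canonical format and the handling rules are assumptions for illustration only.

```python
# Illustrative only: normalize assorted U.S.-style phone number strings so
# that values such as "602-789-7200" and "(602) 789 7200" compare equal.
# Real identity-resolution tools handle far more edge cases than this.
import re
from typing import Optional

def normalize_us_phone(raw: str) -> Optional[str]:
    digits = re.sub(r"\D", "", raw)           # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                    # drop a leading country code
    if len(digits) != 10:
        return None                            # flag for manual review
    return f"+1{digits}"

samples = ["602-789-7200", "(602) 789 7200", "1.602.789.7200", "789-7200"]
for s in samples:
    print(f"{s!r:20} -> {normalize_us_phone(s)}")
```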
The key to this turned out to be that the bottleneck in the process is generally the subject matter experts we need — we call them SMEs — to find out what's going on. The old way, you'll see, was as complex as it sounds; this way is easier (not easy): we can pre-assign these pieces and, most importantly, semi-automate the process. The word semi is critical, because many people think it should be fully automated, and that's very difficult to do. I mentioned the way we used to do it in the past — I did this for years for the Defense Department — where you walk into a room, you set a beamer, a projector, down on the table, and you say, tell me about your business, and there's nothing more scary than the fear of a blank screen that stays blank. I did that, but it's an awful sort of process. The new way, as I mentioned, is semi-automated, and we can focus it within a well-established process that we're going to walk through in general terms. It gives you metadata in a repository-independent format, and, as you'll see as we go through this, we are doing quality analysis at the same time — it's a wonderful, very powerful technique, and it's repeatable. It gives you current information — current as of a certain point, but reflecting your data as it currently is — and it gives you the ability to verify and accurately see what's going on. These next few screenshots are from a tool called Migration Architect, which, as I mentioned, does not exist anymore, but these capabilities are available in a number of different tools from the vendor community. Here we're doing, first of all, what's called column profiling. If you look on the left-hand side, under the Sybase piece we have here — let me see if I can actually move my cursor over here, maybe not — it shows that we're looking at the customer order table. If you look under the columns section you'll see customer order goes down a way and then it goes to employee info, which is the next table. On the customer order table the attributes are order number, order date, ship date, PO number, and so on. What you're seeing here is a bunch of information gained from looking at a sample of the data — generally a sample of about 16,000 records, randomly collected, and you have to know a little bit about the data organization in order to do this. It will tell you a lot. On this screen I'm just showing you the number of distinct values in the first row of customer order, which is order number: there are 197 distinct ones out of roughly 15,000 records, so you can see the ratio is very low in that case. This gives you some factual information about a subset of the data. If you drill down further, you'll see the ship date, for example, is supposed to be a date-time field — that is inferred from the metadata — but in this case the tool is also telling you that it's incompatible. Because you're perhaps using this to plan an ETL process or a database migration, you want to see whether these incompatibilities are going to be problematic; you don't want to load the database incorrectly a bunch of times, you want to get it right the first time.
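The column-profiling pass described here — distinct counts, inferred versus documented data types, null checks — can be roughed out with a few lines of pandas. This is only a sketch of the idea, not the Migration Architect behavior; the file, column names, and "documented" metadata below are hypothetical.

```python
# Rough, do-it-yourself column profiling: for each column, count distinct
# values, infer a type from the sample, and flag columns whose inferred type
# or null rate conflicts with the documented metadata.
import pandas as pd

df = pd.read_csv("customer_order_sample.csv")   # e.g. a ~16,000-row sample

documented = {                                   # from the data dictionary
    "ORDER_NUMBER": {"type": "int64",          "nullable": False},
    "ORDER_DATE":   {"type": "datetime64[ns]", "nullable": False},
    "SHIP_DATE":    {"type": "datetime64[ns]", "nullable": True},
    "PO_NUMBER":    {"type": "string",         "nullable": True},
}

for col, meta in documented.items():
    series = df[col]
    inferred = str(series.convert_dtypes().dtype)
    distinct = series.nunique(dropna=True)
    null_ct = int(series.isna().sum())
    problems = []
    if meta["type"].split("[")[0] not in inferred.lower():
        problems.append(f"documented {meta['type']} vs inferred {inferred}")
    if null_ct and not meta["nullable"]:
        problems.append(f"{null_ct} nulls in a NOT NULL column")
    print(f"{col}: {distinct} distinct / {len(series)} rows "
          f"({distinct / len(series):.1%}) "
          + ("OK" if not problems else "FLAG: " + "; ".join(problems)))
```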
Again, you can look at the documented types of information: it says nulls are not permitted, but it also tells you there are nulls in there, so that's a problem, and again it flags it as incompatible. There are lots of things the computer can do for you. Here's another one, for a home company and a home telephone number field, where there were multiple versions of the home company. Notice here, when we clicked on row 31, the pop-up at the bottom shows us there are three instances of 602-789-7200. If you've got a master data situation and you're trying to figure out which one gets the bills, how do you know? So it could be a problem there. I'll show you a more comprehensive view of this, but first let's take an approach where, for example, the question might be: why is pay code an asterisk — what is the meaning of that value? Well, if I click on that information I can find the actual distribution, and you can see it occurs 11.49% of the time, and I can double-click there and find out that somebody decided an asterisk in that column meant the pay code was from the United Kingdom, and sure enough the information comes up to show that. So this is how you can use these technologies interactively, and I'm going to show you a video of this now — a little movie. If you're interested in these, let me know; I can happily share them with you, but I'll narrate it quickly, it's about a minute and a half. And of course that's why Shannon records all of these, so you don't have to absorb it all the first time around. So the first thing I'm going to do is import the metadata — reading it in from wherever it lives, the header, the database record, or whatever else — and bring it in. Notice that in the next part I set up my search parameters for the sampling and bring in values. This is so old that we used to have to worry about processing time; we would set these things up on a Friday afternoon and hope they were done by Monday. That was pre-cloud, but nevertheless there is an analysis period. So here are the results of the analysis on the employee detail table. We have the attributes, the documented data type, and the inferred data type — inferred by actually looking at the data — so, is there a variance between the inferred and the documented? We have documented and inferred minimum values, and again whether there's a flag in play; we have documented and inferred maximum values, and again we can see whether these are compatible depending on what I've set up for my target rules, whether I'm allowed to have nulls or not, and the number of distinct records. It's also fully sortable, so you can do a lot of work in here, and you'll notice it actually supports some workflow characteristics. We've selected employee detail and set it to the top. Let's take a look at what's going on here: it says it's compatible — oh, look at this, there are different formats in the record. I know how to fix that; that's an easy problem to solve. Here's one for employee sex: let's look at this — oh goodness, look, zeros, which isn't a validly formed value, so we have a problem with that as well. We have employee address lines: it says nulls aren't permitted, yet sometimes you have nothing in there as well.
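The drill-down on the asterisk pay code is just a value-frequency distribution followed by pulling the rows behind the surprising value. A minimal sketch, again with hypothetical file and column names.

```python
# Frequency distribution for a single column — the "double-click" step:
# see how often each value occurs, then pull the rows behind a surprising
# value as evidence to bring to the subject matter expert.
import pandas as pd

df = pd.read_csv("employee_detail_sample.csv")

dist = df["PAY_CODE"].value_counts(dropna=False, normalize=True)
print(dist.round(4))             # e.g. "*" might show up 11.49% of the time

suspicious = df[df["PAY_CODE"] == "*"]
print(suspicious.head())         # what do these rows have in common?
```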
These are all problems you can work, and more importantly you can work your way through each attribute of each table you bring into the system. So, in two minutes, that was column profiling — how to introduce yourself to these pieces. Let's go to the next one. After we look down the values of every row of every column of the database, we can now look across the columns and see what changes when something else changes. I know that's very vague. Here are all the possibilities — the inferred dependencies up here on the left. You can see that the determinant is customer name, or a combination of customer name and order date, and the dependency is the shipping state. Well, the tool sets them all to true for starters, and then we have to go back and decide whether they are, in fact, real or not, because the real ones we can promote into a relationship — an official business rule that needs to hold between those columns. And you can notice something different here on the right-hand side: order number and PO number. The order number determines the dependent PO — purchase order, that's the word I was looking for. So what this is doing is looking at every combination of every row and every attribute and asking: what's interesting about this? Here's our second little preview movie. What we're looking at is the employee Social Security number — remember, we're not supposed to use this, but this was back in the day — and the employee Social Security number determines birth date. Yeah, no, we may want to think about that one a little bit. You can notice it has found all the possibilities; our job as experts in data preparation is now to say which ones are actually real. So the dependencies are there, the determinants are there; let's take the various components and pick one: the employee job code in this case. There's the job code — the gray, by the way, indicates an in-between match. Oh, look at this: G and F should both be in the 70s, and one of them is and one isn't, so let's fix them. Which one's the error? Then we can actually promote that determinant — job code determines, in this case, current title — and make that a rule in the model. Let's take a look at the conflicting data that's there. Oh look, 8071 is either Assistant or Systems Analyst; by the way, this is the best argument ever for letting people define things from predefined pick lists. You can see lots and lots of little minor data quality errors in here that are really problematic; if you're going to do an analysis of these you of course want to have all of this consistent, much less if you're going to migrate it from one system to another. Again, we can add these to the model — notice they've now determined the effects, we push them all the way out to the model, and notice the pieces come out on the other side here, there we go, now locked in — and we can start to work our way through. So we've looked at all these things and said, yep, employee ID is going to determine all the rest of these things. Let's add these to the model as well. And in one swoop we can add all of this information to the model, and notice it keeps track of what you've done, so that you can keep the workflow going as well.
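Dependency profiling boils down to testing candidate functional dependencies and listing the conflicts, like job code 8071 mapping to two different titles. Here is a do-it-yourself sketch under the same hypothetical-extract assumptions as the earlier snippets.

```python
# Crude functional-dependency check: does JOB_CODE determine CURRENT_TITLE?
# Group by the candidate determinant and list any groups that map to more
# than one dependent value - those are the conflicts to resolve with an SME.
import pandas as pd

df = pd.read_csv("employee_detail_sample.csv")

def dependency_violations(frame, determinant, dependent):
    counts = frame.groupby(determinant)[dependent].nunique(dropna=True)
    bad_keys = counts[counts > 1].index
    return (frame[frame[determinant].isin(bad_keys)][[determinant, dependent]]
            .drop_duplicates()
            .sort_values(determinant))

conflicts = dependency_violations(df, "JOB_CODE", "CURRENT_TITLE")
if conflicts.empty:
    print("JOB_CODE -> CURRENT_TITLE holds; promote it to a rule in the model")
else:
    print(conflicts)             # which value is the error? ask the SME
```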
And that was a very, very quick two-minute overview of what happens. We've looked down at each attribute — that's the first part, column profiling — and we now know the characteristics of each attribute: the minimum, the maximum, whether it can be null, how often it represents the primary key on a percentage basis. Then we've taken that same table and compared every value change to every other value change; some of those dependencies are real and some are not, and we want the real ones promoted into the environment so we can figure out exactly what we're attempting to do and how we're working with it in order to prepare the data. The third stage of this is redundancy profiling, and this is the idea that the more data I read into this particular tool, the more it will remember. You can again accomplish this yourself with your own set of programming tools, but let's watch a little bit of what happens here. First of all, we're looking at the overlap between SUP ID, which is supplier ID, and employee ID, and noticing that there are 16 distinct values shared between the two. So, maybe a problem: if I'm doing a fraud analysis and I find out that a small percentage of my employees are actually vendors to my company, and they're not registered as such, that 12% may or may not be a problem, depending on what I'm attempting to do. In our third little video we're looking at order records and the ship-by ID attribute, and comparing it to employee detail and the employee ID attribute, and notice there's a 39% overlap between those two. If I find out that's the case, I can now go to my subject matter experts and show either the overlapping values or the non-overlapping values, or show them all — different ways of configuring the tool and doing the analysis so that you can ask the right questions. If we do find the columns represent the same thing, we can make them a synonym, and all of a sudden I am accomplishing integration across my enterprise by being able to find out, in a semi-automated way, how these records relate to each other. The more you read into the tool, the more likely it is to notice these specific connections. Now, think about what I've just gone over in the past 15 minutes: a fairly detailed look at a very generalizable approach to data preparation. It's a very good way of introducing yourself to a data set. Where we used to run into bottlenecks in the past was that nobody felt it was interesting enough to do. So you want this function, this capability, housed within one special group in your organization. We like to make sure they are part of the data program, so that everybody is on board, and think of them like a fire station utility: when I have a data problem, I can bring this set of technologies to bear on that problem. It's probably not a good idea to train all of the knowledge workers in your organization on this, but it probably is a good idea to make sure that all the data people in your organization know that these capabilities exist and are available to them as data specialists.
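Redundancy profiling — the overlap check between, say, a ship-by ID column in one table and an employee ID column in another — can likewise be approximated with plain set arithmetic. File and column names here are hypothetical.

```python
# Redundancy profiling in miniature: how much do the values of a column in
# one table overlap with a column in another? High overlap suggests the two
# may be synonyms (integration candidates); a small, unexpected overlap may
# be worth a fraud or quality question to the subject matter experts.
import pandas as pd

orders = pd.read_csv("order_record_sample.csv")
employees = pd.read_csv("employee_detail_sample.csv")

left = set(orders["SHIP_BY_ID"].dropna())
right = set(employees["EMPLOYEE_ID"].dropna())

shared = left & right
union = left | right
overlap_pct = len(shared) / len(union) if union else 0.0
print(f"{len(shared)} shared values, {overlap_pct:.0%} overlap")
print("only in orders:   ", sorted(left - right)[:10])   # show SMEs either list
print("only in employees:", sorted(right - left)[:10])
```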
Let's compare now to the 80/20 process I told you about. This was considered a very significant advance when I was doing this for the Defense Department — I worked for DoD for about 10 years as a federal employee doing this. The idea was that you spent most of your time working with subject matter experts in sessions during the week — we called them afternoon and morning sessions — and working that way gave us only about three days of model preparation in most cases; many of the Fridays were canceled, just because it was, again, a wearing process. The revised process, the result of semi-automated reverse engineering technologies, allowed us to do much more offline and only bother the specialists a couple of times during the week to come up with the same information. We also developed, as a result of this, a way of estimating how long the data preparation would take. It was a baseline system survey, and it was basically a function of the amount and the condition of the evidence you had. So if you had the actual code, that was terrific; sometimes you did, sometimes you didn't. Sometimes the system was written in a programming language that might be obscure — one of the more obscure ones I worked with was the programming language MUMPS; some of you in healthcare facilities may be familiar with that, a very interesting set of technologies. And if you have the ability to add automation, then the effort can generally be reduced. These give you your project characteristics, and then you compare them against the way you've done this work in the past in order to come up with your estimate. The idea is not that you're going to come up with the right number, but if you start doing this early, you will get better at estimating by understanding the components that go into how long preparation takes. One of the more fun examples of this was that we found a bunch of empty data warehouse tables, which meant the client could postpone a hardware infrastructure upgrade — of tremendous value to them at the time, and it made a good, strong case for normalization. To their credit, they were really proud that the Board of Directors for the warehouse actually gave the team a standing ovation over that. They were also able to preserve multimillion-dollar pre-sort discounts for the shipping facilities, along with all sorts of other benefits along the way. So as we get into the takeaways, let's start with a very straightforward statement: we're all guaranteed employment for the rest of our lives — so says Peter — because the gap between the growth in data and the growth of our ability to provide analyst support for it continues to increase, in spite of everything we have done to speed up the process. And yet everybody knows it takes 80% of the data analyst's time to prep the data, and we still have this incredible growing gap. It's a problem. One of the biggest fallacies we fall for is the binary solutions fallacy: the idea that we have to either automate something or not. So I'm going to give you just two minutes or so on diminishing returns, because that's clearly where you should invest your resources. Suppose you're looking at a four-week project, and let's pretend, for the sake of argument, that it costs $1,000 a week to do this activity, regardless of what it is. Suppose that by week four I've solved 55% of the problem.
That might be a very useful set of numbers for figuring out what's left in the process. Of course, we may learn that we can ignore a small percentage of the data, and that the problem space was generally shrinking as the overall effort went on. Working another 10 weeks, we've gotten to the point where only 9% of the problem space remains to be solved. Somebody could look at this and say, I've spent $14,000 — it's a simple number, $1,000 times 14 weeks — and is it worth more money to try to gain some additional ground? In one case, somebody said it would be worth $10,000 more to get one more piece of data they were afraid they wouldn't be able to pick up on their own, and we were able to achieve that for $5,000; that seems like a win. In the same way, we also clearly showed that one in five pieces of data they had was just absolute rubbish, and that we were able to solve 70% of the problem automatically, which meant that instead of having to solve the whole thing — this is a real-life example with the Defense Logistics Agency — the problem was only 150,000 items instead of 2 million. Now, the reason that's important is that the original projection for how long this would take went like this: there were 2 million national stock numbers — NSNs, or SKUs as you know them in the private sector — and we put a measure out of five minutes to cleanse each one, which meant we needed 10 million minutes. We were working 48 weeks a year, five days a week, seven and a half hours a day — 450 minutes a day, or 108,000 work minutes in a year — and 10 million divided by 108,000 gave us, rounded up, 93 person-years required to cleanse this data. At $60,000 a year, which was the number we were told to use in those days, the original plan would have ended up costing 93 person-years and roughly five and a half million dollars for the DLA for this particular exercise. Because we were able to submit a revised process and only needed to cleanse 150,000 items, the total time in minutes dropped precipitously; instead of the person-years running up, we ended up with just seven person-years required to work that smaller 150,000-item pile, at a cost of $420,000 — saving the government about $5 million. Now, that's just the start of it. The other part is that I like to do a little social engineering as well. Those of you experienced in the area know that there's no way any technology can help you cleanse the average data item in five minutes. So if we double the estimate, we're at $11 million, and if we go to the real number, which is about an hour per item, we can see the figure approaching tens of millions of dollars. So the idea is finding the right point of diminishing returns with respect to your tools — don't expect or try to achieve perfection out of them, but find something that works. There's nobody in the government who isn't happy to save $5 million, and this process helps make your data migration process, your data preparation process, much faster, better, cheaper, and less risky.
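The DLA arithmetic is easy to reproduce; here is a back-of-the-envelope version using the figures from the talk (five minutes per item, 108,000 work minutes per person-year, $60,000 per person-year).

```python
# Back-of-the-envelope version of the DLA example from the talk.
MINUTES_PER_ITEM = 5                       # the (optimistic) cleansing estimate
WORK_MINUTES_PER_YEAR = 450 * 5 * 48       # 7.5 h/day * 5 days/week * 48 weeks
COST_PER_PERSON_YEAR = 60_000              # the rate used at the time

def cleansing_cost(items, minutes_per_item=MINUTES_PER_ITEM):
    minutes = items * minutes_per_item
    person_years = minutes / WORK_MINUTES_PER_YEAR
    return person_years, person_years * COST_PER_PERSON_YEAR

py_all, cost_all = cleansing_cost(2_000_000)   # original plan: every NSN
py_sub, cost_sub = cleansing_cost(150_000)     # after dropping the rubbish

print(f"all 2M items : {py_all:5.1f} person-years, ${cost_all:,.0f}")
print(f"150K subset  : {py_sub:5.1f} person-years, ${cost_sub:,.0f}")
print(f"savings      : ${cost_all - cost_sub:,.0f}")    # roughly $5 million
```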
The key is to understand it as a system of people, process, and technology, a three-legged stool whose legs are interdependent, and the better you understand your needs, the better you'll be able to evaluate and use the tools. The value there can be gained by the general approach of introducing yourself to a data set, under any set of circumstances, with a number of different technologies. Data volume is still increasing, so anything we do in this area is going to help organizations measurably improve their productivity. The idea is, let's find existing technologies; let's not reinvent things. Vendor communities will typically provide user groups and forums and places where you're able to go and share information and knowledge around these topics, in order to figure out the types of things it makes sense for an organization to actually invest in and use. So the idea here is to really take a step back, and instead of looking at a tool to provide a specific function, find out what general class of functions you use, look at the tool as a component of your data management technology architecture, and figure out what parts of your data preparation process, which are largely unknown at the moment, could benefit from some improvement. Once you know where those improvements apply, you can try putting some technologies in place yourself, which will give you an idea of the questions you can then use to interact with the vendor community and talk to them in the way they'd like to be talked to. The vendor community is super educated and super happy to talk to you, and measurably excited when they find customers that have a really good idea of what they're attempting to do with their tools and are able to come up with that information very quickly and very easily. But we're getting to the top of the hour here, so I'll just remind you we've got another set of these coming up: we're going to look at conceptual versus logical versus physical data models in the next one, which should be an interesting discussion, then we'll do the importance of metadata and move our way on up the chain. And since we're back at the top of the hour, we're going to welcome Gary back to the group and see what questions you guys have for us around all these topics. I love it, Peter, thank you so much for another fantastic presentation. If you have questions for Peter or for Gary, feel free to put them in the Q&A portion of your screen. And just to answer the most commonly asked questions, a reminder that I will send a follow-up email by end of day Thursday for this webinar with links to the slides and links to the recording. No questions coming in yet, but there's been a lot of discussion and interesting topics in the chat. Yeah, go ahead. Gary, can I put you on the spot? Absolutely, that's why I'm here. Exactly. So, we were talking before about the way in which people become aware of the various technologies and the tool sets that you sell. How did you yourself become aware of them? Through my long tenure at Quest; this is my 22nd year with Quest Software, specializing mostly in the information management products. However, I got a taste of that all the way back in the first 10 years of my career, when I was a systems analyst designing systems for a couple of large mainframe shops, and we used a variety of tools, which I still see as the common practice today for most of our customers.
Literally, some of our customers use hundreds and hundreds of different tools to manage their data and the data management life cycle. Things from, you know, CASE tools to data modeling tools to data governance tools that perform things like data lineage, metadata, and master data management. The list goes on and on; you're not going to find one tool that does everything. So the short answer is: partly through my first 10 years as a systems analyst, and then Quest Software has done an amazing job of creating some products in house and also acquiring products that dovetail very nicely with their vision in terms of information systems management. Exactly. Over the years it's a management team that has become known for knowing the space, knowing what niches to fill. The reason I brought this up as the topic for everybody, though, is that we, and I'm going to put my academic hat on here, we have failed you in the academic community. We don't even tell you that these technologies are useful. I remember the first time I ran into Toad: I was doing something at a commercial bank in New York City and had to look inside of something, and somebody said, well, why don't you just use Toad, and I thought they were kidding; in those days Toad was considered a strange name. We've gone way beyond that particular piece, but I learned how to use that technology very, very well. The fact that Quest decided to pick it up and make it part of their portfolio is representative of the kind of thinking that goes on. But it is terrible, in my opinion, that the university community doesn't educate people about these things; I'm willing to bet, Gary, that many smart people have invented Toad on their own and then discovered that it already existed. It's just such a useful utility, and that's, you know, one part, as you said, of a portfolio of these kinds of technologies. It just is a shame that we haven't gotten better in the academic community at saying these things are important: when you're out there, your productivity is the one thing management can really complain about and really benefit from, and these tools are all about making you more productive as a data professional. Yeah, as a quick follow-up to that, and I appreciate your comments: I used to teach part time as an associate faculty member on the mathematical sciences side at Purdue University, but I think part of the problem there, Peter, is that technology changes so rapidly that it's really, really hard for folks to keep up, even folks that are in the profession, with the proliferation of tools. That certainly is the case. As an older faculty member myself, I can tell you that I can barely keep my Amazon knowledge current, much less anything else that needs to go on in there, and again, it's wonderful stuff, but we should teach them that these tools exist. We may not know exactly what shape or size they are, but we need to let these young people know that there are better ways. It makes us look dumb; this is where we get the comment "okay boomer," although you may not be old enough for that particular comment, Gary, but people are like, come on, this is old stuff, you guys don't really use COBOL anymore. Well, you know what, most of the fraud in the PPP program happened because people didn't understand COBOL really well. Yeah, I agree. Oh, sorry. Sorry, Gary, I didn't mean to interrupt you. I was just agreeing.
Well, you know, speaking of tools, a question came in: what data profiling tools do you recommend, Gary, from Quest? So there are a couple of names; I saw that question, so thank you for calling it out. First of all, you need to look at a product called Toad Data Point. That is a client-based product that sits on your Windows workstation, and the professional version of that product does have a data profiling engine. That engine presents statistics on what we see in terms of the values in each of the columns you're looking at, whether it's a table or a query you've created and we're looking at the data from that query. We can analyze the quality of the data, the distribution of the data; we can give you statistics on what we're seeing, the top values, bottom values, etc. So that is a product that would be well worth your time to take a look at. That's not the only thing it has; it does have a nice transformation and cleansing engine. It can do cross-platform migrations and cross-platform queries, if you will, to blend different sources together, a number of different capabilities that I can guarantee a data analyst would love. The other side of the coin there is Erwin, which has capabilities for dependency analysis and data governance in terms of data lineage, intelligent data glossaries, and things like that; that's much more of an enterprise-level solution, whereas Toad Data Point is a, you know, quick client-based solution. But those two product solutions I would definitely take a look at in terms of Quest. If I can add on to that: I actually have in my hand a CD with ERwin 3.5.2 on it that I carried around with me as a consultant to work on various projects; it was fully licensed to me. I would walk into a place where people said, I don't know how to do this, I don't know what the data actually looks like, hook it up, run the reverse engineering engine, and it spits out a logical, excuse me, an accurate physical as-is model that you can take a look at, and people would just look at me like I was a unicorn. Absolutely, good stuff. Yeah, I'll just say, data modeling is one small part of the Erwin stack of solutions, but your 3.5 is not current; we're on version 12.x, so definitely check it out, there are a lot of things we've added to the product since your 3.x version. You probably haven't put it out on a CD in a long time. Well, yeah, guys, that's a round thing we used to stick in a player to make music out of, before streaming. Some people kind of thought it was a cup, a, what do they call it, a dish... oh gosh, now we're both blanking on it. Coaster. There we go, that's the word we're looking for. Absolutely. All right, well, Shannon, any questions coming in? Not currently. There was a question about writing out the name of the tool, so it's Toad Data Point; I'll put that in the chat. I just put another link in the chat as well, and Gary put one in there earlier, so you can check those out and we'll be sure to include those in the follow-up email as well. Otherwise everyone's pretty quiet; any additional questions? Unusually quiet group, maybe we lost them. Well, I think they're pondering; again, a lot of the discussion has been about tools, so it's very interesting.
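For readers who want to see what column profiling of this general kind produces, here is a minimal, generic sketch using pandas; it is purely illustrative, it is not Toad Data Point's engine or API, and the file and column names are made up.

```python
# Generic column-profiling sketch (illustrative only; not any vendor's engine).
import pandas as pd

def profile_column(series: pd.Series, top_n: int = 5) -> dict:
    """Summarize one column: completeness, cardinality, and value distribution."""
    counts = series.value_counts(dropna=True)
    return {
        "rows": len(series),
        "nulls": int(series.isna().sum()),
        "distinct": int(series.nunique(dropna=True)),
        "top_values": counts.head(top_n).to_dict(),
        "bottom_values": counts.tail(top_n).to_dict(),
    }

# Hypothetical usage against a table extract or query result saved to CSV.
df = pd.read_csv("customer_extract.csv")   # made-up file name
for col in df.columns:
    print(col, profile_column(df[col]))
```

The point is simply that every column gets the same completeness, cardinality, and top/bottom value summary before anyone starts cleansing.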
Well, you can see how hard it is. Gary, how do you advise organizations to keep balanced in their approach to tools? You already mentioned you think it should be a portfolio: learn a bunch of different capabilities and figure out how to use them. Do you agree with me that making that a centralized part of the group would be better than trying to disperse it, uncoordinated, across the organization? Well, it's an interesting question, Peter, because I have quite a bit of tenure; I worked 10 years in IT, and even beyond those years, while I'm not working as an IT professional any longer, I speak to thousands of customers, and I'm well into my tenure at Quest, literally. And it's interesting to see the cycles. Sometimes IT is centralized; these things come in cycles or waves, right? An organization will think, I don't want to centralize my IT organization, there's too much backlog, our business units need to be autonomous, so we're going to have fringe IT departments. And then what happens when you decentralize an IT organization? Things like master data management fall by the wayside. The tools architect falls by the wayside, the one person who says, I'm going to standardize on a set of tools. But when things get decentralized, now you've got lots of business units doing their own thing, and it gets very, very difficult to manage that way. So that's what we're seeing right now: a lot of corporations, a lot of organizations that we talk to find it very, very difficult to standardize on specific tool sets. Once they get knowledge of a tool set that will really work for their organization, they're willing to buy it, they have the budget for it, but sometimes it's hard for someone in the organization to step up and say, we need to standardize on this tool. They truly don't know what they don't know, is what I find, and this is the frustrating part about it, because we should at least make them aware, as part of the standard IT curriculum, that there are ways of looking for assistance out there, the way any good programmer would approach a problem of that type, and I'm sure you designed your own set of tools when you were 10 years in IT in order to come up with a solution to a problem that was in front of you, right? Absolutely. That's a standard part of the way we're wired as IT professionals, and yet, you know, students should come out of college or university saying, all right, who knows about tools, where can I go to find out? And you guys actually support the whole concept of forums and communities and have annual user conventions and everything else that supports your community, right? Yeah, Toad World being one of those; that was the first thing I put up there. But you're right, I mean, at the very least it would be nice, even if you don't touch the products or use them in the classroom, to have a summary that says, hey, these are the types of tools that are out there from these vendors, go take a look at them on your own time, or even better, as part of the classroom curriculum. But sometimes it's difficult to do because the tool sets do change and the technology changes. It is difficult for the university professors to keep up with it, absolutely. And also, again, IT within the university doesn't really want a server set up so that it can have open access.
You know, it's just not the best recommended risk posture. But we do have an awareness issue, I think, and it carries on to DAMA as well. When we survey our members, we find that they, like you, Gary, spent about 10 years in IT, and we constantly hear that if the data were better, things would be easier; nobody's claiming perfect, nobody's claiming 100%, but if you could do something better with the data, that might make things better organizationally. And yet, by the time you say that three times, you become the data person in the organization and you're expected to start providing answers. You look around, you find the Quest website, you find the DAMA International website, and you say, oh, there are my peeps, I can join this community. It's wonderful to be able to talk to other people that are using the same tools and approaching the same kinds of problems, and who perhaps have run into a situation like this before, so I can leapfrog on it. This community is very, very willing to share, and even the banks have been ruled to be non-collusive if they are collaborating at the metadata level as opposed to the data level. Yeah, I appreciate those comments as well. You have no idea how germane and relevant this particular topic of data preparation is. When I talk to customers, the data analysts, and there are different titles for these men and women, they are spending a lot of time trying to bridge the gap between raw data and presenting it to basically the end user or the leadership team, so that people can make decisions and consume that data appropriately. So there's a lot of work that needs to be done there. You mentioned the cyclic nature of it too... I'm sorry, Shannon. If you go on LinkedIn, the number of data engineers and data scientists has been kind of a horse race the past couple of years; one will pull ahead of the other for a little while, and then the other will catch up and pass it, back and forth. It's the same kind of cycle you were describing. It's just very, very difficult; people are not even sure what those terms mean. I've got a group, for example, here in Richmond that says data engineering is all about mechanized transfer, not really focused on the analysis components but more on optimization and things. Again, a perfectly good topic, but we've got a ways to go as an immature discipline and we need to work our way out of it. Peter, I mean, as you say, it's cyclical; we see it all the time. Do you think, currently, with how quickly technology changes, that technology is what's driving the current demand for data prep? I mean, we've heard of so many companies trying to stand up machine learning and just falling down because they didn't have any data prep, and it's driving not just data prep but new job titles for data modelers and things like that. Absolutely. It's ironic, Shannon, that the ML community in particular could benefit the most from the things we're talking about here. In the literacy book I made a big point of saying that we thought the year 2020 was going to be the year that ML ran out of data. And the reason why that statement is still true today is that machine learning has gotten to the point where they are developing what are called learning algorithms.
The idea is, if I've got the right set of data, I can train this algorithm to learn the things I want it to learn, and thus outperform humans: faster, better, cheaper, and with less risk. That's a great thought, but nobody in the ML community understands where to get that data from, and the data that you need is actually the data we've been talking about for the past hour: it is the data in your legacy systems, and it encapsulates the process rules, the understanding of the various quality controls that have been applied at various stages, all sorts of hidden things. Gary, you've never run into stored procedure hell, have you? That chuckle means a big yes. Absolutely. Absolutely. For those of you that aren't aware, there's a way of burying a rule down in a data set that only fires when the attribute is hit; when the attribute gets hit, it kicks off something else that you had no idea was going on, because it doesn't show up in your scans, nobody's managing it formally, and there's just somebody somewhere who knows what happens around those topics. Yeah, it's a huge issue, and so the ML community really needs to understand the kinds of things these technologies can help do and set up, because that's how you're going to create the data sets and training algorithms you need in order to really realize the promise of what's out there in ML at the moment. I don't see interest, I don't see people looking at it, and yet every time we apply this in that context, people just make, what's the word, warp speed, there we go, back to my Star Trek days, leaps forward. Gary, do you guys get into ML at all? I know you have the profiling engine; does it work in that context? Not yet, but I am actually, in my own way, trying to personally push the company to look at that. And I appreciate your comments, by the way; not to deflect your question, but we are looking at ML. It's not there yet in our products, just to be honest, maybe a little bit on the Erwin side, but going back to what you were saying, two things struck me. I think ML is great for certain things. It certainly seems that machine learning is really good at, I guess I could say it this way, making the consumerization of the data more intelligent. But for data prep, you're not going to find a lot of ML, and I'm not an expert at this, right, but we need to look at two sides: one is data prep, and the other is data consumerization so decisions can be made, etc. And it seems like it's very much weighted toward the consumerization side where data is concerned, and not so much on the prep side. So we are certainly taking a look at that; it's going to be a long road, and nothing's going to be here in the next year in terms of us as a vendor, but we do see some AI and some ML being very, very applicable to what we do. And to that end, I was going to ask you one last question from my part: what are you seeing in terms of trends for DataOps? There's the concept of DevOps on the operations side, but how about DataOps? I mean, Eccleston, if you're not familiar with him, has a great webinar where he takes DataOps and walks through it in about an hour.
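As a concrete picture of the kind of buried rule described above, here is a small, hypothetical sketch using SQLite (chosen only because it ships with Python): a trigger attached to one attribute silently writes to another table when that attribute changes, the sort of side effect that never shows up unless you inspect the schema. The tables and the rule are invented for illustration.

```python
# Illustration of a "buried rule": a trigger that fires when one attribute is touched
# and quietly writes somewhere else. Tables and the rule itself are invented;
# SQLite is used only because it ships with Python.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE item (nsn TEXT PRIMARY KEY, status TEXT);
CREATE TABLE reorder_queue (nsn TEXT, queued_at TEXT);

-- The hidden rule: nobody sees this unless they look at the schema.
CREATE TRIGGER auto_reorder
AFTER UPDATE OF status ON item
WHEN NEW.status = 'OUT_OF_STOCK'
BEGIN
    INSERT INTO reorder_queue VALUES (NEW.nsn, datetime('now'));
END;
""")

con.execute("INSERT INTO item VALUES ('1234-00-000-0001', 'IN_STOCK')")
con.execute("UPDATE item SET status = 'OUT_OF_STOCK' WHERE nsn = '1234-00-000-0001'")
print(con.execute("SELECT * FROM reorder_queue").fetchall())  # the side effect shows up here
```

Reverse engineering the schema, triggers and stored procedures included, is how this kind of hidden logic gets surfaced before it quietly shapes a training data set.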
It's a terrific piece. The short version is, if you haven't read The Unicorn Project or some of the other fundamental texts in this area, and you're not looking at them as an opportunity in your organization, you are missing a very important set of technologies. That said, the DataOps component is really about optimizing an existing process, and if you don't understand your existing process, you will have no ability to go straight to optimization. So there is a learning curve in getting there, and DataOps is aspirational rather than an actual implementation in most organizations today. It's at the very peak of that vendor hype curve, and liable to fall into the trough, but we've learned some things over the years from agile, we've learned things from DevOps, and these are all wonderful lessons that can help to inform our process. So it does make sense to take an existing procedure that may be less standardized, look at automating what we can, and document every aspect of it such that somebody can revisit it in a couple of years and understand what our assumptions were and why. My favorite example around all that is an interesting one where we had a set of data on an old set of disks and we had to reengineer it onto the new one, and on the old disks it had the A's through the G's on one disk, the H's through the K's on another, the L's through the O's on another, and you say, why was that? Well, in the old days the disk could only hold 100 megabits, and so that's what you had to do: chop the data up. Is there any sense at all in replicating that design? Oh, I've got actual documented examples of that kind of bad thinking going into the development of QuickBooks Online, not to pick on anybody in particular, but I happen to know a little bit about that one, both as a user and as a critic of the methods they used without fully understanding the problem. I can tell you this one, Gary, you'll get the same chuckle out of it. There is a concept in the QuickBooks server edition of a transaction reference, right? You might imagine every transaction is going to have a completely unique number so that you can look it up, because after all, QuickBooks Online is a very integrated, sophisticated database, right? It was not available to you online because it wasn't visible as a requirement, and the people who were told to document the requirements didn't see it and therefore didn't think it needed to be in the online version, and consequently they have crippled the online version with the inability to do audits in that same fashion. What an insane situation. Sorry, I didn't mean to get on a rant on that one, but I've actually been on the bad user side of it. QuickBooks, if you're listening, it needs fixing. Yeah, I guess it just draws attention to the fact that feasibility study and design are very important. Gary, before we drop off here, do you agree with this 80/20 rule that we've got up on the screen? I definitely do, and I think most of my data analyst constituents and customers would certainly agree with it. You know, the 80/20 rule may vary, it might be 60/40, but it's definitely weighted toward the data prep side. And everybody knows this, right? I think if you don't know it, you've probably lived it. And yet we don't start off conversations with it. So if it's going to take you six weeks to analyze this data...
Sorry, I'd better do some easier math: if analysis is the remaining 20% and it's going to take you 20 weeks to analyze the data, it's going to take you 100 weeks to get through the whole process. Right, that's just kind of obvious math, and yet it's not well understood when people are planning these projects, and of course, when you find yourself faced with that situation, that is exactly why you turn to a company like Quest, or the other wonderful vendors that are out there, and say, what have you got that can help me speed that process up? That's the kind of good conversation that you would love to have with customers. Yeah. All right, well, we managed to keep people online, I guess. If you don't have any questions, guys, we'll call it a day here a little bit early, but Gary, thank you so much for spending time with us today; it's been a joy to work with you on this. And I really appreciate your presentation; a lot of good spots there that really resonate with me, and I know also with my customers, so thank you for presenting that. Yeah, thank you. And again, just a reminder to all the attendees: I will send a follow-up email by end of day Thursday with links to the slides and links to the recording. For those questions about the DataOps webinar you mentioned, Peter, I'll get a link to that as well if you send it to me. So thank you all and I hope you all have a great day. Thanks for joining us, everybody. Thank you, Gary. Thank you, Peter. Thank you, Shannon. Thanks.