Hello and welcome. My name is Shannon Kemp, and I'm the Chief Digital Manager for DataVersity. We'd like to thank you for joining today's DataVersity webinar, Leveraging Data Management Technologies, sponsored by Trifacta. It is the latest installment in the monthly series called DataEd Online with Dr. Peter Aiken, brought to you in partnership with Data Blueprint. Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. If you'd like to chat with us or with each other, we certainly encourage you to do so. Click the chat icon in the bottom middle of your screen for that feature. And for questions, we will be collecting them via the Q&A in the bottom right-hand corner of your screen. Or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag #dataed. And to answer the most commonly asked questions, as always, we will send a follow-up email to all registrants within two business days, containing links to the slides. And yes, we are recording and will likewise send a link to the recording of the session, as well as any additional information requested throughout the webinar. And if you'd like to continue the conversation and networking after the webinar, you may go to community.dataversity.net. Now let me turn it over to David for a brief word from our sponsor for today, Trifacta. David, hello and welcome. You might have been muted. Can you hear me now? We can. All right. Thanks, Shannon. And hi, everyone. My name is David Maximara. I am a product marketing manager at Trifacta. And we're excited to be a part of the DataEd webinar series. We feel the alignment of people, process, and technology, as Peter will discuss, is critical for organizational success in the modern data context. So at Trifacta, we have seen an evolution in the following areas of data management and analytics.
There's been a shift from IT-led data transformation to a more collaborative and co-existent approach between business teams and IT. We've seen a shift in the kind of data people use, from transactional data to data that measures interactions and behaviors. We've seen a shift from on-premise deployments to hybrid and multi-cloud deployments. And we've seen a shift in process, from top-down approaches to processes that are more iterative and collaborative. So there is this 80% problem that's fairly well understood. Here's a quote from DJ Patil, former Chief Data Scientist of the United States, saying, it's impossible to over-stress this: 80% of the work in any data project is in cleaning the data. So this process involves accessing your data, discovering the contents of the data, structuring that data, cleaning that data, blending it with other data sets, validating your results, and then eventually pushing that data into your downstream use cases, whether that's reporting and analytics, data science, machine learning, onboarding that data into an analytics platform, whatever your downstream use is. So the reason that this process is so time-consuming is that the technology being used by most organizations is not exactly fit for this purpose. Either IT teams own the data preparation using ETL tools, which creates somewhat of a bottleneck between IT and business teams. Business teams will send their requirements. IT will return with a spec of that data. That process will go back and forth, and it will take a lot of time. And each time that business context changes or new analytics need to be derived, that starts a whole new process that takes a lot of time. Other times, businesses use tools like Excel to do data preparation. We all know that doesn't really scale. It's fairly error-prone and it doesn't have great lineage. That's not exactly the purpose of spreadsheets.
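The access-discover-structure-clean-blend-validate flow described above can be sketched in a few lines of plain Python. This is a minimal illustration only; the record layout, field names, and cleaning rules are invented for the example and are not from any particular tool.

```python
# A minimal sketch of the data preparation flow described above:
# access -> discover -> structure -> clean -> blend -> validate.
# All field names and rules here are illustrative assumptions.

raw_orders = [
    {"id": "1", "amount": " 10.50 ", "region": "east"},
    {"id": "2", "amount": "N/A",     "region": "WEST"},
    {"id": "3", "amount": "7.25",    "region": "East"},
]
regions = {"east": "US-East", "west": "US-West"}  # reference data to blend in

def clean(row):
    """Structure and clean one record; return None if it is unusable."""
    try:
        amount = float(row["amount"].strip())
    except ValueError:
        return None  # drop records whose amount cannot be parsed
    region = regions.get(row["region"].strip().lower())
    return {"id": int(row["id"]), "amount": amount, "region": region}

prepared = [r for r in (clean(row) for row in raw_orders) if r is not None]

# Validate before pushing downstream: no negative amounts, regions resolved.
assert all(r["amount"] >= 0 and r["region"] for r in prepared)
print(prepared)
```

Even at this toy scale, most of the code is cleaning and validation rather than analysis, which is the 80% point being made.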
Or maybe they use code, but again, there are drawbacks there. Code is very technical. A lot of analysts aren't quite proficient in it. Not to mention, it can be frustrating when you build a data preparation script: you see the results, you realize you need to go back, debug your code, and start more or less from scratch. And if you have to transfer that script over to another person, it's hard for them to figure out what's going on. So Trifacta fills in this gap by providing the best of both worlds. Trifacta combines visual and machine learning guidance, making it easier for non-technical users to discover the contents of your data, to clean that data, to structure and blend that data. And all of this work that you're doing in Trifacta is stored in a recipe that can be compiled and run on infrastructure in the cloud against data at any scale. And these recipes can be orchestrated through data pipelines to get continuous value from your analytics projects. So you're getting the visual ease of use that you might be familiar with in tools like Excel, but you're also getting the automation capabilities of code and ETL. Don't just take our word for it: Trifacta has some of the largest organizations in the world using our technology, as well as startups looking to get a competitive advantage. And in the spirit of education, we have a new series out of the Data School with Joe Hellerstein. Joe is a computer science professor at UC Berkeley. You can check out this series on our blog or YouTube channel; it's an ongoing educational resource for professionals who work with data, people who work with data systems, and managers who define data strategies. And if you'd like to learn more about Trifacta, please visit our website at trifacta.com. And with that, Shannon, I will hand it back to you. Thank you, David, so much.
And if you have questions for David about Trifacta, he will be joining us in the Q&A portion at the end of the presentation with Peter. And now let me introduce to you our series speaker, Dr. Peter Aiken. Peter is an internationally recognized data management thought leader. Many of you already know him or have seen him at conferences worldwide. He has more than 30 years of experience and has received many awards for his outstanding contributions to the profession. Peter is also the founding director of Data Blueprint. He has written dozens of articles and 11 books. The most recent is Your Data Strategy. Peter has experience with more than 500 data management practices in 20 countries and is consistently named as a top data management expert. Some of the most important and largest organizations in the world have sought out his and Data Blueprint's expertise. Peter has spent multi-year immersions with groups as diverse as the U.S. Department of Defense, Deutsche Bank, Nokia, Wells Fargo, the Commonwealth of Virginia, and Walmart. And with that, let me turn everything over to Peter to get today's webinar started. Peter, hello and welcome. And we're not quite seeing your screen yet. There we go. And you're muted as well. There we go. You've got me now. Thank you so much. Hi, everybody. Let me give a special shout out to my mom on her 85th birthday today. And let's dive in here and talk about data management technologies. So really key to understanding this is that data management technologies are one part of a three-legged stool, and David referred to that earlier as well. So what we're going to do is talk about some technology considerations. We'll talk about data technology architecture, CASE tools, repositories, profiling and discovery tools, quality engineering tools, lifecycle tools, and other types of technologies, just to give you a very, very dense one-hour session on what these are.
So let's dive in a little bit closer and take a look at some technology considerations, first of all. And the most important thing to keep in mind is that even people who've been in data for a while tend to know only one part of data. It's like the parable of the blind persons and the elephant: somebody who's feeling the ear says the elephant is very much like a fan, somebody who's feeling the trunk says it's very much like a snake, somebody who's got the leg says it's like a tree, and somebody with the tail says it's a rope. And of course, if you're up against the side of it, it's like a wall. All of them are correct from their perspective, but when we talk about data management, it's kind of an amorphous subject. So we talk about data management as everything that happens between the sources and the uses of data. Now, let's get to a more refined piece of this, which might be that we want to add the word reuse in here. If we're only ever processing new data, then we're saying consciously that there's never any value in reusing data. Of course, we know that's incorrect and that there is lots and lots of value in reusing data. So here's perhaps a better way to think about data, and I think it explains a bunch of the challenges we have in the discipline. If we think about sources, there's a particular set of activities we can organize them by: collection, evaluation, preparation, evolution, access, and storage. We generally call that data engineering, on the left-hand side. There's not even widespread agreement about that, but that's where it comes in. We've got to make sure the data is stored, and then we exploit the data on the other side of it, where we have things like data science, data delivery, presentation, and data storytelling. Again, these are aspects of data management, just like the parts of the elephant, but the real key is that that's for data use.
If we want to talk about reuse, we've got to build a lot more into our program. So everything I'm showing you there in teal generally does not exist, for the most part, in most data management programs. And our question from a technology perspective is: how can we look at this and see that there are certain ways technology can be helpful and certain ways where technology can't be helpful? That's what we'll try to cover in this hour. There's also, of course, governance. And governance, again, has got to extend the entire way through the data lifecycle, which means governance should apply to tool selection as well. Let's think about how it works. I use this analogy of making a better data sandwich, and I got this analogy from a tea farm in India. I'll show you at the end where it comes from. But right now we know that data literacy is of uneven quality. It turns out the U.S. government has been tracking literacy around these issues, and we're holding our own, but that's not good in the face of an increasing amount of data. The data supply is, of course, of uneven quality, and the usability of data standards is also uneven. So it doesn't work together as well as it could. Our job as data professionals is to smooth out these processes and then engineer ways that they will start to work together. And this simply cannot happen without engineering and architecture understanding. So this concept of domain-specific knowledge is really important for evaluating tools. And I mentioned this came from, excuse me, a visit I had made to India a couple years back, where the tea farm that I'm showing you the picture of had a sign at the cash register saying that quality does not happen accidentally. The same is absolutely true of data engineering and architecture work products: they do not happen accidentally. Just think about it: anything that doesn't work is sand grinding the gears of your organization. So technologies by themselves are just a one-legged stool.
When and if we go back to flying again, sitting on a one-legged stool on an airline ride is not even considered remotely safe. You need at least three legs. And the three legs are specifically the people, process, and technology legs. And while they're legs, they're also interdependent. It's really key to understand that only by working together in the right combinations will these things work. Interestingly enough, however, the most recent poll done by our colleagues at NewVantage Partners, Randy Bean and Tom Davenport, came back with a finding that the technology part of data is only 10% of people's frustration; 90% is the people and process parts. So once again, keep that in mind as you look at this process of applying technology to your problem, realizing that it's not going to solve most of your problems. There are things that none of these tools and technologies can do, but there are some things that they do very well. Let's start talking about it. First of all, MDM. Almost everybody thinks that MDM is a system you buy. It's not. First of all, if you're not in our discipline, people think you're talking about mobile device management, because we're a much smaller discipline than the mobile device management market. They win by sheer size; that's okay. But what we're really trying to say is that this is a strategy, not a silver bullet. And yet it's sold that way, and people buy it, and it becomes very problematic. Now, the idea of master data is that we have a set of constructs wherein we change data in the system of record. I was just working with my students a few minutes ago: if we changed your address record, we wouldn't want to change that in every record for every class that you took. That would be inefficient. We might need to know where your residence was when you took a class, but there are still better data structures for that.
And this whole MDM architecture really is being sold as much more of a technology solution than a strategy. And so the first bit of this is to approach it with the idea that MDM is not going to be a solution out of the box for you, but a strategy for people to adopt. And the key is, if we de-emphasize the people and process components, then we're not going to have the proper governance and process architecture components in our architecture, and that will make it less likely the tool will succeed. Again, the tools may be good, but we've also got to have the methods in here. And there's a very big disparity right at the moment between tools and methods. Right now, stored data is increasing at almost 30% annually, and the data workforce is not growing much at all. And so what that really means is the demand is here and the supply is here. Now, unfortunately, my undergraduate classes are facing a situation where, at the beginning of the semester, they had one of the best employment outlooks possibly ever in history, and all of a sudden there are 15 million people between them and the jobs that they want. So we want to be careful of this, and we've been emphasizing that it's really key to not paint yourself into a corner, because a fool with a tool is still a fool. Let's take a look at what that means. We can look and predict, using something called Moore's Law, where technology is going to be at a certain date, because we understand the pace of progress it's making. The hardest part of doing requirements is not doing the design portion of it. And what we're trying to tell people is that you need to postpone these technology investments as long as you can, because however long you imagine you can postpone them, there is going to be a cheaper version coming along in the near future.
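The master data point about the address record can be made concrete with a tiny sketch. Everything here is invented for illustration (student IDs, a masters dictionary, an enrollments list); the point is only that the current address lives once in the system of record, while records that genuinely need history keep a point-in-time copy.

```python
# A sketch of the master data idea described above, with invented names.
# The student's current address lives in one master record; enrollments
# reference the student by key instead of copying the current address.

masters = {
    "S-100": {"name": "Pat", "address": "12 Elm St"},
}

enrollments = [
    # keep a point-in-time copy only where history genuinely matters
    {"student_id": "S-100", "course": "DB101", "address_at_enrollment": "12 Elm St"},
]

def change_address(student_id, new_address):
    # One update in the system of record; no need to touch every enrollment.
    masters[student_id]["address"] = new_address

change_address("S-100", "9 Oak Ave")

print(masters["S-100"]["address"])              # current address
print(enrollments[0]["address_at_enrollment"])  # historical residence preserved
```

The design choice is exactly the one the talk describes: update once in the master, and keep residence-at-enrollment only because that historical fact is itself data worth keeping.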
The other thing to be careful of is that some vendors are very unfortunate in the way they describe what's going on with their tools. I had the CEO of a very large utility company lecturing to my students a couple of weeks ago, and he said, you know what my biggest problem is? When I tell the vendors I don't need their technology, they go around me and talk to my board of directors and talk to my boss. It's really hard to explain to them multiple times around this. There are a couple of companies that do what's called vendor project promise auditing, which is the idea that if something is promised, you ought to follow up and make sure it's actually delivered. What a novel concept. And then there's something with very low understanding outside of IT called the hype cycle, or the hype curve. Now, Gartner likes to claim credit for it, but truly it was actually invented by Lady Augusta Ada King, who is also known as Ada Lovelace. Ada was the daughter of Lord Byron and a brilliant person in her own right. She looked at weaving machines that had what were essentially punched-card interfaces and said, I could do math on that. And people said, I don't know what you're talking about. And she said that at some point somebody will invent a machine called a computer, and it will be able to compute these things. So this is a picture of the actual Bernoulli computations that she was planning to put into the computers just as soon as they were invented. Now, she also made an observation that I put on that first slide but passed over pretty quickly: when considering any new subject, there is frequently a tendency to first overrate what we find to be interesting or remarkable, and then secondly, by a natural reaction, to undervalue the true state of the case.
Notice that is pure psychology, and Gartner encompassed it in their hype cycle, which says a technology trigger leads a brand-new technology to something they call the peak of inflated expectations. The next phase, almost inevitably, is something called the trough of disillusionment. Only by going up the slope of enlightenment and reaching the plateau of productivity do we really understand how to gain the benefit from the tool. Let me show you how that is used in real time. Here's July 2018. If you were doing data as a service, it was at the top of that hype cycle, probably not the best thing to invest in at the time. Information stewardship applications were headed for the trough, not great. And if you were doing master data management in July of 2018, things were looking pretty good. They'd gotten through their trough and were getting back to where they ought to be. Again, just a couple more. Database platform as a service was maturing. That seemed to be very, very nice. Here's another one for information governance and master data. Again, if you were doing machine learning in July of 2018, that's great. If you were doing metadata management solutions, not so good. And again, the MDM stuff was on the upswing. One more real quick one, an analytics and artificial intelligence hype cycle. Data storytelling was on the upswing. Prescriptive analytics was right at the top, not the best place to be that year. You can get these. If you have trouble getting them yourself, it's pretty straightforward to call up a colleague in the library at your local university that does have access to most of the Gartner material, and they can get you access to it. And doing collaborative research with a university is a really good thing anyway.
Okay, so with that as background, let's look specifically at some data management technologies. Again, they should follow the same concepts as managing data: the idea is we want to make sure that what we're doing is managing data as data. Now, ITIL has a wonderful library, but they don't really have a lot of data parts in the ITIL library. They also understood some parts of the elephant, but not all of it. Key there, again, is that requirements and design dichotomy we talked about before: the hardest part of doing requirements is not, in fact, jumping into the design aspects. So ask what the context and value are for the data technology solution: what problem are you trying to solve with this data technology? What sets this technology apart from any others? Where are there specific challenges around it? For example, I have a 10-page document, down to the mouse-click level, for my students to get access to a secure server with the CASE tool platform that they're working on this semester. And finally, are there security breaches that you need to look out for? All of these are critical questions to consider as you're diving into this. The data technology architecture is part of the overall technology architecture. It's often considered part of the enterprise architecture, and it addresses which technologies are standard, required, preferred, and acceptable, and these are going to be different answers for different organizations. If you're part of a larger organization, they have probably specified these. If they haven't, it's probably a good idea for them to, because the simplicity will benefit the larger organization. Which technologies apply to which purposes and circumstances? Again, just because you have a technology doesn't mean you should use it. The aphorism is that we've taught students how to use a hammer, so they look around for every problem to look exactly like a nail, so they can use the tools that we've taught them.
It's backwards. And in a distributed environment, how is the data movement handled? A very critical, important part of the problem. So let's dive into the first one, which is CASE tools. CASE is a subject which, unfortunately, is barely taught in schools at the analysis and design level. I'm not sure when or how, but it has disappeared from the curriculum. So we tell the students that there's lots of metadata that they need to manage and that somehow they should do it on paper, and some of them will run into CASE tools and ask us about it, and we'll say, yes, they do exist, but that's not part of the textbook, so they don't even learn of their existence. Oh my goodness, what a crime. CASE tools have been around for years, decades, generations, a long time. The idea is, if we have information about the aspects of system development, in this case a vision for the system, then we can add everything from planning statements to entities, to attributes, to models, to very specific pieces. And if you haven't seen CASE tools, I would suggest by all means just Googling them and getting started. They have come a long way. We went through a period in time where the market was not looking very healthy for them. Here's a true story. I was with a bank president at one point, and they were trying to decide on a metadata management tool, and I pointed out that this entire industry would have fit inside the fairly major acquisition of a company they had just done; they could simply have added it on as a part of that acquisition. They would not only have had all of the technology that was available but, most importantly, the ability to keep it from their competition. They really liked that particular idea. All of these can be done at strategic, tactical, and operational levels. You can look at the entire model or select pieces of it. And again, CASE tools are designed to help you do this.
One more picture on CASE tools: they've also gotten into XML and looking at things from multiple perspectives. So again, you might be looking at a component of a DTD that has certain elements and attributes in it that you want to be sure you're highlighting. The goal is that the CASE tool should provide you the golden source, just the same way master data management provides you with the best source of this information. The challenge with CASE tools, and this is an older example that I'll lead in with, is this. Microsoft has their own tools that help out, and in fact, the most popular CASE tool is Microsoft Excel. The second is PowerPoint. I know that's a really bad state of affairs, but it's nevertheless true. What it means is that if you're using anything above Excel, PowerPoint, and Visio, you're actually jumping into newer technologies. Again, both of these have come up. Rational Rose used to be the main thing. There are a lot of open source tools, but just because it's open source doesn't mean it's free. There's a wonderful set of lists of CASE tools that you can go look through, with all sorts of pluses and minuses about each of the various tools. The gotcha part of CASE tools is this. We start at the top and say, whether these numbers are realistic today or not, they were realistic at one point in time: you're going to pay a certain amount per seat, let's say it's $2,500 a seat, and I've got 75 developers that I want to equip. That multiplies into a tip-of-the-iceberg situation where you need to add another million dollars in training to the whole thing, and then support and other things that go along with the process. So it's a multiplicative effect.
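The tip-of-the-iceberg arithmetic above is worth writing out, using the speaker's illustrative numbers ($2,500 a seat, 75 developers, roughly a million dollars of training on top). The figures are the talk's hypotheticals, not real price quotes.

```python
# The multiplicative CASE-tool cost effect described above,
# using the talk's illustrative numbers (not real vendor pricing).
seat_price = 2_500       # dollars per developer seat
developers = 75
license_cost = seat_price * developers   # visible tip of the iceberg
training_cost = 1_000_000                # assumed additional cost from the talk
total = license_cost + training_cost

print(license_cost)  # 187500
print(total)         # 1187500
```

The licenses alone look like $187,500, but the full commitment is well over a million dollars before support is even counted, which is the multiplicative point.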
The goal with CASE tools, of course, is not to acquire them on a broad deployment basis like this, which is a lot of overhead, but to try and find an entry point where you think they will provide value, and then support them as long as they provide value. Most of these organizations are now renting their technologies, and that's actually quite a big help, because you can turn them on and off as you need to. Although it's a different topic, I say over and over again that your data management program should be as durable as your HR program. Nobody ever says, by the way, we're done with HR, we've hired our last person, so let's close HR up, we don't need them anymore. And you've got to get people to understand that data is exactly the same thing. It's a programmatic effort. So I'm going to hit you with two final slides on CASE tools. The first one is a quick taxonomy, just to note that, like every mature technology, there are different strata, different layers, different levels of where it should be. So it is well worthwhile to investigate and see whether CASE tools have the requisite capabilities for you. Obviously, CASE tools don't have knowledge and skills, but they do have capabilities, so that's what we're looking for, and try to figure out where they fit in the general scheme of things, because it's a fairly big environment. The trend in the marketplace is that the old way, the legacy way if you will, was that everything had to fit into one specific technology. With the addition of XML and special portals and the like, we've now been able to morph that into a good set of integration options where we can use CASE tools more as utilities to go in and do certain things. For example, some CASE tools are really good at normalizing data. It's a mathematical process.
It's not 100% perfect, but I'd rather use it than not use it, and it also walks you through a very nice step-by-step process to squeeze anomalies out of the data. So that's a great way of using a CASE tool as a utility, rather than necessarily building an investment all the way around it. All right, that's basically it on CASE tools. Let's move on now to repositories, and repositories are also a very big challenge in organizations. Gartner data again here looks at what the challenges are for data management practices. Well, the real key is that 60% of them identify delivering value. Trying to find data that delivers certain value for you, and scoping it, is the major thing, along with supporting data governance and data security. And you can see the rest of them trail off down the end. And yet when we look at what's happening on the metadata spend, metadata represents only 12% of the time. This was a time-and-motion study of time spent in data management, but it's probably a pretty good reflection of reality. So for repositories, to get a 12% lift, most people are not really interested. And even though this was written 20 or 30 years ago, it's still true: most executive and IS managers view a repository as an esoteric technology that's not related directly to the business. Well, when they say that, I refer them to the old Wells Fargo model that I was familiar with from a couple of years back, which was so directly connected and so well managed, so state-of-the-art in its connection to their production system, that if the metadata system went down, production went down, and if production went down, metadata went down. That's a tremendous achievement. I don't know, I'm not affiliated with Wells Fargo, so I don't know if they've kept that up, but it was a spectacular, world-class achievement at one point in time, and I got to witness it firsthand. It was just terrific. So what's happening when people use repositories?
Well, most of them aren't, just for starters, and many are building their own, which is a really good idea. Keeping metadata, I've already said, is the same as keeping data. So building your own environment for a metadata repository is a really excellent way of learning about what you're trying to get into, if you haven't had the ability to do this. Now let me tell another story. I won't tell you which organization this was, but it is one that I have some of my retirement savings with, so I hope they do well. I know that they have a repository that they purchased so that they can answer a question on a survey: yes, we have purchased a repository. But I also happen to know that they haven't taken the shrink wrap off of that repository, even though the vendor wanted them to take the shrink wrap off and actually use the repository that they had purchased. Gosh, what a terrible story. Now let's go back to that 23 percent. All of you out there have a SQL Server guru that you can get. If you don't have one that's local to you, you can certainly go to the local university and find an A student that's doing really well and needs an extra project, and build yourself a little repository. I've been doing this with companies for well over 30 years. The idea is that you build it and you use it yourself and you get smarter about the process, and then you'll be able to have a really good conversation with the repository vendors. The major players in this space are really kind of small, and I don't mean tiny, but there are a lot of choices. So again, in July 2018 Gartner put out these Magic Quadrants, and if you're up in those Magic Quadrants, that tends to give you good points as far as being able to match capabilities. Again, I would suggest for all of you that you build your own and play with it for a couple of years. It's the idea of saying, you know, we'd like to have a pet.
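The build-your-own repository suggestion above really can start very small. Here is a minimal sketch using Python's standard-library sqlite3 in place of SQL Server; the schema (entity, attribute, definition, steward) is an illustrative assumption, not a standard metadata model.

```python
import sqlite3

# A minimal build-your-own metadata repository, as suggested above.
# sqlite3 stands in for SQL Server; the schema is an illustrative assumption.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE metadata (
        entity     TEXT NOT NULL,   -- e.g. table name
        attribute  TEXT NOT NULL,   -- e.g. column name
        definition TEXT,            -- business definition
        steward    TEXT             -- who is accountable for it
    )
""")
db.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?, ?)",
    [
        ("customer", "cust_id", "Unique customer identifier", "Sales Ops"),
        ("customer", "region",  "Sales region code",          "Sales Ops"),
    ],
)
rows = db.execute(
    "SELECT attribute, definition FROM metadata WHERE entity = ?",
    ("customer",),
).fetchall()
print(rows)
```

Even a toy like this forces the questions that matter (what do we record about each attribute, who stewards it), which is exactly the learning the talk recommends before buying a commercial repository.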
Okay, that's great, but you also have to take the pet to the vet and feed it and take it outside and give it runs and all the rest; it's not just the cuteness that comes with it. And that's an important set of characteristics for these, because again, remember, 90% of these problems are not going to be technology problems; they're going to be getting people to adopt changed ways in which they work. A quick deviation to show you how far out we got on this: IBM published some text a while back. If you Google IBM's AD/Cycle information model, they put together the metadata model to rule all models. It's a metadata model of metadata models that shows how all of these are related, and most importantly, it's wonderfully developed metadata material that you can use in your projects as a starting point; you do not need to start from scratch. That will be one of the major topics when we get into some of the data design issues: there are lots and lots of templates and patterns that you can use, including this one, which is super useful. Also, my colleague Mike Gorman has a simplified working version of this as well that is extremely useful, so just a quick plug for him because it's so useful. So the key here is that repositories don't have to be integrated solutions, but they must be easily integratable solutions. That means it's acceptable to use spreadsheets, as long as you all use spreadsheets that can be integrated at some point in the future. Having repository functionality does not equal a repository. It must be able to easily evolve into one, but it doesn't have to start out that way. This is the pathway that you can take to gradually grow towards that. Again, multiple spreadsheets and multiple repositories are not necessarily bad, and minimum functionality ideas are the way to get into these things.
These are wonderful ways of checking all of this, diving into it a little bit further and learning about it, before you go directly from not having one to paying lots of money and trying to install one as a major business initiative. First law of data management: in order to manage metadata, you need this repository functionality. If you don't have it, your whole project suffers. All right, so let's move into a class of tools called profiling tools. I have a little set here again. The idea here, and I showed the results of this before, is how much time is spent by data management teams across all of their disciplines. You can see that we quoted the 12% figure there. Another part of it, though, over here on the left-hand side, is looking at governance, quality, and integration. And I consider these three activities to be complementary to the things on the right-hand side of the diagram, but it gets to the 80% that David was describing in his intro as well. That 80% of the time DJ Patil said is spent, there's a word for it: it's called munging the data. I don't even like to use the word in presentations, but it's doing all the things that you need to do to get the data ready. And this history of technology does have an interesting basis. This was money that we funded out of the Defense Department to Carnegie Mellon University, and we asked questions about how they could come up with ways of describing these things. And they sent the money to a woman named Dina Bitton, who I'm hoping to get for one of these webinars we put together, because she was instrumental in creating the algorithms that form the basis of the profiling, discovery, and analysis techniques and technologies that we have. The key for these things is that they can deliver up to 10 times the productivity of manual approaches. And that's important, because the lesson here clearly is that people's time is much more expensive than the machine time that we can put on these things.
And we've changed the model of how we used to construct and answer questions about data. The old model used to look like this. We would do it the old way: it would be manual in nature, it would be brute force, it would be done independently, it would not really address quality aspects, and it would only be done once. And we'd do it by throwing a projector up and spinning through these models, showing these models as able to represent the business concepts that we have. Now, the new way with these profiling tools is semi-automated, engineered, and repository-based. We can integrate quality. They are repeatable. They give you currency and accuracy. Let me show you a sample of the type of results that come from profiling technologies. For example, the old way of looking at something might be to examine a bunch of data by eye. The new way of doing it, we can now look at this pile of data and see what I've circled a little bit above row five; it's actually column five, right? We have a pay code with a minimum value of an asterisk and a maximum value of V, as in Victor. Okay, interesting. Sort of going, what is that? I don't know. And normally we would have to go find an SME. Sometimes we'd even need to get permission to access the SME. I've had some companies where I've worked where they send minders with me to make sure that when I access the SME I don't upset IT's plans. It's an interesting concept. So we're looking at this pay code asterisk and trying to figure out what it is. But one of the things we can do with profiling tools is double-click on that column five, and it will show me the values and the frequencies. And again, the value there right under that circle, the asterisk, shows up 11 and a half percent of the time: 587 specific instances in the sample data that I'm looking at, but enough to tell us it's 11 and a half percent of the distribution.
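That value-frequency step is simple enough to sketch. Here is a minimal, assumed example in Python of the kind of column profiling a tool does automatically; the rows, column names, and pay code values are hypothetical stand-ins for the example in the slides, not the actual data set:

```python
from collections import Counter

# Hypothetical sample rows; the pay_code values (including the
# anomalous "*") are made up to mirror the example in the talk.
rows = [
    {"emp_id": 1, "pay_code": "A"},
    {"emp_id": 2, "pay_code": "*"},
    {"emp_id": 3, "pay_code": "V"},
    {"emp_id": 4, "pay_code": "A"},
    {"emp_id": 5, "pay_code": "*"},
]

def profile_column(rows, column):
    """Return each distinct value with its count and share of the sample."""
    counts = Counter(row[column] for row in rows)
    total = len(rows)
    return {value: (count, round(100 * count / total, 1))
            for value, count in counts.items()}

def drill_down(rows, column, value):
    """Pull back the full rows holding a suspicious value -- the 'double-click'."""
    return [row for row in rows if row[column] == value]

print(profile_column(rows, "pay_code"))   # "*" appears 2 times, 40.0% of this tiny sample
print(drill_down(rows, "pay_code", "*"))
```

The real tools do the same two moves at scale: a frequency distribution per column, then a drill-down on whatever value looks anomalous.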
Oh, and not only did we get those numbers that were up there, but we can also double-click again on the rows with the asterisk. And sure enough, they pop up and they've all got payment method UK. So what was originally the process of "I'm not sure what this is, I'm going to have to go find somebody and ask" — and they say, pay code and asterisk, what tables are you looking at? When were you looking at them? — there are lots of questions that go along with all of these. And they show up in this type of weekly project schedule. When we would do this — I actually had the title once of US Department of Defense Reverse Engineering Program Manager — yes, when we did this, we would take three days a week, prepare in the mornings, and we would require the participation of the expensive subject matter experts. By the way, the way you discover your subject matter experts in a technology exercise is you ask the question: who in this organization is so valuable that you could not do without them? You get them to write those 10 names on the paper, and then you turn the paper around and say, those are the only people that are acceptable for my team, because they're clearly the ones that can get us through this process efficiently and effectively. And as you might imagine, if you do win that battle, they are expensive people. Here's the proactive approach. Again, using these technologies, the proactive approach allows organizations to spend most of the time not requiring SME participation in the development, which gives the data analysts time to work on the model preparation pieces and get everything ready. And we only need the subject matter experts a couple of times, during perhaps a couple of afternoon sessions. It's clearly an order of magnitude cheaper. Interestingly enough, long before they sponsored this, I've called out Trifacta as a really wonderful set of technologies to do this. It's not officially a profiling tool.
I'm sure David can clarify that when we get back to the questions and answers in 20 minutes that we're heading towards. But I've used Trifacta for years and years, and they've got a great video that shows what they call data wrangling. It's still that same 80% that's out there. So you've got a bunch of tools that fall into these categories that are really, really useful. There's the visualization that Trifacta uses on their site — by the way, the URL says trifacta.com. So, data is all over the place, but too much of it can be messy. I'm going to add some other narrative on here too, which is that 80% of your data is redundant, obsolete, or trivial. And again, these tools help you to get rid of certain aspects of what's going on there as well. So again, just a quick look at the video, but you guys can get to that one and take a look at it. Let's move on and talk about four categories of data quality engineering tools. The categories are basically thinking about it, cleaning it, making it better, and simply checking on it. All right. Again, they don't divide up into neat categories. First we're going to go back to profiling, because yes, while you're looking at that 11% of the data that has an asterisk as the lowest value in that column, you can also create a query that goes back and makes sure that all the rest of your legacy data is similarly clean and pristine in that same way. Then we'll talk about parsing and standardization, transformation, identity resolution and matching, enhancement, and reporting. Again, that's what's coming up in these next little bits. Remember, we used to do the profiling the old way. Profiling now gives us a set of algorithms that allow us to do statistical analysis of the data quality values as well as explore relationships between these value collections.
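One concrete form of "exploring relationships between value collections" is testing for functional dependencies: does every distinct value in one column always map to exactly one value in another? A minimal sketch, with hypothetical column names and rows chosen purely for illustration:

```python
from collections import defaultdict

def functionally_determines(rows, lhs, rhs):
    """True if every distinct lhs value maps to exactly one rhs value (lhs -> rhs)."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return all(len(values) == 1 for values in seen.values())

# Hypothetical rows: zip code determines state, but not street.
rows = [
    {"zip": "23220", "state": "VA", "street": "Main St"},
    {"zip": "23220", "state": "VA", "street": "Cary St"},
    {"zip": "21201", "state": "MD", "street": "Main St"},
]

print(functionally_determines(rows, "zip", "state"))   # True
print(functionally_determines(rows, "zip", "street"))  # False
```

The commercial tools run checks like this across every column pair (and beyond), which is the statistical machinery behind the normalization claims in the next paragraph.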
Some of these technologies are so good that you can derive a logical third normal form from any physical pile of data just by ingesting it into the tool and working with it. Again, it is a semi-automated process, not an automated process, but it's still much faster than trying to figure it out on your own. These data quality tools also do parsing and standardization. And the question here is, let's look through a bunch of numbers that come in groups: 301, 745, 60, 350. What are we looking for? Well, gosh, maybe there's a rule that says we're always going to be looking for telephone numbers in here. And you can see there are some examples where the data follows the rule and some where it doesn't exactly conform. The parsing tool notices the problem. The standardization capability corrects and addresses the problem. Most of the profiling tools do the identification piece; they do not necessarily do the standardization piece, so there's a disconnect there. Some of them are more integrated between the two. But we can see, as we're looking at these codes, that we go through and ask, what do we use these codes for? How are they supposed to work? And all of this information will help you get towards something that you're trying to do, which should be driven by a specific, articulable business objective. I gave this lecture to the undergrads the other day because they didn't know what an area code was. Those of you that are young looking at this: that's the 301 part. And we used to have a rule in the United States of America that if a group of numbers was associated with a telephone and it had a zero or a one as the middle of the first three numbers, it was automatically and always an area code. And then we ran out of area codes. So we don't have that rule anymore, but we can at least go back in and look at these parsing types of tools. We also look at pattern types of tools. Again, these can take a look and say, there's data.
We clean it up, and then maybe we want to group it. So we're going to try to categorize it according to one or more patterns. These numbers could be telephone numbers because they match numbers that show up on a list of validated telephone numbers. So again, we can go one step beyond this and try to find things that fall into patterns. Interestingly, the development of technology in this area of the business is such that the algorithms are far more developed than the actual practices. And the reason is that somebody — smart people do this, right — will go through and learn a technique for grouping things appropriately. And what they'll do is develop an algorithm that needs to be trained. So the algorithm will be sound, but it will not be operationalizable until you have trained it. The training of the algorithm requires good data. And that's the biggest bottleneck that we have right now in artificial intelligence: we have a dearth of good data sets for people and algorithms to be trained on. One final example before I move on to the next category: if you're doing training of visual systems, such as face scanning, the face-scanning technologies out there are able to do certain things but are not able to come back and actually give people the kinds of results that they want. And so you're seeing cities now that are saying, while you may build that, we're not going to let you use it here. So again, this lack of data to train this pattern recognition stuff is something that's really hurting, and it's someplace for somebody to make a very good contribution to the industry. Again, our fourth category of these tools is identity resolution and matching.
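The parsing-and-standardization step described above — notice where a value doesn't conform to the telephone pattern, then rewrite the conforming ones into one canonical shape — can be sketched in a few lines. This is an illustrative, assumed approach (ten-digit US numbers, one arbitrary canonical format), not any vendor's actual algorithm:

```python
import re

def parse_phone(raw):
    """Extract the digits; a conforming US number has exactly 10 of them."""
    digits = re.sub(r"\D", "", raw)
    return digits if len(digits) == 10 else None

def standardize_phone(raw):
    """Rewrite any conforming number into a canonical NNN-NNN-NNNN shape."""
    digits = parse_phone(raw)
    if digits is None:
        return None  # flag for review rather than guessing
    return f"{digits[0:3]}-{digits[3:6]}-{digits[6:10]}"

print(standardize_phone("(301) 745 6350"))  # 301-745-6350
print(standardize_phone("745-60"))          # None: does not conform to the rule
```

Notice the split the talk describes: the parsing function only identifies the problem, and it takes the separate standardization function to actually fix the representation.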
One of the more insidious problems — I was speaking to the CDO of one of the states right now, who was working, obviously, on coronavirus types of issues, and he said one of the biggest problems that we're having — and let me just back up on this, because this is a really interesting example. One of the biggest problems they're having is making coronavirus personal to people, because if you don't make it personal to people, they don't understand the implications and won't pay enough attention to realize that it's a threat. But his data was coming in to him aggregated by healthcare facility, while the data that he needed was based on patient zip codes. Now you can imagine that's a big challenge, and it's a perfect type of problem to apply these data quality tools to. Again, you can look at a number of distinct techniques: deterministic, relying on defined patterns; rules-based; or even probabilistic, based on the experience of the matching algorithm designers. As I mentioned before, most of these create products that are then turned into trainable algorithms that then themselves become useful to all of us. Remember, we're just on data quality tools here, and we're kind of flipping through them fast. Another set of data quality tool types are what we call enhancement tools, and this is the idea that as data comes along, we're adding value to it. One of the most important, valuable things that we can attach to data is its provenance — terrible word — or its lineage. Where did it come from? From where did you get this person's data? But you can see I've got other examples: date and time stamps, auditing information, contextual information. Where we've failed as a society is that when we look at data quality activities and take action to resolve data quality problems on a regular basis, we forget to record our lessons learned from that process.
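The deterministic-versus-probabilistic distinction in identity resolution is easy to show side by side. A minimal sketch, where the record fields and the weights are assumptions made up for illustration — real matching engines tune these from training data:

```python
def deterministic_match(a, b):
    """Match only when the defined key field agrees exactly."""
    return a["ssn"] == b["ssn"]

def probabilistic_match(a, b, threshold=0.75):
    """Score agreement across several fields with hand-set weights (assumed values)."""
    weights = {"last_name": 0.4, "zip": 0.3, "birth_year": 0.3}
    score = sum(w for field, w in weights.items() if a[field] == b[field])
    return score >= threshold

rec1 = {"ssn": "111", "last_name": "Smith", "zip": "23220", "birth_year": 1980}
rec2 = {"ssn": "112", "last_name": "Smith", "zip": "23220", "birth_year": 1980}

print(deterministic_match(rec1, rec2))   # False: the keys differ
print(probabilistic_match(rec1, rec2))   # True: every weighted field agrees
```

The probabilistic version is what lets you link a facility-level record back to a likely person or household even when no shared key exists, which is exactly the zip-code problem the state CDO was facing.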
The last type of data quality tools is reporting tools, and these are mostly what people think of as BI and all the rest of it, but they can also be used quite a bit for data quality. What you want to do is set up dashboards and things that show you — or, more importantly, alert somebody — when certain things are not within tolerance. We can catch them early on. We do not have to send people back out into the field and then call them back and say, oh gosh, all the work that you guys did yesterday, you need to do it over again because we screwed it up. Again, a very brief overview of all of those bits and pieces. Now let's dive a little bit more into some life cycle considerations, just to show you what a specific challenge this is. Our good friend Tom Redman put out the original version of this life cycle in '93 with his understanding at the time of what was going on, and I think that was a reasonably correct place to start. You need to acquire data. You're going to store data, and you're going to use it. It turns out it's actually a little bit more complicated than that, and we've gotten better as an industry. We now understand that storing data requires us to do some steps before: create some metadata, structure the metadata. Then we can create the data. Each of these more complex life cycle considerations also leads to a mapping back and forth between these portions of the life cycle and the types of tools that you're using. So there's a good mapping that occurs here. Let me go back for a second, though; I'm going to talk a little bit more on this slide. So again, the original version was acquire, store, and use. Good stuff; makes perfectly good sense. But our understanding is a little better now. We can't just store somebody's data unless we create some metadata first, a place to hold it, and structure that data. We move it around, organize it. Those data structures can also become intellectual property all the way up and down.
It's very, very important information. Again, if — at least in the state I was talking about — they had collected their data correctly, they may have been able to get information to people and save lives. So it does matter. Only once you've done those first two activities do you actually create the data. Data isn't just stored as it was in the original version; it is utilized, and utilization may involve manipulation, so it may go back in and be re-stored. The utilization may also involve assessment, which is what we were describing a few minutes ago with the reporting and enhancement types of data quality tools. We may want to refine the data, and that data could also be re-stored. Then, out of that data assessment, we oftentimes will decide that we need to refine the metadata — that a different structure of metadata will better suit our needs. Those are re-engineering considerations. Sometimes they're critical and important; other times they're not. That's one of the things that you guys are more expert at in your specific domains. When thinking about the life cycle, understand it's a lot more complicated than just acquire, store, and use, just the same way as data management is more complex than anything that happens between when data is captured and when data is used. We've got to get more precise about that or it's simply not helpful. Let's talk about another set of technologies here as well. Whoops, I'm sorry, I went too fast, and that was really rude. The focus now is around data integration. We added a new pie wedge to the DMBOK wheel this time for data integration, and there are some specific tools, combinations of the things we've talked about before, that come into play as additional types of technologies that can be utilized here. One of them is portals, and portals, even though it's kind of an old subject, never really were utilized to their full potential.
What we basically would do is wrap up green-screen applications and put a portal on top of them. Again, all that's good, but you can take this a little bit further, and this is where we see most organizations not doing so. If your portal is fully XML-enabled, then you can also enable data interchange at the portal level. Now, this is another important part of data quality where you can use data management technologies. One of the things I will often advise my clients, as I'm speaking with them on specific problems around this, is — they'll say, we don't trust the data. Okay, well, let's figure out what it would take for you to trust the data, and let's put data only in a portal where it has known capabilities, known levels of quality, known lineage. Again, it's your business, so you decide what's better here. These portals can become extremely powerful in doing what I call data branding exercises, which is a great way of selling your data management program to the rest of your organization, because other managers will come to you and say, I want my data in that portal, not realizing that what that means is that the data has been brought in, cleansed, understood, and architecturally integrated with the other data in the portal. Again, many of the states are doing this — moving past the original data dumps, you know, sort of the first version of data.gov, which is a collection of PDF files in many cases, and putting truly integrated data out there so that we can enable the citizen data scientists that we'd like to have. It's not going to happen this week, but it may get there at some point. These web services can then be wrapped up, depending on what's actually going on. And I'll just tell a quick little story on this particular one. This is a get-off-the-mainframe rehosting kind of thing — you know, all the people who think mainframes are too expensive because they haven't gotten their first cloud bill yet.
And one of the interesting things that we did here was that the portal replaced the interface. Interface code is about a third of the code that you have in your legacy systems, and it could be moved into tiny web services that actually function much more efficiently and effectively and, most importantly, can be parallelized in the process. So again, the portal as a technology has some things that are very useful here, and most people are not utilizing them to their real advantage. One thing that I've not seen for a long time — and I was very familiar with these guys when they were around — they had created an XML portal for everything in SAP, and it looked like a terrific idea, and then they got bought by SAP, and I'm not sure what happened to it. Maybe somebody will tell us whatever happened to the TopTier product, because it looked like anything could be connectable to anything else, potentially, under this portal. It certainly is the way I used it in production, so it was a very interesting piece, but of course what I used was never made part of SAP. Sometimes software goes places to die; sometimes it actually comes back with a new life. These portals can also be used as data quality tools, where you can do the kinds of analysis we've been looking at on the specific data.
A couple more things in here, just more acronyms, unfortunately. ETL — most people are familiar with that. I've got a group here in Richmond that decided that data engineering is the primary focus of these things. It's a good definition — not complete, it leaves out some other things — but it's an excellent place to focus because of the leverage that you have. Other groups are looking at enterprise application integration and enterprise information integration, so if these things catch on, they'll eventually become part of the Gartner findings and things like that. For example, here's one from a couple of years ago where — I'm sorry, I've got it backwards — the teal tables are the base tables and the orange tables are the virtual tables. And, you know, Amazon can do these things now, so a lot of these technologies have migrated towards clouds. This is probably one thing that your mainframe will have more difficulty doing than not, although you should see the mainframe we're going to buy at VCU; it's going to be a doozy. So we've looked at a couple of different things here. Again, the three-legged stool: technology only takes care of ten percent of our problems, so I've really left you with the big challenge of addressing the other ninety percent, the people and process problems, but that's not what we created this survey for. Let's take a couple of steps back from all this and just sort of think through some key Gartner findings around cloud issues and, interestingly enough, around machine learning — things that are becoming more and more useful out there. One of the things Gartner is saying is that cloud is going to become a commodity within three years, and the only reason you'd use Google Cloud is because you're really interested in what's happening on YouTube, or the only reason you'd use Microsoft's cloud is because you're really interested in Office 365 and/or LinkedIn — good sources of data, by the way, both of them — and the only reason you'd use
an Amazon cloud is because you're interested in retail. We'll see whether Gartner's correct. That's one of the wonderful things about Gartner: they do this and they put a date and a probability on it. So when we take these things all together, what we're trying to do is get people to realize that data and data technologies really only represent the tip of the iceberg, and if we can get them to here — which is the desired state — that would be wonderful, but truly, what they really need to realize is that this is the more correct version of it. And we're getting all sorts of people that really want to go digital. All right, you probably are under pressure from your own management to go digital, whether or not anybody knows what that means. But we do know something: it's not possible to go digital without at least spelling data, and you need to spell it correctly in order to get it to work for you. So this process of trying to figure out where data fits into all of this is so foundational and so fundamental — I was so proud of my undergrads today because they really got it. This is my own personal logo that I use: bad data into anything awesome is going to give you bad results. And yet I saw this recently out on LinkedIn — it was so recent, and it was really scary — and it says, I just realized something. If I've got something awesome and you put bad things in, you're likely to get bad things out. By the way, I'm going to consider that chocolate ice cream, so yummy stuff in there. And the real interesting insight was: it's true with blockchain, it's true without blockchain. And yes, if that's a recent technology realization for you, I hope you're very young; that's the only thing I can say there. So this gets us to garbage in and garbage out, and GIGO means that garbage is going to sit over top of everything else. If you've got the perfect model, you're still going to get garbage results. If you've got a wonderful data warehouse, you're
still going to have garbage results. If you've got great machine learning, business intelligence, blockchain, AI, MDM, data governance — it doesn't matter what you're doing, they are all going to be dependent on having that good data, and if we don't correct the data, then we will never be able to get good information out of whatever technologies we are attempting to use. So in order to get to this quality, working through all of these bits and pieces, I'm leaving you with a couple of specific takeaways here, which are just some lists of tools that you can take a look at from the DMBOK, and lastly, a little bit of guidance from Gartner, which is to say that the process of buying should really be considered an investment process, and just like any investment, it's important to kick the tires and it's important to check references. Most of this stuff has been around for some good amount of time, and most of the vendors will be very happy to put you in touch with clients. Sometimes you'll need to sign a non-disclosure with them, but that can help you understand where you're going. The biggest problem that we have in the technology aspects of data management is that the vendors' products are up at level 10 and most organizational procurement processes are down at level one, and it's not a very good conversation, and it leads to too much rework and change, and that's just very, very unfortunate. So take a look at the capabilities, but also make sure you know what data they need to have. It wouldn't be any good to buy a great tool if you didn't have the right data to feed it; it's an interdependent set of processes. Use that automation to free up scarce specialist resources. A book that has been wonderful for this, in a very unexpected way — my wife told me when I met her, she said, before we have any business conversations, you need to read a book called The Goal by Eliyahu Goldratt. And I said, what is it about, and she said, that's all I'm going to say; if you
don't read the book, we are not talking about it. Okay. And I went and read it, and it's basically the theory of constraints: saying that in your organization there are things that are keeping you from getting things done. Find those places, draw circles around a couple of key specific pieces where data automation technology can be helpful, and use it for that purpose. Make sure it has a good business case for that purpose, that it does achieve a positive return on investment, and then update your policies and governance around all of this. Again, if you're into machine learning and all this sort of thing: stop paying people to develop algorithms that we can't feed with data. It doesn't make any sense. So we're back here at the top of the hour, and it's time for us to invite David back and turn it back over to Shannon to sort through the questions you guys have about this technology. Peter, thank you so much for another fantastic presentation. If you have questions for Peter or David, feel free to submit them in the bottom right-hand corner of your screen. And just to answer the most commonly asked questions, as a reminder, I will send a follow-up email to all registrants by end of day Thursday with links to the slides and links to the recording of this session. And everybody's really quiet today; let me scroll up. Oh dear, I put them all to sleep with my droning. I'm so sorry, everybody; they'll come back and listen later on. Oh no, it's been good; there's been a lot of chat going on on my side. Then let me scroll through here; let me just look through — lots of questions, you get to pull them out, lots of comments. Here we go. So, Peter, how did you overcome the challenges to, quote-unquote, sell the concepts? So let's go back to the premise on this. The premise is that an investment in technology will produce positive returns on investment and that you're doing it from the organization's perspective. Again, I'll put David on the spot here, because I know we want
to draw him into the conversation, but I'm going to say that David's tool costs a dollar — it does not; I'm making that up totally, and I hope I'm not blowing any of your sales material, David — but, you know, an investment in a technology that would save one of your knowledge workers maybe a hundred hours is a very easy calculation to make. On the other hand, if you're not going to save the knowledge worker a hundred hours but only ten, maybe the dollar isn't right. It's probably still a pretty good deal; by the way, I would certainly buy it for a dollar. David, do you have any thoughts on that, in terms of determining your ROI when you make this kind of an investment? We haven't been studying it; we don't have a lot of good data out there the way some other industries do, and that's a reflection that our discipline is probably less mature than we'd like it to be. Yeah, I mean, there are sort of the hard metrics of how much time each person saves in their day-to-day job. There are also some soft measures of ROI, which include things like discovering new elements of your data that you might not have seen using previous technologies, and there's also the added benefit of having extra time to explore new insights in your data — to find new analytics projects, to find new data science projects, whatever it might be. So there are sort of two elements of this: there's the time savings on one end, and there's also the benefit of having this extra time and spending it on more creative outlets. We've called this an opportunity benefit. Yeah, exactly. So given that type of a process — again, I would not look to take your organization and completely automate everything that's going on. If you're running a data management group right now, or you're a data leader of any sort, you know things are probably working or your paycheck wouldn't be coming to you. So the question is, what makes an appropriate investment, and what's a good way of
introducing this technology to an organization such that it gets adopted? Again, I'm reminded of a sort of funny thing that happened a couple of years back, which was that Mitt Romney had gone to a Wawa station, pushed some buttons on a computer, and his sandwich popped out the other end after the people made it for him, and he was just amazed. He talked on and on about that until his handlers got to him and said, stop that, sir; that's been around for years and years, and it makes you look like you're out of touch. Most of the things that are working in your organization are good, but I guarantee you a good inspection will find some things that could be improved, and that's the first place to look and say, what can we do with this technology? Again, some of the technologies are multi-purpose and can solve multiple different pieces of it; others are relatively single-purpose. Again, if you're going to use householding information in there, it's probably not going to be a real good way of determining whether your telephone numbers are all correct. Does that make sense? And we'll just rely on Shannon for that. Yes, it does. Diving in here — we've got some great questions coming in in the Q&A section. So: where are data management and data strategy right now in the hype cycle, and do you have a use case with the hype of microservices architecture? I don't know the answer to either of those questions. I haven't looked for them, but they are definitely things we could look up. David, do you have any information on that? Can you say that question one more time, Shannon? Sure, absolutely. So: where are data management and data strategy right now in the hype cycle, and do you have a use case with the hype of microservices architecture? I'm not sure; I can't answer that one either. Back to you, Peter. No problem. Yeah, we don't know, you guys. There's probably something out there that we could look up, but I just don't know what that is. I would say that, you know, unfortunately most of these things come and go
through this exact process. There's something useful in almost every new technology development. The one I'm very familiar with is service-oriented architectures. Think about it for a minute: if you had a tool that helped you identify services, or a set of tools that would help you craft services in an environment, that would be terrific. But ask the question. I ran across a real interesting logistics company out in the west; they brought me into the project and said, we're moving to SOA because our mainframe is too expensive. You know, I always disagree with the mainframe being too expensive as a general premise. Sometimes IBM will jack the prices up to get you off of old technology, but that's not the same thing as the mainframe being too expensive. Anyway, the idea was that they were looking at creating this service-oriented architecture as a replacement. So I looked at the major modules: there were six major modules in the architecture we had put together for them, and they were finishing up the first third of the first architecture module and had spent 85% of their budget. I'm hoping I painted the right picture there: they had six modules to implement, they'd created one third of one of them, and 85% of their programming budget had already been spent. Now, I tell you that because I asked, well, what did they build? And they said, oh, we've got more than 1,400 services. I'm sorry, there's no programmer in the world who is going to look through a list of 1,400 services to find a specific one. You've got to have them organized into categories, and you know what, we already do that with GitHub and several other very good technologies. Why reinvent the wheel around this? So, just a comment on those types of fads. Microservices are a trend off of that, a very good trend, a lot of it incorporating, again, very nice XML-based technologies that are significantly good, but I haven't seen the ability to weave them into a larger architecture in support of an enterprise yet. So, for the data management tech: if the client is not well versed on platforms, will hybrid, on-prem, or cloud be viable? Anything's viable; the real question is, what are your business objectives? I'm pretty sure David will back me up on this as well: when somebody goes in to look at buying his technology for a specific project, if you don't have a good idea what it is you're trying to accomplish, then it's very hard to say any technology will be better or worse in terms of supporting it. Your goal with moving to the cloud might be to give more people access to the data; however, there are many types of access that can be provided, and if you can pass links around and move into some of the open data kinds of models, things will be much, much less expensive. David, maybe you could talk about, from your perspective, the importance of focusing on something specific before you go and select the technology. Yeah, I mean, without knowing exactly what you're trying to accomplish with a change in technology, it's hard to properly evaluate it. You can see some of the new technologies available and be blown away by what they can accomplish, but if you don't have an idea of how that can be implemented at your organization, it's not going to get you very far. So whenever we're doing an evaluation with a prospect, we always first go into: what's your use case, what are you trying to get out of it, what is the timeline, do you have these specifics available ahead of time? Because really, without those specifics, it's sort of a waste of everyone's time. And given that type of a context here, I don't want to put you on the spot, but I know that we used to be able to download a copy of Trifacta from your website, right, so somebody could actually try it out? Okay, so yeah, we actually have a browser-based free version now. If you go to trifacta.com and click on the Start Free button, you can create an account
use it in Chrome or Firefox, upload your files, and start wrangling pretty instantaneously. And that's the kind of exploration that you were describing earlier. Again, I want to just really emphasize this point: people have latched onto things in data that are going to solve all our problems, silver bullets, things that, as I've taught, don't exist and never will. The technologies can in fact make things faster, but we have to be facile with the technology so that we can actually make appropriate use of it. I can remember, back in the day, it used to take us literally a weekend; we would start a server up on Friday night and hope it would be done by Monday morning. That's how long some of these things used to take. Now you can do them much more quickly in the cloud. But we eventually said to one person, where have you been? Well, I just went ahead and analyzed all the tables. We said, oh no, we discovered that nine tenths of them were redundant; why are you doing the same profiling of each of the tables to prove something that we already know? So yeah, absolutely, the focus has got to be on that business problem, and if that business problem is not well articulated, then it's very hard to see whether the tool is successful or not. That's not what the tool vendors want, and it's certainly not what your business wants. Next question: speaking as an enterprise architect, before embarking on a data science project, the data should be right; it should start with proper data. So what would be the top two activities to get this right? First of all, the way we train data scientists is that we only hand them good problems. That's where DJ Patil's quote was so important: when you get into the real world, data doesn't look the way it looks in the textbook. To double down on that point, too: you all know that I work at Virginia Commonwealth University; the company Data Blueprint is partly owned by them as well. And yet my colleagues from literally around the corner in computer science will come to me and say, Peter, do you have any data that looks like this? And I'll say, well, I'm not sure, but what are you trying to accomplish? And they'll say, because I've got a great algorithm. And you know right there that they're thinking about the problem incorrectly. Yes, we should be looking to improve existing things, but what we really have to do is find out what business problems organizations are actually facing, and then develop our capabilities as organizations to address them. A big piece of our capabilities is going to be our facility with the tools; the easier a tool is to use, and the faster it can be brought up to efficiency, the more it helps you reduce that gap. I've seen, for example, many organizations that think they need a master data management solution costing ten million dollars or so from one of the big five vendors be able to actually solve that particular problem with a couple of hundred thousand dollars of data development work. It truly seems miraculous, and I wish I could get paid a percentage of what I save, but that's not the way the world works. The other thing that happens, though, and David may want to elaborate on this after I finish: when you walk into an organization, every data scientist, literally a hundred percent of them, will stand up and tell you, I spend eighty percent of my time cleaning the data, munging the data, exactly the type of thing that David was talking about. That means if I take them and add some discipline around data management, and perhaps a little bit of technology, even if I only reduce their unproductive time from eighty percent to sixty percent, I've doubled their productivity. So that gives you an idea of the kind of leverage that we're facing on all of this. I don't know, David, maybe there's an example that you guys have run through, where you maybe shouldn't tell us the company or whatever, but can tell us something along those lines of how that was instructive and helpful? Yeah, I have a company in mind. It wasn't data scientists who were doing this work; it was a marketing intelligence firm with ten analysts, each of whom had about two customers. They would be getting all of this social media data, opening it up in Excel, and each spending pretty much half of their week trying to prepare this data in Excel in order to get it into reports that they could send back to their clients. And that was the value they were producing: these reports. So obviously, spending half of your work week in Excel doing all of this data preparation work is extremely frustrating. By bringing in Trifacta, they were able to create sort of a master recipe to automate all of this work. It wasn't particularly challenging work, but because they didn't have the right technologies in place, they couldn't do it productively, and they were spending so much time. Once they got it in place, a single person managed the data flowing through Trifacta, and every other analyst had all of that time freed up to work with additional clients, to find additional insights for their clients, to bring more value to their clients. I mean, nobody goes into the field of data to do this preparation work; that's not the glory work, that's not the fun work. So the more time you can free up to do work outside of that, the better. David, I'll give you a companion to that story as well. I worked with a group that had the very same type of problem, and we brought in some similar types of technologies to help them. But what was interesting was that we were able to justify the total cost of the project out of the savings in HR dollars, because the analysts were so dissatisfied with the work they were doing, they would only stay in the jobs about three months
and then they would leave. Just on reducing the turnover: they figured it cost them $50,000 every time they hired a new analyst, and you can do the numbers there and come up with some pretty good results. So again, what David is describing is not an uncommon story, and what's really sad to us as data professionals is that we can't get more people to understand this better. We have to blame ourselves, because we're not being as articulate as we could be, because it's a very clear-cut case to most of us. All right, I'll get off my soapbox. Back to you, Shannon. I love your soapboxes; those are good. So: in order to govern the data, we need data-minded people, and so far I've never seen established data roles, data steward, data owner, chief data officer, et cetera. Do you know of a standard description of these roles? Oh my goodness, I get to plug a book, one called The Case for the Chief Data Officer, available at amazon.com, by myself and my colleague Mike Gorman, whom I mentioned earlier. We also have a society, the Society of International, excuse me, the International Society of Chief Data Officers, that's working in conjunction with DAMA to help refine these. But I'll add something interesting to the question. When I go into organizations, one of the things that tells me about the maturity of the organization is the existence of these categories. Right now, the U.S. government only tracks two data categories: database administrator and data administrator. They don't track data scientists; they don't track anything else that goes into these roles. And it's unfortunate, because it doesn't give us a lot of good data about ourselves as a profession. So DAMA may in fact have to start doing this, but our goal is to put some meat on the bones of this, again, as I said before, an immature discipline, and to get people to realize that this is something we actually need to pay more attention to, to formalize these things at least within our organizations. So if I walk into a company and they've got categories like those in the question, they're a pretty mature organization. Whereas if all they've got is, yeah, there's somebody who handles data, he's over there in the corner, that's probably not an organization that's high on this maturity scale. That's a great question, very interesting. David, any comments on that? No, I think you covered it. All right: an organization has generated its critical data elements and profiled them. What would you recommend as next steps, other than mounting projects to remediate discovered data quality issues? Looking at the data at rest is good; the next step up for an organization is to look at data in motion and to start to trace the workflows through. One of the things I promote is something I call a more active data governance. Data governance will increase the quality of data as a result of people doing things and putting new controls in place, and eventually that data will come in. It's kind of like sitting at the bottom of Niagara Falls and expecting somebody upstream to materially change the quality of the water; if you've ever been to Niagara Falls, you know that changing that water quality is going to take quite a bit of work. So I call most data governance efforts around that sort of thing passive data governance; it's not really passive, but it gives you a sense of the scale. I also like
the data governance professional group to be responsible for going in and remediating data quality problems and data architecture challenges. And while those are really good things to work on, you can go even a bit beyond that if I can: part of the data governance charter, and the charter of the chief data officer, is something that we used to call business process reengineering. Everybody in the organization, as you increase literacy through the organization, should be responsible for knowing where the data comes from and where it goes to. A very simple example: one hospital system I worked with had a default admission code, and most of the people who did the admitting did not bother to enter an admission code, because they were paid to optimize for speed, so they would bypass it. Everybody in the organization knew that data was not good, except for the hospital administrator, who then decided they were going to go off and do a bunch of knee surgery, which was the default hospital admission code. A well-designed practice would have eliminated that process and prevented that mistake from being made. In this case it wasn't a large mistake, but nevertheless, people do make good decisions on bad data, and so we've got to put those processes and practices in place. And we need some tools to go back and help us, because we have to pay down the technical debt we've incurred from so much neglect up to this point, and you need professionals to deal with past neglect. A follow-up to the question: when would be the time to start looking for a tool to help manage these things moving forward? I'll jump in here and say what I'm pretty sure David will want me to say anyway, which I believe in 100%: he just told you there's a free one out there. Go use it. If you don't know things about these tools, almost all of them have some sort of equivalent or free version that you can download. And if you don't have time, I understand; everybody's stressed, and we're in a period of, you know, pandemic at this point, but at some point we will get to some semblance of a new normal, whatever that happens to be. Go find your local college or university; if you don't have time for it, I guarantee you there's an undergraduate just dying to do something like that. Say, hey, here's some stuff, show me what you can do with these types of things, and you'd be amazed at the results that come back. Then they can teach you, instead of you having to learn it yourself. But now is the time: if you're doing this all without automation, I will estimate there are a lot of savings that we can find in your organization. David, anything to add on that? Do you guys have to do business cases around this? No, I think you're right. There are so many ways to start experimenting with different tools out there; everyone's got free trials or free perpetual products, so it's just a matter of doing it one step at a time, figuring out what the most important things to tackle first are, and starting to fill in some of the gaps that you think you have. That's really good guidance. I've taken some of my graduate class this semester and pointed them at a data set called usaspending.gov, which has seven-sixths, excuse me, five-sixths of the entire US budget for the past 20 years out there. And they will say something like, oh, I'd really like to find out whether the flavor of grants increased or decreased under various administrations. Well, that turns out to be a PhD dissertation. So David's advice to start small and build up is a much, much better way to do it. I've seen groups spend years doing that and discover there was nothing they could actually learn from the data; even though the data existed and they were able to retrieve it and put it into a model, that doesn't mean you're going to get results. Start small and build up from there; absolutely the best type of advice. And David, this question is specifically for you: does
Trifacta cover the complete data management cycle process, like data architecture, business process metadata, data integration, et cetera? It doesn't cover the whole data management process. What we're mostly focused on is self-service data preparation: giving business teams, data scientists, data analysts, even some data engineers, a collaborative platform where they can access their data from data warehouses or data lakes or databases, profile and prepare that data, and then publish it back to a data warehouse or directly to a BI tool. So generally speaking, it sits after data has landed in the centralized enterprise data warehouse or data lake, but before the analytics process. It's not a metadata management tool; it's not a data movement tool. It's more data refinement, data preparation. Hopefully that answers the question. I think that one slide you showed, about the 80 percent: that 80 percent is really your market, right? Yep, that's exactly right. So think about that, guys; it's pretty cool. Grab a copy and take a look. I think we've got time for a few more questions here. If an organization has generated its inventory of critical data and profiled it, what would you recommend as next steps other than, oh, I already asked that one; sorry, I'm just going to skip that question. I can give you the same answer if you want: find your workflow pieces and think of your intelligence. I was just going down the line. Data management is so important; why is this topic neglected by almost all enterprises? Does it mean you can get along without sophisticated enterprise data architecture, even though the functionality and operability are essential? I think too many do try to. For 30 years, the academic community has been telling people that the only thing you need to know about data is how to build a new Oracle database, and that's because Oracle gives it away. Which means that our managers, as well as people who are not data people, look at what we do and say: you're only needed when you're building new databases. That's why we're not invited to conversations; that's why we don't have visibility in certain initiatives. I found out at one point that one third of the ERP implementations going on in one specific year had no involvement from any data people. Can you imagine such a thing? So we have literally taught people incorrectly, and I just say that as a professor; I apologize, because you have spent a lot of money to send your students off to be taught incorrectly, and that's a crime, and something should be done about it. I hope I'm not putting people off, but we definitely need to change it. Again, obviously, Shannon, you've got me on a soapbox on that one. David, anything you want to add there? Nope, I don't think so; I think Peter has it. Sorry, let me take all the hits on that one. No, good. So actually, that applies to the three of us, though. I had a course in databases in my undergraduate; when I went to my master's program, they gave me the same database course a second time, which was sort of redundant. And at my PhD level, they gave me an exam, and I happened to have read the chapter from which the exam came, so I looked like a database expert, and I was not at that point. Again, I'd had those two pieces, but I didn't really start studying data until I got into the defense department in the late 80s and started working with DAMA. So I did not have the benefit of that training, and I know what we've been teaching people since then. David, did you have a course in data, and what did they tell you? So I studied math in undergrad, and all of the statistics courses that I took had very clean data sets. So not super real-world scenarios that we were presented; instead, it was: here's this, you know, perfect clean data set, let's do some analysis or regression on this data set
without any of the upfront work that often needs to be done in real-world scenarios. So that was generally my experience. And I've never asked: Shannon, what courses did you have in data? I've got you here. Oh my goodness, you're putting me on the spot. The school of hard knocks is a perfectly good answer, right? Expertise from having put together all these programs all these years. Yeah, well, you know, I've produced about 500 webinars, so it's been a crash course; there's so much to love in the content on the website. Yeah. So even in data journalism as a career field, we teach them how to process the data once they get nicely curated data sets, and that's a really important skill, but at the same time there are a lot of other aspects of this, and I'm just so frustrated that we haven't solved that problem. So let me just go back to one slide that we had early on. There you go. I will say, I was an analyst in my past life. And not just an analyst; you actually managed business flows too. So yes. So, you know, if this is data, right, data science is a piece of this, but so is the rest of these things. And if we don't even tell people; I mean, if you talk to data scientists, they don't think any of the rest of this stuff exists, because they've never been told about it, and the same thing happens in way too many places. So we need to make more specialists around data, and we need to put data-specific qualifications in our knowledge worker job descriptions, and I think that will make a difference; we'll start to drive the market. Indeed. A surgeon I know just started getting into research, and in her first massive project she started working with a data scientist, and finally the light bulb went on: now I understand what you do, I understand who you work with. That's very good. Well, that is right; we are wrapping up right at the end of the hour here. I do want to say thank you so much, and thanks to Trifacta for sponsoring. David, thank you so much for joining us. Again, just a reminder to everybody: I will be sending out a follow-up email to all registrants by end of day Thursday, with links to the slides, links to the recording of this session, and any additional information requested throughout. We've got a couple of pieces; we'll get some information on the Trifacta download for you as well, or the free trial. Again, Peter, thank you so much; David, thank you so much; and thanks to all of our attendees. Hope you all have a great day, and stay safe out there. We'll all get bored sooner or later. Take care. All right, thank you.
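[Editor's note: the leverage arithmetic cited during the session, cutting data scientists' unproductive time from 80% to 60% to double their output, and the $50,000-per-analyst turnover cost, can be sketched as a quick back-of-the-envelope calculation. This is a minimal illustration; the function and variable names are ours, not from Trifacta or any tool discussed, and the ten-analyst, three-month-tenure figures are taken from the anecdotes in the discussion.]

```python
def productive_fraction(unproductive_pct: float) -> float:
    """Share of a knowledge worker's week left for real analysis."""
    return 1.0 - unproductive_pct / 100.0

# Cutting data cleanup from 80% to 60% of the week doubles productive output:
# productive time goes from 0.2 of the week to 0.4 of the week.
gain = productive_fraction(60) / productive_fraction(80)
print(gain)  # 2.0

# Turnover side: ten analysts churning every three months, at $50,000 per
# replacement hire, is the annual hiring cost that better tooling can avoid.
hires_avoided_per_year = 10 * (12 // 3)
annual_turnover_cost = hires_avoided_per_year * 50_000
print(annual_turnover_cost)  # 2000000
```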