Oh, and welcome. My name is Shannon Kemp, and I'm the Chief Digital Manager for DATAVERSITY. We would like to thank you for joining today's DATAVERSITY webinar, Data Structures: The Cornerstone of Your Data's Home, the latest in a monthly series called Data-Ed Online with Dr. Peter Aiken, brought to you in partnership with Data Blueprint. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. If you'd like to chat with us or with each other, we certainly encourage you to do so; just click the chat icon in the upper right-hand corner for that feature. For questions, we'll be collecting them via the Q&A in the bottom right-hand corner of your screen. Or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag #DataEd. To answer the most commonly asked questions, as always, we will send a follow-up email to all registrants within two business days containing links to the slides. And yes, we are recording, and we'll likewise send links to the recording of the session as well as any information requested throughout the webinar. Now, let me introduce our speaker for today. Peter Aiken is an internationally recognized data management thought leader. Many of you already know him or have seen him at conferences worldwide. He has more than 30 years of experience and has received many awards for his outstanding contributions to the profession. Peter is also the founding director of Data Blueprint. He has written dozens of articles and ten books now. The most recent is Monetizing Data Management. What's the most recent one now, Peter? We're up to Data Strategy now, Shannon. That's right. I love it. Peter has experience with more than 500 data management practices in 20 countries and is consistently named a top data management expert.
Some of the most important and largest organizations in the world have sought out his and Data Blueprint's expertise. Peter has spent multi-year immersions with groups as diverse as the U.S. Department of Defense, Deutsche Bank, Nokia, Wells Fargo, the Commonwealth of Virginia, and Walmart. He often appears at conferences and is constantly traveling. Today, Peter is joined by his colleague, Tom Garland. Tom is a data consultant for Data Blueprint and a 30-plus-year veteran of IT. Tom has done everything from quality assurance to programming, data analysis to architecting, and business intelligence to project management, all of this across a variety of sectors and industries such as finance, private health care, charity health care, government services, construction, discrete manufacturing, process manufacturing, retail, telecommunications, and consulting. And with that, I will turn the webinar over to Peter and Tom to get today's webinar started. Hello and welcome. Thanks, Shannon. We're so glad to be here. And just one more little point on Tom: he's really fanatical about those Rhodesian Ridgebacks. So we've got seven of them at home, you said? Seven, ranging from 12 years of age right down to a few months old with our most recent ones. And some of them are award-winning. We have conformation-winning dogs, as well as champions in lure coursing, tracking dogs, and a variety of other titles. Fantastic. So the next time my dog gets lost, your dog can come find my dog. Let's hope so. Okay, now I've got a beagle, but she's deaf and really old. Beagles tend to be really loud, so that shouldn't be a problem. Yeah, she doesn't bark very much because she doesn't hear very much. No? Okay. We'll get back to the data stuff. Thanks for joining me today, Tom. It's a real pleasure. Glad to be here. So today's topic is data structures. And what we wanted to do was dive down to the most granular construct that we have in data management.
Data, of course, are the original atomic pieces; that's the most granular construct. But all of the data has to be fit together in some fashion in order to get it to work. So that's what we're going to talk about today: how do you put these little granular pieces of data together into structures? And maybe not so much how, but there's a bunch of strategic considerations around them. That's what we want to look at. So we'll talk a little bit about how this fits into the DMBoK and all the rest of the things that we usually do at the top. Then we'll get to what a data structure is. And we'll talk a little bit about history, because history, unfortunately, is kind of important here. Then we'll get to why data structures are so specifically important. And we'll look at that from the perspective of different data personas. Actually, I think the plural is probably personae, but I'm not sure; my English isn't all that great. But we'll talk about it from the perspective of who's thinking about it. That's really the key. And then we'll look at some data typology pieces and talk about internal data structures that fit specific needs. Of course, we're looking forward to your questions and answers as we finish this. As you've seen, Tom has a lot of experience; between the two of us, we ought to be able to answer just about every question we're going to get. So as usual, we start out with Maslow. Maslow, for some reason, everybody remembers from high school: if your food, clothing, and shelter needs are unmet, then you will not be safe. And if you are not safe, then you can never belong to a group or be in love. You have to have love and belonging before you get to esteem. And these are all necessary but insufficient prerequisites to getting to self-actualization at the top. Data is very much like that. And we've been working on this for many, many years, trying to get people to buy into it.
It's one of the simplest messages for management to get, but it's one of the ones they understand the least. So there are a lot of technologies in the golden triangle of what everybody wants to do with data: master data management systems, data mining, big data analytics, warehousing, service-oriented architectures. These all have a technological focus, but that really represents just the tip of the iceberg. And if you don't put these foundational practices in place before you try the other things, the other things will take longer, cost more, deliver less, and cause greater risk to the organization than if instead you learn to do it properly. One other point on this slide: the bottom half has a lot more to do with organizational capabilities as opposed to the technology focus up on the top. So we like the well-rounded solution. You'll notice, too, that those five areas, the foundational practices we just mentioned, have now moved over into another diagram called the Data Management Maturity model from our friends at the CMMI Institute. And in the first part of this structure, we look at data quality. How big is your data in terms of the number of data points you can collect? How complete is your data? You may have a whole bunch of data, but take addresses, for example: do you have all of your zip codes filled in? Is it complete? Is it correct? We all know people change addresses pretty quickly, so it's very difficult to keep those up to date. And of course, we always want as much of the most recent or up-to-date data as we can get. For operations, we're constantly looking at our day-to-day usage and collection of our data. Do we collect it on a routine basis? Do we actually deal with it in an efficient manner? For platform and architecture, yes, you can collect your data, but can you effectively analyze it? Can you manipulate it? Can you arrive at intelligence?
Everybody wants actionable intel, but without the current tools and hardware, are you actually getting there? And of course, governance deals with a whole bunch of different questions: do you recognize that your data actually has a lifecycle? Do you really take it from very first collection all the way through the entire process? Do you archive it? Do you have an end-of-life process? Do you put security policies in place? And have you identified the persons in your organization who maintain this data? So really key to this is that if you do all of these things well, it'll make everything that you want to do at the top of the pyramid a whole lot easier. One additional point on this, too: we've got some supporting practices in place, but the foundation that we've just described to you is only as strong as the weakest link. Now, we rate these on a one-to-five scale. This is the CMMI scale. Even if you haven't heard of it, it's a pretty safe bet that your boss has heard of it. So we like using the CMMI scale because it is the most well-known, the most tested, the most empirically validated piece in here. Just very briefly, the way it works: if you have a pulse, you get one point. If you have repeatable practices, you have two points. If you have them documented well, you get three points. If somebody's measuring how long it takes to do the different things in your documentation, now you have four points. And then if periodically you get together and ask, can we do it better, now you have five points. So that's the basis for this grading scale. We like it a lot, as I mentioned before. Let's just give a quick example of this.
If I rate the four activities that Tom just described (quality, operations, platform and architecture, and governance) at a three, which is to say they are proceeding with some documentation, but the overall data management strategy piece is only rated at a one because they haven't started yet, the foundation can only be as strong as the weakest link. Consequently, it won't do this organization any good to put any more money into governance or quality or operations or platform, because until they take that one and bring it up to a three, none of the rest of this stuff is going to make sense. We do a whole webinar on just that topic coming up at some point, so you can catch us on that one if you need to. Finally, the last piece of this: we're real proud. DAMA has finally put out the DMBoK version 2. And I say finally because it's been a lot of effort. Again, in particular, thanks to Laura Sebastian-Coleman for doing a marvelous job. For those of you that have copies of it already, we're real pleased, and hopefully you are enjoying the new learning opportunities that we have all the way throughout. So let's talk about what a data structure is, and one of the easiest ways to talk about what data structures are is to talk about when they aren't. So this is a picture from a hoarder. We're not going to name any names, although my dad could totally fall into that picture. He doesn't listen to these things, so I don't think he'll get too mad at me. My dad never likes to throw anything away. But of course, what you're seeing here is not very good organization. Yes, somebody may in fact be able to tell you that there's a piece of paper with a phone number that you need over in the back left-hand corner of this room, but even getting to it might be a difficult process. So that's why you need data structures, to avoid this. But let's talk a little bit more about unstructured data and transforming it.
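As an aside, the weakest-link scoring in the rating example just given can be sketched in a few lines. This is a minimal illustration, not anything from the slide deck; the category names follow the talk, and the ratings are the hypothetical ones from the example.

```python
# Hypothetical DMM-style ratings on the CMMI one-to-five scale,
# mirroring the example: four areas at a three, strategy at a one.
ratings = {
    "data quality": 3,
    "operations": 3,
    "platform and architecture": 3,
    "governance": 3,
    "data management strategy": 1,
}

# The foundation is only as strong as the weakest link, so its
# effective maturity is the minimum rating, not the average.
effective_maturity = min(ratings.values())
weakest = min(ratings, key=ratings.get)

print(effective_maturity)  # 1
print(weakest)             # data management strategy
```

The point of using `min` rather than an average is exactly the argument above: raising any rating other than the weakest one does not move the result at all.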
It is very difficult to transform or bring unstructured data into a usable format. Most of the time, what we do with unstructured data, as the wine glass analogy shows us, is actually accomplished by adding boundaries. The data is not transformed, but the material can now at least be handled: wrapped so that you can deal with it. So really, the proper response is this: if somebody tells you that they're going to take your unstructured data and turn it into structured data, hand them a glass of water and ask them to turn it into wine, or a lump of coal and ask them to turn it into gold. They're going to have about as much luck with each of those. The proper way to describe this, then, would be non-tabular data and tabular data, and that is something that is very realistic and can be done. The real operational question is how much you need to have transformed. Let's talk about these data structures. Here's an example of a definition that we pulled off of NIST, the National Institute of Standards and Technology. There are also lots of good sites out there that describe this, including Wikipedia. So a data structure is an organization of information, usually in memory. Now, we make a point of saying in memory because you may have your data stored on disk differently than it's organized inside of the computer; for now, just take it that storing data on disk has different characteristics than processing it, and that's what we're trying to get at. So we talk about better efficiency in most cases. You may hear data structures named as queues, stacks, linked lists, heaps, dictionaries, and trees. That may make sense to some people, and it may not make sense to others. If you have a computer science background, you're probably familiar with these data structures. There's another aspect of it as well, which is the definition of a conceptual unity, which just means that it means one thing.
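For readers without that computer science background, the structures the NIST definition names trade memory organization for different access patterns. A minimal Python sketch, with invented values purely for illustration:

```python
from collections import deque
import heapq

# Queue: first in, first out (e.g., requests served in arrival order).
queue = deque(["a", "b", "c"])
assert queue.popleft() == "a"   # "a" arrived first, so it leaves first

# Stack: last in, first out (e.g., undo history).
stack = ["a", "b", "c"]
assert stack.pop() == "c"       # "c" was pushed last, so it leaves first

# Dictionary: direct lookup by key instead of scanning every record.
phone_book = {"Peter": "555-0100", "Tom": "555-0101"}  # invented numbers
assert phone_book["Tom"] == "555-0101"

# Heap: the smallest element is always cheaply available at the front.
heap = [5, 1, 4]
heapq.heapify(heap)
assert heap[0] == 1
```

Each structure answers a different question quickly, which is why the choice of structure, not just the data itself, determines efficiency.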
In this case, the linking of me with my address, the householding construct, means Peter lives at that house until Peter doesn't live at the house anymore. We may also, though, in the way we're presenting or transferring information, include redundant information. For example, take a list of things. There was an interesting LinkedIn posting recently where people were commenting on the number of three-letter acronyms after one of my colleagues' names. He's got five master's degrees, so he's got every right to have all those initials after his name. But if we're going to put the list out there and then say the list has five things in it, that's redundant information, because of course a computer can count the number of things in the list. So we don't necessarily need to store that, or the number of nodes in a subtree, or whatever it may be. There are other characteristics that you use with data structures here as well. These include things like the rules for processing it; we call that the grammar of the data structures. We talk about constraints, where there may be too many of certain things or too few. Brazil, at one point, actually made the primary key for voting the parents' names and the date of birth, which works perfectly for all of the population except for the twins. And there was at least one pair of twins that could not vote because of that bad data structure. What order are we constraining the data into? For example, if you ever analyze a list of names and find out that all the last names tend to start with A, you've probably got the sorted front of a list. It's unlikely that all the last names would really start with that. In fact, there was a very interesting analysis of a dating site that I read about recently; the book is called Data for the People. Very interesting book. I'm only about halfway through it.
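The Brazilian voting example is a classic uniqueness failure: the chosen key, parents' names plus date of birth, collides for twins. A hypothetical sketch of the problem (all names and dates are invented):

```python
# Hypothetical voter records; a pair of twins shares parents and birth date.
voters = [
    {"name": "Ana",   "parents": ("Maria", "Jose"), "dob": "1980-03-14"},
    {"name": "Bruna", "parents": ("Maria", "Jose"), "dob": "1980-03-14"},  # Ana's twin
]

# The chosen "primary key": parents' names plus date of birth.
keys = [(v["parents"], v["dob"]) for v in voters]

# Twins produce the same key, so it cannot identify a unique person.
has_collision = len(set(keys)) < len(keys)
print(has_collision)  # True
```

A key only works if it is provably unique for every record it has to identify, which is exactly the constraint the Brazilian design violated.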
But he found out that on the dating site there were so many more people who were 29 than were 30 that it was just not possible for that to be real. People will lie about their age so that, if someone says they're only looking for people in a certain range, they fall into that range. I'm not really 30 anymore, that sort of thing. Uniqueness: how do we identify a specific individual out of all of those? What's the arrangement of things? Can it be hierarchical, relational, or network? What sort of balance do we have in terms of how we're setting it up, and what sort of optimality are we trying for? Tom, one of the first conversations you and I had was about price, quality, and speed. Yes. And you said? Pick two of the three. Exactly. Pick two of the three, because you cannot possibly have all three of them. And that's what we mean by optimality. We may have robustness, which may be very important relative to your checking account; but your Facebook page? I don't know, do you have a Facebook page? Of course. Okay. Well, I have one, but I don't know what to do with it. So you're better off than I am on that. But anyway, if somebody looked at my Facebook page versus yours, they'd probably see you have something to do with the dogs and connecting people up with that. I obviously don't do much with mine, so I don't do any of those sorts of things there. But, you know, you arrange it for a certain purpose; it's oriented toward certain things. You don't tweet, do you? I don't. I tweet badly. Well, okay. Just like some other people. Yeah, exactly. We'll try not to get too far down that particular rabbit hole, but yeah. Here we go. Back on the right track. Okay. I could take it. I tweet better than some. That works. Okay. Data structures are typically expressed, as it says here, in architectures. We start with our details.
We might have customers, and we pull together all the information on those customers, or the invoices those customers generate, or vendors and the payments out to those vendors. When we have those data structures in place, we can pull them together into slightly larger scenarios like models. This typically takes the form of an application: you might have an accounting application or a sales application or a marketing application. Those get brought together into an architecture, where you pull the various applications together and it meets a larger need. Our words for these on the data structure side are attributes, which we organize into entities and objects. An example of these entities or objects might be a person, a patient, or an employee. They have attributes such as name, date of birth, their residence, the number of children they may have, or a phone number. I'll just point out something here, too. I put down DOB when you told me date of birth the other day as we were pulling this together. You and I know what DOB is. Correct. It's very likely people who don't know and weren't part of that conversation would have a little bit of trouble figuring out what a DOB was. Although, again, with our experience in this, we would look at it in an address field or a person record and say, that's probably date of birth; but we would always ask the question, wouldn't we? We'd have to, because you can't assume that. So attributes, then entities? The entities, of course, pull together into something like a sales model, an accounting model, or even a reports model. Why would you have a reporting model different from the actual application model? Is that an example of what I was talking about before, that data are stored differently for storage versus being processed? Yeah. In fact, in a reports model, you may be pulling from several different applications to merge it all together. Fantastic. And of course, the overall guidance of all of this is the architecture.
And that's really the point of what you're trying to get to: pull it all together so that you have something like a financial architecture or a business intelligence architecture. Or one I'm going to show you on the next page, which is a big architecture that none of you can read. So let's make it a little bigger. That's not helpful either, is it? The reason Tom and I are chuckling on this one is that architectures are very hard to conceptualize, but that really is what your architecture is comprised of: smaller things put together into larger constructs, put together into even larger constructs, whose purpose is to work together. The example that you just saw there was a back office trading system that was built for one of the large banks. It was about a five-year effort for them to re-architect their back office trading system, and we had it in exactly this fashion. So let's move on to a little bit of history. And the reason this is important: we'll go ahead and take that bank we were talking about. They built a beautiful system. It was very, very nice. The best system. Okay, now, I didn't say it. You've got to be starting to think in another direction here now. They were automating their existing processes. So for them, when they built the system, the data processing manager was taking these punched cards that we used to run through the systems, keeping track of where the cards went and who had the cards, and we printed the results on paper. It was pretty straightforward. We called it paving the cow paths, because we weren't really thinking about re-engineering the business in those days. We were just taking existing tasks and making them faster. Automation. Automation. What that did, of course, was create lots and lots of information silos. And each of those silos had different ways of storing and processing the information. In charge of all that, we put somebody called a data processing manager.
We used that term for a little while. Then we came along a little further and said, nah, they really need a better title; let's call them chief information officers. Interestingly enough, however, they really don't have the same types of qualifications as we would like to have in the financial world. Just a little-known fact: the average CIO has a tenure somewhere between two to four years, depending on how you measure it, whereas the average CFO for a comparably sized organization is 12 years. That's a lot of difference. And the reason is that when we hire a CFO, we know we're hiring somebody with a CPA. They might have a Master of Accountancy degree. There are recognized specific degrees that are insufficient but necessary prerequisites to the qualification. When, however, it gets to the CIO, things tend to disappear a little bit. Do the candidates really climb through the ranks of the IT department? Or is it far more common that they come from elsewhere in the business and not typically from a technical discipline? In many cases, I think most people know that when the CIO does achieve his position, it's mainly because he speaks business speak rather than geek speak. No CEO wants to talk to somebody who's constantly throwing out geek-speak terms. And if you think about it, how often does the CIO really talk about things like TCP/IP addressing, subnets, clustering, indexing, API calls, or multi-threading? That's not typically what they do, mainly because they tend to come from the business sector. How many CFOs do you know that actually started in the accounting department and worked their way up? The vast majority. How many VPs of sales started in the sales department? The vast majority. How many CIOs started in the technical department? Almost none. We're working on it. So our goal here is to say: we are currently immature, it's not okay for us to stay immature, and we do have to do some catching up, because our accountants have about an 8,000-year head start.
Yes, they do. Another real big point, and we can't really express the source of our frustration on this enough: we teach knowledge workers virtually nothing about data. And yet that's the definition of a knowledge worker: somebody who uses data 100% of the time. So let's dive in a little bit further. Why are data structures important? The main reason is that what you're trying to do with data is what you're trying to do with all of your assets: you are trying to leverage it. Now, leveraging is an engineering term. And it's important to understand that, from a leveraging standpoint, we also understand and employ leverage-enabling technology. In this case, the lever is shown as the lever itself, and we put a fulcrum in there. Those are the two technology pieces. There are people involved, and there's a process. In order to get something large to move, we need all of that. Data is exactly the same way. We need to implement data-centric technologies, processes, and human skill sets. But we also need to eliminate much of the ROT in our data. ROT data is data that is redundant, obsolete, or trivial. If you remember back to that picture of the hoarder's room a couple slides ago, there's no telling what in there is redundant, obsolete, or trivial. But we can guarantee that pretty much a lot of it is. So helping organizations understand that treating data more asset-like lowers IT costs and increases organizational knowledge worker productivity: this is a wonderful opportunity. And in that area, I tend to define an asset as something that's used to generate revenue. How much data do we really look at and say, hey, we generated revenue because of that data? I think it's too few in today's market. Definitely too few. And of course, if it's raw, it's hard to do anything with it. Correct. We want it just right: not too much, not too little.
All right, well, let's get our levers on there and talk a little bit more about data structures in a development context. In a development context, it works out that many applications or projects are developed in a very narrow, siloed type of environment. And in many cases, that's exactly what happens. We generate our new application for pulling together accounting information. Well, who holds the accounting information? The accountants. When we have sales information, who holds the sales data? The salespeople. And it works out that every time we develop a new application, it's rare that those groups talk to each other. There should be somebody in charge of that. Yeah, there should be. But as it turns out, the people who really make the decisions are the ones who lord over the data in the application. This is the very definition of siloed data. And the data collectors end up operating virtual silos, controlling who gets the data. Here's a couple of examples that I pulled from one of the groups that we were working with at one point. Just a very difficult query. We're not going to try to parse this, but you can see that whatever we're asking the machine to do, it's a lot. And yet this group didn't know that there was such a process as query optimization, which is simply going through and seeing if you can make the whole process more efficient. And you can see that this representation is a lot simpler than the one before it. Now, I've literally been to places where they run a query billions of times during the day. And if you could shave just a tiny, tiny piece off of each one of those, you would help organizations avoid this death by a thousand cuts, which is what we tend to get.
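The essence of query optimization is getting the same answer with less work. A toy Python sketch, not taken from the query on the slide: the same customer-to-invoice join done first with a naive nested-loop scan, then with a hash lookup. The tables and sizes are invented for illustration.

```python
# Invented sample tables: customers and the invoices they generate.
customers = [{"id": i, "name": f"cust{i}"} for i in range(200)]
invoices = [{"cust_id": i % 200, "amount": i} for i in range(1000)]

# Unoptimized: for every invoice, scan every customer
# (200 x 1000 = 200,000 comparisons).
slow = [(c["name"], inv["amount"])
        for inv in invoices
        for c in customers
        if c["id"] == inv["cust_id"]]

# Optimized: build a hash index once, then do constant-time lookups
# (roughly 200 + 1000 steps instead of 200,000).
by_id = {c["id"]: c for c in customers}
fast = [(by_id[inv["cust_id"]]["name"], inv["amount"]) for inv in invoices]

assert slow == fast  # identical result, a fraction of the work
```

Shaving work like this off a query that runs billions of times a day is exactly the "death by a thousand cuts" point above: tiny per-query savings compound enormously.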
Yeah, a billion of anything is such a huge number that any operating inefficiency will consume literally orders of magnitude more time, effort, and money to process. I used to work at the phone company, and this was back in the 90s, so of course we did an awful lot of manual processing. It was very typical to have about a 2% error rate from the people processing all the telephone orders; 1% was considered very good because it was half the typical error rate. Well, now we're in the time of big data. 1% of a billion processes is 10 million. Can you imagine 10 million errors coming through your business? Or 10 million lost bags or diverted flights, or 10 million of anything? Anything. The level of data processing has just exploded. We can't have that level of error anymore. It's absolutely intolerable. So data structures, organized into an architecture, help you to support strategy. Now, the first point is to understand that if you don't know what your data architecture is, it's really hard to make use of it as a strategic implement. The important things about an architecture are to know what it does, and better still, to know what it's capable of and what it does poorly. The real question is: were your systems explicitly designed to be integrated or otherwise work together? Probably not. And if they weren't, the likelihood that they will just happen to work well together is approaching zero. Data has got to work at the most granular level. It's absolutely fascinating to me to investigate IT failures; time after time after time, we find that the data was the real problem. But data structures can't be helpful if their structure is unknown. So we've got to come back and take a different look: are we trying to achieve efficiency and effectiveness, or are we trying to get the organization to be dexterous, really flexible and adaptable?
And here are the two examples that I have on the slide. If I had a restaurant that was serving a different physical dish for every item, so asparagus had its own dish, carrots had their own dish, et cetera, et cetera, and my goal was efficiency, then if I dropped the carrot dish, I would have to go back into the closet, find the replacement carrot dish, and bring it out. That is going to take longer than in an organization that says we're going to put all the vegetables on the same kind of plate, in which case I just grab the next plate off the clean plate pile and move on. Neither one is the universally appropriate solution; different organizations are going to have different goals. And if you don't understand your data structures, it's very difficult to get them to work in conjunction with what you're trying to achieve. Data models are used to support strategy in a lot of different ways, and I'm going to give you a couple of examples from my own experience. First of all, when you have a data model, a data structure that supports strategy, it can be flexible and adaptable. It results in cleaner and less complex code, because there are things that databases should do and things that code should do, and they get confused a lot. That's one of the things we go in and clean up on a regular basis. Yeah, that level of architecture is very typically something that has to be assessed on a case-by-case basis. And on the point of building in future capabilities, I would ask: can you be visionary enough to know what this entails? It's surprisingly easy. You would think that building in future capabilities would be very tough, because you won't know what your business needs are out in the future. But when you're designing your models, you don't need to know.
You just need to know that what you're building right now has to expose the data that it has, so that future applications can, in fact, make use of it. That's it. It's unnecessary to solve future problems. No bells, no whistles in your application development that cost big bucks and delay your delivery. Just open up the data for examination or replication so that future applications can make use of it. I usually approach this in a slightly different fashion, Tom. I will say to people, can anybody predict the future? And of course, if any of us could, we'd be in a different business than we're in right now. If we could do it accurately, that's for sure. Ah, good point. We can all predict the future; we just can't predict it accurately. Important question to ask, right? But you don't need to, in many cases. Well, and that's the point. And this is an objective characteristic. You, with your experience, can look at a data model and say, you know, you haven't really thought about all of the future possibilities that may occur here. There is a better design, a better organizational structure that you could put together that would be more flexible and adaptable in the future. And in many ways, a lot of companies look at this as risk. Elon Musk went so far as to open up the complete design of his recharging stations and let any future business decide to use them. That's a bit of a business risk. But he decided to do it because he knows for sure that if more of them get built, more electric cars will be sold. Are you on the Model 3 waiting list? I'm not, because I think that by the time I got to the top of the list, it would probably be about 2022. I had the same thought. Looks like a wonderful car; can't wait. It's going to be a few years. Exactly. Let me give you one more example on this slide, too: this is a data structure from a company that I worked for during the oil recession of 1979.
I only mention this because I was a manager at that particular organization, and when the economy went south because of the oil crisis, they told all managers to start selling. Of course, we said yes; that's what you do. Salespeople sell, and managers sell, too. We were doing retail on the floor in malls. Remember when there were shopping malls? When I went to report my sales as a manager, to show that I had been participating in the process, my data structure for manager was different than the salesperson's data structure. I had no place to report my sales. That was a very bad design, because didn't you think that somewhere in the future your managers might want to sell? Like you said, it's about exposing the data. A better approach would have been not to put in the structure that you see here, but in fact to give everybody one common data structure. That's one of the big takeaways from this, of course: the fewer data structures you actually have to manage, the better off your organization is going to be. Let's take a look at five basic data structures in the database world, just to get us closer to what we're looking at. There is the flat file, which is just something that starts at the beginning and works its way through. We already gave you the example: if the file happens to be sorted by last name and all the last names tend to start with A, it can take a while to find what you want. In order to look through this data structure to determine whether something is in its contents, you've got to read everything. If I'm looking for somebody named Townsend, I start at the beginning and say, Townsend? Townsend? Townsend? No, no, no, no, no, all the way through it. That is a very efficient data structure for certain types of problems. It's a very inefficient data structure for other types of problems. Correct.
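The flat-file scan described above can be sketched in a few lines. This is a minimal illustration, not anything from the speakers' slides; the record names are invented.

```python
# A flat-file lookup has no shortcuts: to find a record, you read every
# record in order until you hit a match or exhaust the file.
records = ["Adams", "Baker", "Chen", "Nguyen", "Townsend", "Ueda"]

def sequential_find(records, target):
    """Scan front to back; cost grows linearly with the file size."""
    for position, name in enumerate(records):
        if name == target:
            return position   # found it, after reading everything before it
    return -1                 # read the entire file and it wasn't there

print(sequential_find(records, "Townsend"))  # reads 5 records to find the 5th
```

A lookup near the front is cheap and a lookup near the back (or a miss) costs a full pass, which is exactly the "efficient for some problems, inefficient for others" trade-off being described.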
Another thing we said is, wow, for that Townsend, if I want to find the T's, I need some way of finding them quickly. So this is what we call indexed sequential, or virtual storage access method, VSAM, files, which maybe some of you have heard of. And it just says: start looking for the T's over there. An index is basically a trimmed-down version of the file as a whole. You start looking at the spot wherever you've statistically decided to make your breaks and separations, and it works incredibly fast. And it can yield some very unwieldy data structures, but that's part of what we've been doing for the past 30 years. There are always some ups and downs. Exactly. The next data structure is the network database. These are very efficient databases because they allow you to put linked lists together. For example, a good way to conceptualize this would be that my checking account has a balance, but the actual checks associated with it vary: I can write one check a month, or 20 checks a month, or 200 checks a month, and the bank doesn't care, as long as I've got money in the account. This becomes a very efficient way of processing those structures. What you're seeing here on the screen is a one-to-many type of relationship between these things. Very similar to the network database, almost a subset of it but not exactly, is the hierarchical database, which links all the data structures as parents and children. Something has a parent, something has a child, and you work within it that way. Again, a very good way of describing and representing data structures at work. But we have a problem, because what we've taught students for the last 30 years is that the only answer to your question is the relational database.
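The indexed-sequential idea can be sketched the same way: keep a small, trimmed-down index that maps a key prefix to the offset where that block of records begins, then scan only within the block. This is a toy illustration of the ISAM/VSAM concept, not how those access methods are actually implemented; the data and break points are invented.

```python
# A tiny index over a sorted file: first letter -> offset of that block.
records = sorted(["Adams", "Baker", "Chen", "Nguyen", "Townsend", "Ueda"])

index = {}
for offset, name in enumerate(records):
    index.setdefault(name[0], offset)   # remember where each letter starts

def indexed_find(records, index, target):
    """Jump straight to the target's block instead of scanning from record one."""
    start = index.get(target[0], len(records))
    for pos in range(start, len(records)):
        if records[pos] == target:
            return pos
        if records[pos][0] != target[0]:
            break                        # past the block; the record isn't here
    return -1
```

Looking up "Townsend" now starts at the T block rather than at "Adams", which is the "start looking for the T's over there" behavior described above.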
That is a problem, because now the chapters on network databases and hierarchical databases are not even included in the textbooks. Because of that, students don't know them. Now, the important part is that network and hierarchical databases are typically at least an order of magnitude faster than a relational database. It's not that a relational database is bad; again, it's the right tool for the right task. If you took non-relational database processing away from most big corporations that are doing it, they would stop processing the next day. We get things like the end-of-day job running 25 hours or 30 hours, or something like that. There are all sorts of other databases: associative, concept-oriented, multi-dimensional. None of these have really taken off. We will get to star schemas and data vaults and third normal form in just a little bit as well. And that leads right into the point that you can't really have one size fits all, because not everybody's needs are the same. This simple reality of varying platform architectures basically means you either face substantial migration efforts if you wish to pull it all together into a central database, or you're prevented from centralization altogether, because it just takes too much time and effort to put into place. Data marts, as a result, are starting to flourish. They're a lot better in terms of being much smaller and much more focused on what you're looking for. They're quick to build. They're usually populated by automation, so you don't really need manual processes pulling it all together. I once worked for a company, and this is a great example here. I worked for a company that was typically processing 50 invoices per month. Now, to be fair, the invoices were about $150,000 each, but there were only 50 of them per month. They were coming off of a legacy system that was DOS-based accounting. That tells you how long ago this was.
Their solution was to go to Oracle Financials. A tad bit of a leap. Yeah, the kind of system that can process five million orders a day, and they needed it for a 50-orders-a-month operation. They leapt right past where they should have gone. This should have been something much more akin to a Great Plains type of accounting package. But somehow they decided that one size fits all, and that if they were going to be operating a world-class system, they needed world-class software to do so. I've always heard that Oracle salespeople are very good at what they do. They certainly were at this company. The real key is that while we'd like the answer to be one data structure, it's just not possible. Let's look more specifically at that, and I'm going to add one more piece in here, too. A lot of people come to us and say, I'm not sure if my system is legacy or not. My definition of a legacy system is anything that's in production. Because if you go beyond that, it's just a question of how old: is it a day old or a year old? Are we really going to go around and track the ages of our systems? Anything that's in production is legacy, and that's how you need to treat it. Because what happens with legacy systems, and we talked about this a little bit earlier, is that they form their own standard types of data. And each of those types of data works really, really well on its own. However, when you go to tie it all together, that tying-it-all-together process generally takes an awful lot of kludging, if that term rings a bell, where you have to basically patch the systems together. And really, Tom, these are entirely avoidable situations. They hire you and me to come in and straighten these things out for them, because they're not schooled in this area and they don't really know how to do this. But if it had been designed properly in the first place, we never would have had to. Correct. That's sort of our goal.
Because that's where you end up with this kind of thing here, which is just a mess of one form or another. Or let's put it in some actual number terms. If I have six legacy applications that I want to connect, and I want to connect application one to applications two, three, four, five, and six, and application two to applications one, three, four, five, and six, et cetera, et cetera, there's a mathematical formula that tells us the upper theoretical limit of complexity here: N times N minus one, divided by two, where N is the number of systems. With six, that's six times five, which is 30, divided by two, which is 15 interfaces. Okay, so that's not really bad. But Royal Bank of Canada told me I could use their numbers. They have 200 major legacy application systems, and they have 5,000 interconnections between them; well, technically 4,900, but we'll call it 5,000 just to round it up. If you're the CIO or the CEO of that company and you ask how complex our environment is, and we say, well, we've got 200 systems, but they interconnect with each other in 5,000 different ways, you know you're paying for a lot of knowledge work and expertise to maintain that, assuming you even know what's going on. And we've both been to companies where they say, I don't know what's going on; you guys have to help us figure it out. That's very typical. And the one thing that I keep getting hit with is, oh, you know, we forgot some piece of information that we really need to put on the reports. How bad would it be to go in there and put this piece of information into the interface? And I'll give you an even better, related example. One of the things we keep discovering is that our key length is not long enough in some of these things. Yeah, people don't anticipate that they actually are going to get beyond the 2 billion mark in their invoices.
And inside the computers, 2.1 billion, the limit of a signed 32-bit integer, is a very harsh numerical reality that we're finding out is a limitation. Kind of like a Y2K thing, right? Yeah, exactly. Let's take a look at some numbers here in particular. Again, back to our n times n minus 1 divided by 2: if n is 6, the worst case is 15 interfaces. If the number is 60, then I have 1,770 interconnections. If I have 600 of them, the worst case I could possibly have is 179,700; call it 180,000. So let's look at where Royal Bank of Canada comes in. Remember, they had 200 systems; they could have had 19,900 interconnections, and they only hit 5,000. Oh, well, that's great. But then you think about what it takes to manage 5,000 interfaces, and oh my goodness, that's a lot of work no matter what. So let's talk about the next step in this, which is that people are trying to smooth the data structures and come up with convenient, compatible data structures that can be used for multiple applications, as opposed to just one pile of data. This gives us the idea that we can move this stuff to something called a hub processor. This is where it looks a lot simpler on PowerPoint than it actually is in the real world. That looks easy. I will tell you it's easier, but not easy. We're never going to get to easy in IT. So what we're looking at here is that we may want to expand the number of data structures in application one that are made visible, as Tom was saying earlier on. Another variable is the number of connections between the systems and the hub. If I want to add application number seven to this, I've only got one hub that I need to connect it to in this scenario. So that would be easier, although I still have to go back and figure out how many data structures in application seven I want to make visible, because I still need interconnectivity. When I'm going from application one to application four, I need to know what those rules are.
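The interface arithmetic above, and the hub-and-spoke alternative, can be checked with a couple of one-line functions. This is just a sketch of the counting argument the speakers use, nothing more.

```python
# Worst-case point-to-point interfaces among n systems: every pair connected.
def point_to_point(n):
    return n * (n - 1) // 2

# A hub design needs only one link per system: n links total.
def hub_and_spoke(n):
    return n

for n in (6, 60, 200, 600):
    print(n, point_to_point(n), hub_and_spoke(n))
# 6 -> 15, 60 -> 1,770, 200 -> 19,900, 600 -> 179,700 point-to-point interfaces
```

Royal Bank of Canada's 5,000 actual interconnections sit between the hub ideal of 200 and the point-to-point worst case of 19,900, which is why "only" 5,000 is both good news and still a lot to maintain.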
That was one of the things we were talking about a little bit before. So there's a lot of complexity here. Again, our goal is to make things easier, but not easy. The summary, then, is that one data structure is probably never enough, but most organizations have way, way, way too many as well. So the question is, how do you find what is correct? And the answer is, there's not a lot of expertise outside of our data community here that we can actually look to in order to do this. I'll very quickly run through a little bit on some data personas, because your perspective on this is going to differ depending on who you are. Now, we're going to deliver the rest of this as a spectrum going from operational to analytic. When I say a spectrum, it just means the real answer is probably somewhere in between, rather than at either extreme. So we do talk about operational versus analytic requirements. The first persona is an operational performer. That's somebody who is watching alerts, looking at the dashboards that Tom was describing earlier, and things like that. When a light goes off, they have a response that they're supposed to take. Then you may have a manager of that group, who may see too many people responding to lights and say, my people are responding to lights instead of watching the rest of the board, so we should do something about that, and they may go back and ask for something to be done at a more programmatic level. Then you have somebody called a data analyst who comes in. Again, these are not very precise definitions, and like I said, it's all across the spectrum, but the data analyst is really doing something like asking, should we go into this line of business, or should we sell umbrellas when it's raining? You know, things like that. Then we have a data miner or data scientist, representing a higher degree of expertise, and probably a higher pay structure as well. Finally, we get to an executive consumer who is reading the data.
And of course, reading the data is where they're trying to make decisions. So a manager might say, should I put umbrellas on sale today? And the key variable might be whether it's going to rain or not. These would be the results of the rain prediction: somebody says 40% chance of rain, and the manager says, oh, sure, I can risk selling umbrellas at below cost in order to get people into the store. So the key out of this is that there are going to be different interests depending on where you are on that spectrum. And notice where the lines cross versus where they start out: for the operational interest, the interest is highest when the data is most recent, whereas with analytics, the interest is higher once the data has existed for a period of time and knowledge can be gleaned from it. Additionally, transactional data, which is for operations, tends to be write-optimized, meaning that we're trying to get the data collected as fast as possible and put out into storage. Whereas on the analytical side, we want it read-optimized, so that we can read the data as fast as possible. And we're not too concerned about collection, because we already have it. Again, speaking in broad terms: operational write, analytic read. Correct. That's really where we're trying to get to on that. Excellent. All right, good. We talk specifically about standards used in here as well. And most everybody can picture a standard. It's not a lot of fun work, but it's definitely necessary to have it done. Again, if we didn't have TCP/IP or HTTP, we would not have the World Wide Web and the other things that go on with it. So we're real happy about them, but even within standards, there can be variants. So here are a bunch of different concrete blocks to illustrate that. Now, standards are what allow you to build these analytic capabilities, because if you don't have the standards well defined, the things you're looking at become less squishy.
I mean, less concrete, more squishy. Squishier. That's the word I was looking for. So here's an example of the kind of drill-down analysis somebody might get to do. Again, we're just going to pretend that somebody has said, I need to look through your customer list. Okay, what customers are you interested in? Well, I'm interested in the list of customers with income of less than $100,000, or customers who are younger than 30 years. You could say that's the 20-to-30-year-old crowd, right? That's what we're talking about. Not that either of us is on dating sites or anything like that, right? So: customers with income of less than $100,000, or younger than 30. Oh, that's right, I forgot to tell you: New York City. We want them in New York. Okay, cool. So we pick up New York, right? So we've gone from 30,000 customers in the first query to 6,000 in the next one. I'm going to add one other field onto that and say I want customers who purchased something in the last seven days. Maybe I'm doing a food recall or something like that. And if I'm doing a food recall, once I have those 800 customers, I can go back to the 40 suppliers, right? So again, we've taken this data set and chopped it up in an analytic framework to come up with the answer: these 40 suppliers supplied goods and services to customers who had income of less than $100,000, or who were younger than 30 years, and who live in New York State, and who purchased something in the last seven days. That's never going to be an operational question. The operational questions are going to be: did somebody buy something with this credit card on this date? So you can see the difference between those two pieces. We're going to take the last 13 minutes here and talk a little bit about how most of us see these data structures coming into play. And the answer around this really is positioned in a warehousing context.
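The drill-down just walked through is a sequence of narrowing filters over a customer set. Here is a minimal sketch of that funnel; the customer records, field names, and the reference date are all invented for illustration.

```python
from datetime import date, timedelta

today = date(2019, 6, 1)  # hypothetical "today" for the 7-day window

customers = [
    {"name": "A", "income": 85_000,  "age": 27, "city": "New York",
     "last_purchase": today - timedelta(days=2)},
    {"name": "B", "income": 140_000, "age": 45, "city": "New York",
     "last_purchase": today - timedelta(days=3)},
    {"name": "C", "income": 60_000,  "age": 52, "city": "Albany",
     "last_purchase": today - timedelta(days=30)},
]

# Each step narrows the previous result, just like the webinar's queries.
step1 = [c for c in customers if c["income"] < 100_000 or c["age"] < 30]
step2 = [c for c in step1 if c["city"] == "New York"]
step3 = [c for c in step2 if (today - c["last_purchase"]).days <= 7]
```

On real data the same predicates would run against a read-optimized analytic store (30,000 rows down to 6,000 down to 800), but the shape of the funnel is identical.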
However, what we're talking about can also be applied to transactional systems, et cetera, et cetera. Warehousing is just the easiest place for people to understand all of this. Okay, the wrong one. There we go. In transactional processing, very typically we have our master data, which is the sort of thing that is used across a variety of applications. I'm trying to think of the word. That's all right, I do it all the time; kind of got lost there. Master data is going to be the things that don't change as much as the transactional pieces. Correct. So my Social Security number, for example, would be a master data item. Yeah, it's the sort of thing that you know you're going to be needing over and over and over again, and you don't want to replicate it across all your different systems. So you basically store it in one place, and it becomes that system of record. And we use it routinely for our transaction processing in order, as mentioned before, to increase the speed with which you can do your transactions, which varies a little bit from data warehousing. That's you jabbing me in the ribs to turn the page, right? I think we're one more beyond that. There we go. That's where we have our operational data stores, where we do most of our analytical processing. Oh, I've got to check my page too. That's all right. So the key here, again, is to understand that different data structures are required for different sets of data. And the easiest way to envision it is the diagram that Tom and I both talked about a few minutes ago, where you've got a spectrum where, at one end, write speed is the most important criterion: trying to process those transactions as fast as possible. You've got an Internet website that's pulling in invoices, trying to sell concert tickets. You need to process these things almost instantaneously, in a fraction of a second.
As opposed to: once the concert's been sold out, what do your demographics look like? What market should we go to for this? Do you guys make money on the dog stuff? Not you personally, but is there a dog-showing industry that makes money? Strange as this may sound, regardless of how expensive you think a dog is, it turns out it's a break-even-at-best industry. It is purely hobby. We call horses holes in the ground that you pour money into. But how do you make a small fortune doing it? Well, you start out with a big fortune. The really key point, though, is what you're saying: somebody may come back and look across where dog shows have been held historically, and then say, oh, that's a good city for a dog show versus a bad city for a dog show. Correct. I know nothing about dog shows, so I'm making that part up. In the dog show world, it's more along the lines of trying to put as many of them as close together as possible. It's referred to as a cluster of shows, so you can spend the weekend or the week there, or whatever. Correct. Because you know a lot of people will be traveling several hours to get there, you may as well have a lot of activity for them when they arrive. So when you go, you're going out for a couple of days, right? Usually, yes. It's not uncommon for three days. Two days is typical, three days is common, and four or five days is not unheard of. So you'll hear this ODS versus enterprise data warehouse terminology and description when you walk into an organization. And people will oftentimes use these words correctly, and oftentimes use them incorrectly. Right. I'm more familiar with the enterprise data warehouse, where we actually do transform the data into a more usable form. The transactional systems, as I said, are optimized for throwing the data into the database. But somewhere along the line, you really do have to pull it out a bit faster.
And it does actually require making things a little less efficient in terms of writing, but that process makes it very efficient for reading. You can get your reports much faster when it's properly set up. And we'll do one more on these, just two more. Sorry, I'm trying to get the last little bit out, which is to say that there are data marts and event data stores and also metadata stores that we can get to, and all of those are going to go back and forth. The question is, where is it really appropriate to use each of these? So I'm jumping ahead to the slide here; we'll just go ahead and put all this up. Again, just talking about the two extremes: operational on one end, analytic on the other. The operational side is going to be more application-oriented, and it's going to be more volatile, because things may change. You may have airline flights that are scheduled to go and then, all of a sudden, airline flights that aren't scheduled to go, things like that. When we talk about atomic, it's the lowest level of granularity that's in there, and the current value of it, which represents exactly where it is right now. Go over to the analytics side, and we're looking at more integrated, more non-volatile data. It's data that's probably been aggregated, and it's probably been equalized from a time-variant perspective. So for example, one of the things you'll see in retail is that they all want to use the same retail calendar, so that my March will be the same as everybody else's March. On the analytics side, too, and this is something that I think a lot of people don't really think about: if you have a nationwide retail store, maybe you want to see what your sales are hour by hour, but you're looking at sales across time zones. Do you want to normalize for that? Because you might want to say, well, three o'clock here is not three o'clock there. That's going to make a difference. That's going to make a difference. Absolutely.
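The time-zone point can be made concrete with a tiny sketch: the same UTC instant falls in different local shopping hours at different stores, so hour-by-hour analysis has to decide which clock it means. The store names and offsets are hypothetical, and daylight saving is ignored for brevity.

```python
from datetime import datetime, timedelta

# Hypothetical per-store UTC offsets (standard time, no DST handling).
STORE_OFFSET_HOURS = {"NYC": -5, "Chicago": -6, "Denver": -7}

def local_hour(utc_sale_time, store):
    """Convert a UTC sale timestamp to the store's local hour of day."""
    return (utc_sale_time + timedelta(hours=STORE_OFFSET_HOURS[store])).hour

sale = datetime(2019, 3, 1, 20, 0)   # one sale, recorded at 20:00 UTC
# 20:00 UTC is 3 p.m. in NYC but 1 p.m. in Denver: the same instant lands
# in different local shopping hours, so a naive hourly roll-up mixes them.
```

This normalization belongs on the analytic side; as the speakers say, you can't do it transactionally and still expect to analyze local shopping patterns at the end.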
But you cannot do that sort of thing transactionally if you want to analyze it at the final end. So what we're going to talk about for this last chunk is how to put all of this together. There are really three options that you're facing. The first one is what we call the Inmon flavor. Bill and I are going to be on a program in Chicago this weekend at one of the universities, so we're looking forward to some of that. Bill has been credited with being the father of the data warehouse movement. He's the first person who really saw this from a production perspective. He has written, well, like Shannon said, I've got my tenth book out and I'm real proud of that accomplishment; Bill is up to book number 64. Yeah, my hat's off to him. I'm in awe of him every time I work with him. And just to put in a plug: his 64th book is called Turning Text into Gold, which is one of the better books on text mining out there. The second flavor that we talk about is Kimball, which is dimensional warehousing. Again, it represents a flavor of organization. And the third one is Data Vault. We've got about six minutes, so we'll go through them very, very briefly. The first one's key, again, is third normal form. Let me just put this picture up here. We could talk about any of these, but I want to explain it to you with the picture, because it's a little bit easier. There we go. That thing in orange in the middle takes data from disparate data sources all over the place. It may come in off of mobile, it may come in off of the web, it may come in off the cash registers. But it puts it out there in that most flexible and adaptable format. We'll call that third normal form. You can actually get a little further than that with analysis work, but it's a very good place.
And the goal is then to have purchasing, sales, and inventory, the three that we've got in this example, all be able to access that, because their systems are optimized and able to access it. Sometimes that's a happy outcome. Sometimes it doesn't work that way. Sometimes not. Absolutely. Catch Tom and me for a drink and he'll tell you stories, right? So the second design style is dimensional. And really, what Ralph Kimball did here was look at it and say, you know, the things that Inmon has done are really good for getting a perfect, complete picture. But sometimes you don't really want a perfect, complete picture. Sometimes you just need to know, what does store data look like? This is more analytical. So this gives us what we call the dimensional style. And what you'll see in here is that the center table of these dimensional designs is the thing itself, and then there are some variables around it. You can change the location, you can change the customer type, you can change the period of time, the time dimension, or the product. So you can get the top 10 products, the top 10 stores, the top 10 customers, whatever your analysis allows you to do. This works well if you design the data to be used in this fashion. Very well, if you design it in this fashion. It also tends to be very point-in-time specific, although you can set it up to be historical in many ways. What you see here on the slide is that same sort of staging area in the middle. It plays the same function as things are moved into it, but we're doing a very different organization of the data. And then we're putting it out, and we're replicating it in many cases, so that the sales data mart is different from the purchasing data mart, which is different from the inventory data mart, because each is optimized for its individual users.
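The dimensional idea (a central fact table surrounded by dimensions, queried for "top 10" style questions) can be sketched in a few lines. This is a toy illustration, not a real star-schema implementation; all the table contents are invented.

```python
from collections import defaultdict

# Dimension tables: surrogate key -> descriptive attribute.
product_dim = {1: "Umbrella", 2: "Raincoat", 3: "Boots"}
store_dim   = {10: "Midtown", 20: "Downtown"}

# Fact table rows: (product_key, store_key, sale_amount).
sales_fact = [(1, 10, 20.0), (1, 20, 25.0), (2, 10, 60.0), (3, 20, 40.0)]

# Roll up the facts along the product dimension.
revenue_by_product = defaultdict(float)
for product_key, store_key, amount in sales_fact:
    revenue_by_product[product_dim[product_key]] += amount

# "Top N products" is then just a sort over the roll-up.
top = sorted(revenue_by_product.items(), key=lambda kv: kv[1], reverse=True)
```

Swapping `product_dim` for `store_dim` in the roll-up gives top stores instead of top products, which is exactly the "change the dimension, ask the same question" flexibility described above.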
In particular scenarios, a couple of examples might be, let me find my examples here: one year of sales data for the northeast region only, or the data for your largest customer only. It breaks it down so you can get much faster responses. You don't need to see everything; you just need to see a specific section of the picture. And if your requirements are well defined, then this is a good area to go look at. It's a bad area to go look at if your requirements are ill defined. Quite true. The third model, with just three minutes left here, is the Data Vault style. Dan Linstedt really gets credit for it, although Bill Inmon has absolutely endorsed this as well. The key here is that the relationships allow you to design a more flexible area. And again, I'm going to go to the graphic slide that we've got here that shows the whole thing, which is to say that we have a staging area, but out of that staging area we can then set things up to go into an enterprise data warehouse, from which we can generate star schemas, data marts, in this case we're just showing a few marts on here, or reporting architectures that come off of this. The staging area just tends to be a scratch-pad location where you can pull your data together for simple processing. After you're done, you can pretty much trash it and wait for the next set of data to come in. Of course, nobody ever trashes anything. Nobody ever trashes anything; the Internet proves that. The Internet proves that. There are some hybridized structures that you can use in here as well. We have not covered things like surrogate keys and systems of record and all that sort of stuff.
There's a lot more on this topic that you can dive into, but what we really wanted to do was let you get a sense for why these data structures are important. Because if you're trying to build an absolutely, historically, perfectly accurate enterprise data warehouse, it may be that the star schema is not appropriate for that. Or vice versa: if you're trying to get to some quick reporting, the enterprise data warehouse may be too big a bite to chew. Yeah, with the enterprise data warehouse, people tend to think you have to have one database, and it grows to enormous proportions, and that actually doesn't work much of the time. Eventually they get so big that they fall over, don't they? Yeah, I've seen it where the actual hardware is no longer capable of servicing it. Oh my goodness. Even at Walmart I never saw that. Well, when you're working with smaller companies, that happens a lot. That happens a lot. So just to summarize: again, there are data stores that somebody may call a master data store. We won't really define that here; we've got a whole webinar we do on that one. But there are different audiences that these are going to serve. If you're doing online transaction processing, the needs are different. The whole point of this discussion is to say that you can't just take one data structure and make it work for all of your needs. It won't work. And if you do want to employ some of these things that are at the top of the data pyramid, you do need to spend time and find some expertise that can help you get to the point of understanding which of these things you want to put together. Why? Because it's all related to strategy. If you can't tell us what the strategy is, we can't help you with the data design. And we're right at the top of the hour, so Shannon will bring in some questions. Thank you, Peter and Tom, for this great presentation, as always.
And just a reminder, to answer the most commonly asked questions: I will be sending a follow-up email by end of day Thursday with links to the slides, links to the recording, and anything else requested throughout the webinar. Now, diving right into it: is it possible to make various blobs of unstructured data useful and be able to link them? Well, I think, first of all, if you don't ask the question, the answer is always no. It's kind of a blobby question, isn't it? It's unstructured data. There's no real way to pull it into one form unless you start throwing classifications at it. Hashtags are a very common example. When Twitter started hashtagging things, it was able to key in on those hashtags, even though there's no easy way to say what every single solitary tweet is about. So they let the people who write them decide. And I'm back here on this slide, Tom, where you made the point that very often people say they are turning unstructured data into structured data, which really doesn't happen; it's really more about that packaging, that tagging, that you were just describing. Exactly, because somewhere along the line some human has to eyeball what they're looking at and classify it. But you really haven't done anything with the data itself. All you've done is put a wrapper around it, and your actions are based on that wrapper. It probably made the data more valuable, which is really what we're trying to do in the first place. It is more data; it is the metadata that describes your unstructured data. You've got to be careful. I've got a lecture I do where I say there's no such thing as metadata; it's all data, right? It is all data. That makes people's heads explode. We won't say that on this one. Thanks, Shannon. Great question. To slide 41: slide 41 is about monolith versus microservices, correct? On slide 41, I don't know if that's the intent we had, but it certainly could be used to talk about that a little bit.
So let's define both terms. Monolithic is going to be one sort of thing: plug into the electrical grid and you get electricity. Microservices, on the other hand, are a variant of SOA, which many people remember. It's the idea that I can get names and addresses within a work group, or maybe in a department, or maybe, if I get really good, across the entire enterprise. Is there a question there? She said this was about microservices versus monolithic. I thought it was more about monolithic data versus services. I can try that too. I'll write back and ask for clarification. Many years ago, in my database class in college, they talked about how there's a cyclical nature to data storage: it tends to cycle between centralization and diversification, where it goes out to a variety of different sources. We've seen this over time with the Internet in many ways. It came back to the big data centers, then it moved back out to the cloud, and then it came back again. We've seen it with things like Napster, and now all of a sudden it's back to Apple iTunes and that sort of thing. This is going to happen continuously, and we're at the point where it's starting to move back outwards again. What this is really talking about is: do we want to pull everything together into one big centralized location? In some cases, maybe. As the price of the outside forces starts to climb, then centralization starts to look good again. Such is the contrary nature of software development: whatever is good this year, somebody is going to find a way to make the other approach better. Excellent. Very interesting. All right, Shannon. Thank you for describing third normal form and why one would use it. Can you also mention why someone would look to implement fourth and fifth normal forms? Let's go back and make sure. I don't know that we actually tried to define third normal form, but I'm glad you found a definition.
Trying to normalize data is sort of an interesting oxymoron, because in normalizing the data you end up with more data tables, and that's counterintuitive to people, but you're actually simplifying your data structure. As you move into fourth and fifth normal forms, you're starting to do what's known as abstraction. You might have an address structure, but everything, or many things, have addresses too, such as customers, vendors, employees, the company itself. When you start abstracting, your address structure starts to be used across all of those. Well, if you have one table of addresses, now you need to know which table that address applies to. Is it applying to your customers table, or is it applying to your employees table? And when you start abstracting to that level, it starts to get really complex very fast. So we like to have analysis done, and oftentimes data represented, at third normal form, because by definition that is the most flexible and most adaptable form for a data structure to be in. It means that all of the attributes in that data structure are fully dependent on the entire key. The key, the whole key, and nothing but the key, so help me Codd. Did you ever get that one? Yes, I've actually heard that one. Be glad. But what that means is that you've modeled the data in a way that you understand it fully. You may not want to implement that design, because of price, quality, speed. Going out to the higher levels of abstraction actually requires higher levels of expertise as well, which means you're starting to increase your labor costs. Third normal form is good when you want to minimize the amount of redundant data, and that brings everything to a far higher level of efficiency. When you start going into the layers of abstraction, you're not just making your data efficient; now you're making your actual architecture more efficient. It works well if you can design it and keep it going, but it's complex. It's complex.
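To make that abstraction point concrete, here's a minimal sketch of a single shared address table with an owner-type discriminator. All table and column names here are hypothetical, invented for illustration, not from the webinar slides; the point it demonstrates is that once addresses are abstracted out, every retrieval has to say which table the address applies to.

```python
import sqlite3

# One abstracted address table shared by customers and employees,
# with a discriminator column saying which table each row belongs to.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE employee (employee_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE address (
    address_id INTEGER PRIMARY KEY,
    owner_type TEXT    NOT NULL,  -- 'customer' or 'employee'
    owner_id   INTEGER NOT NULL,
    street     TEXT,
    city       TEXT
);
""")
conn.execute("INSERT INTO customer VALUES (1, 'Acme Corp')")
conn.execute(
    "INSERT INTO address VALUES (1, 'customer', 1, '1 Main St', 'Richmond')"
)

# Every query against the shared table must carry the extra join logic
# asking "whose address is this?" -- that is the added complexity.
row = conn.execute("""
    SELECT c.name, a.street
    FROM address a
    JOIN customer c
      ON a.owner_type = 'customer' AND a.owner_id = c.customer_id
""").fetchone()
print(row)  # ('Acme Corp', '1 Main St')
```

The trade-off Tom describes shows up directly: the design removes redundant address columns, but the `owner_type`/`owner_id` pair pushes the bookkeeping into every join.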
What we like to recommend to organizations and the partners that we end up working with is that if you get data in third normal form and there are still some questions about it, moving the analysis into fourth, fifth, and what we call business normal form can help to drive out certain types of business rules that exist but that you wouldn't see in a third normal form model. But the idea of implementing some of those data structures is beyond crazy in most cases. This can get rather long. We might want to do a series on just normalization, because I think an hour on normalization would be really valuable, to have a good explanation out there for somebody. I say that because many of the courses that teach this stuff don't do it very well. No, they don't. Thanks, Shannon. Where would you place data lakes in your summary chart? That's a great question. We avoided that question knowing it was going to come up, but it's a really good one. Here's the typical topology that most people think of conceptually. There's now a new movement. Again, I'll call out Bill Inmon, because he's also got the best book on data lakes that's out there. Many people are now building another structure in parallel to these, so I want you to picture another line underneath the green line and the yellow line: a blue line, blue for water, for the data lake. They're saying, look, we can just pour everything into this data lake and it will actually work. What we're seeing is that data lakes have a use, and there's a really good use for them, which is kind of like a double check. Let's say you've got systems and data and things that you're thinking about, and let's pretend that you're the analyst at the end of the spectrum that we showed you. You come in to the boss and say, boss, I found a pattern in the data. I think there's something here: when it's raining, we sell more umbrellas. Bad example, but we'll use it. The boss says, okay, well, that sounds like a good idea.
Are you really sure about this? We double-checked our data a bunch of times. It's a great place to go and look for confounding variables. What you could do there is go out to the data lake and actually ask somebody to do a contrary analysis. Find some things out there that show whether we're going off in the wrong direction; try to prove the null hypothesis. This is really what it comes down to. A data lake has a variety of definitions, but most people are thinking of it in terms of big data technologies, which is something that we can objectively identify. You're going to be using parallelism. You're going to be using lots of streams. You're probably not going to be paying payroll from your data lake. But if you're doing investigations at that analytical end of things, combining your warehousing data with some data in your data lake can tell you whether building that capability into your data warehouse will be a worthwhile exercise. We've also used data lakes to help build enterprise data warehouses where the requirements are very, very squishy. People are saying they know they need to move forward, but they don't have time to get the requirements down. So just build a something. That's a very good idea. Yeah, that tends to be a situation where they don't really know where they want to go. They've just kind of heard they have to go that way. I want one. Yeah. I had an insurance company executive once who told me he wanted telematics data. And I asked the question: what's happening in the insurance industry that you need to have this up-to-the-instant type of thing? It turned out, nothing. His buddy in another industry had one, and that's why he wanted one too. So many buzzwords in this field, right, Shannon? Just a few. So there are a lot of great questions coming in. Just keep them coming in the Q&A section in the bottom right-hand corner. We've got plenty of time here: 20 minutes left for the Q&A portion.
And there's a question on there about analytical data structures in Agile. But before I get to that, while we're on the topic of where things fit in: where do you see NoSQL databases fitting in? So again, it's sort of the same type of thing. You may use NoSQL databases as a way of implementing the data lake. There are also other uses for NoSQL besides just vast data analysis types of questions. Again, there are certain types of processing that are going to work there. Let me go back to my five types of database slides and then talk about how the big data technologies fit in here. Each of these data structures needs to have some level of rigor around it. You can't put a numeric field inside something that's labeled characters only. You can't put characters inside something that's labeled numbers only. You can't join a relational database table to something that doesn't have the appropriate foreign key structures. So for all of these things, you have to know a bunch about the data before you can actually design this. That's been a source of frustration for years, because I hope one of the things Tom and I have pointed out today is that there is not a lot of expertise outside of our community, even in IT, that knows how to recommend these types of data structures. Consequently, when you have this known, you can then design the right type of thing for it. But we were just talking about an example where somebody says, I don't exactly know what I need, but I know I need something. That's where a NoSQL database might actually end up being a very good fit. But I would look at that more as a prototyping activity designed to show people things and ask, is this it? No, that's not it, but if you made that a little bit more blue and this one a little bit rounder, that would be closer. So then you go away and continue with this iterative development process. Anything you want to add to that?
Well, first I'd ask the question: did he say NoSQL databases or SQL Server databases? I think the question was NoSQL databases. NoSQL databases are good as a storage location for your unstructured data: pulling in all your PDF documents, your artwork for some of your products, that sort of thing. Again, you'd have to tag it with certain pieces of information in order to build some form of container that you could draw upon, so you could take a look at the relevant pieces of information that you want to pull together. I was recently working with a client who had a lot of unstructured data, and in particular they did have a lot of artwork. They routinely pulled their artwork from a third-party vendor, slapped some tags on it, and then promptly shoved it into a database. And if they needed that artwork again, they'd call out to that third party again and have to reclassify the data again. So it was rather inefficient. They're looking for some sort of solution to pull it all together so that they don't have to do that over and over and over again, particularly because they're under government supervision. Those were our taxpayer dollars at work. Yeah, and the government supervision requires them to provide information in many ways. And yeah, they wanted to get more efficient, so it wasn't taking them several weeks to pull the data together. They wanted to get it down to a couple of hours. So I'm going to say something that sounds like I'm insulting Tom, but I'm going to say we both have flat foreheads from running into walls. We've banged our heads against the table enough times. On that project that you just described, were you able to successfully implement what they wanted? Because that's clearly a better solution.
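The tag-and-retrieve pattern Tom describes, where retrieval works against a metadata wrapper rather than the artwork itself, can be sketched in a few lines. The file names and tags below are invented for illustration; a real system would keep this index in the NoSQL store alongside the artifacts.

```python
from collections import defaultdict

# Unstructured artifacts (artwork, PDFs) wrapped in tag metadata.
# Nothing here touches the file contents: only the wrapper is queried.
documents = {
    "art_001.png": {"tags": {"logo", "2017", "print"}},
    "art_002.png": {"tags": {"banner", "web"}},
    "report.pdf":  {"tags": {"2017", "compliance"}},
}

# Build an inverted index from tag -> document names, so the same
# classification work never has to be repeated for a later request.
index = defaultdict(set)
for name, meta in documents.items():
    for tag in meta["tags"]:
        index[tag].add(name)

# "Everything tagged 2017" comes straight from the wrapper.
print(sorted(index["2017"]))  # ['art_001.png', 'report.pdf']
```

Persisting an index like this is what turns a weeks-long re-collection exercise into a lookup: the human classification happens once, and every later retrieval reuses it.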
The requirements for that were actually to pull everything together and give a recommendation that would help them achieve the goal; the actual implementation was not part of our scope. But it was nice to be asked. This is actually the type of work that is really near and dear to our hearts: when we can make an improvement not just in the data, but in the business or the organization that's affected by the data. Correct. In this particular scenario, they were spending literally hundreds of man-hours to pull everything together. Another one of the pieces of unstructured information was magazine articles and website articles. And of course, when you have to reproduce the same results a year later, those websites might not exist. So you have to pull them in and store them so that you have them at all times. And that's very unstructured data. You don't know what each individual article is really discussing unless you tag it with some relevant piece of information. So really, the takeaway from that question, and it's a great question, is to look at your topology. And I am not suggesting this is the topology that you should use, but whatever your topology happens to be, ask where these things fit in it. Take that particular topology that we have and say: where in here could less structured data processing benefit the organization? That's the role you should look for for your data lakes, your NoSQL databases, et cetera. And then look at that as an adjunct to what you're already doing, because whatever you've built cost you a bunch of money to get there. And there's no point in throwing it all out and saying we're going to start all over again, because nobody ever gets a chance to do that. You know, let's just stop business processing for three months while we rebuild all of our IT systems. And Google will be back online in three months, right? Yeah.
That's why a lot of times you may see that applications are built with the intention of having three-to-five-year life spans, and you find out it's actually closer to eight to ten years. Or maybe longer. That's why we got in trouble with Y2K, and it's why we're headed for this next bug that you were talking about. Two billion. That bug is not actually a bug. It's a design issue. When you design your database, you have to say, well, I expect to have X number of rows in this table. And if you say, well, we should never get above two billion, and then you find out, no, it's actually going to go close to ten billion, well, then you have to accommodate that. You have a problem. You have a problem, and not many people can help out. And the worst part is, it's usually the primary key field we were just talking about. And when that happens, you have to reassess many, many tables. And it is a monster problem. So I've got one for you. I'm working with another group now, and they've decided their customer number needs to be expanded to a larger number. And they've counted 140,000 systems that they're going to have to go back and inspect. Yeah, that's going to be a few years' worth of work. A few years. That's exactly right. Great question, Shannon. So, what are the best ways to build out analytical data structures in an agile manner? That's a fantastic question. I really like that one. Agile, of course, meaning that you are progressing step by step and checking constantly to make sure you're headed in the right direction, as opposed to just building faster, or the waterfall method for software development, where you just keep going with the requirements until you're done. Oh, look, we forgot a requirement. Well, too late; it's going to have to be in version two, which may or may not actually occur if version one is not successful.
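The "two billion" figure comes from the range of a signed 32-bit integer, a very common choice for a primary key column. A quick sketch of the arithmetic behind the design issue:

```python
# A signed 32-bit integer key tops out at 2**31 - 1 rows.
# Beyond that, new key values no longer fit, which is why widening a
# primary key forces a migration across every table and system that
# references it.
INT32_MAX = 2**31 - 1
print(INT32_MAX)  # 2147483647, roughly 2.1 billion

def fits_int32(key: int) -> bool:
    """Would this key value still fit in a signed 32-bit column?"""
    return -2**31 <= key <= INT32_MAX

print(fits_int32(2_000_000_000))   # True: still under the ceiling
print(fits_int32(10_000_000_000))  # False: time for a 64-bit key
```

The migration cost Peter mentions, 140,000 systems to inspect, follows directly: every foreign key, index, and interface that assumed the 32-bit width has to be found and widened together.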
Yeah, your agile systems and processes are a good continual checkpoint to make sure you're on the right path. In terms of setting this up for... what was the last part of that? I think the question had to do with analytical. Analytical. One of the most common problems, particularly in small companies, is they don't have sufficient funds to stand up numerous server systems. So as a result, you may find that your analytical systems get put on the exact same hardware as your transaction processing. Could that be a problem? Oh, yes. I've seen it before where one query on the analytical side brings the entire server to its knees, stopping all the transaction processing in its tracks. This is a huge problem, and it's a known problem. All you have to do is separate them so they're not physically attached to each other, and that requires separate hardware. Unfortunately, that's expensive. Again, a great question. Let's take it another step, though, which is when you're trying to build these systems, which is really where most of us are involved. People say to us, okay, if I need an operational system, they can define some very precise requirements, and we can look at the result and say, you were supposed to build this and you built it. It looks like it works. We're great. Thank you very much. And we actually came in under budget, on schedule, with full planned functionality. That happens in one in three IT projects. Really high success rate, yes. One in three. The problem with most analytical projects is that the best outcome from an analytical project is another question. If it's done correctly, yes. You should recognize it will generate many more questions than answers. And this drives IT nuts, because they're saying, just tell me what it is you want me to build. And analytics is not that. I don't want to build anything; I need to answer questions.
And when I find the answer to those questions, I'm going to come back and ask something else. So the agile cycle makes perfectly good sense. As you said, properly implemented. Many people think agile is just about doing it faster. No, not at all. Correctly implemented, it means checking multiple times along the way and incrementally building out new capabilities. That's what agile is all about. And pivoting when you're not going in the right direction. Stop right away and pull something out. I've always said, if you're in the middle of a software development exercise and you discover a data requirement is wrong, you need to stop spending money on software right there. You've got to go back, make sure the data structures are stable, and then come forward again. So given that an analytical process is more about this, when the customer comes to us and says, hey, can you build me one of these, we have an interesting conversation with them, because we know we're not going to build them something that simply gives them the answer. What we're trying to do is talk to them about developing organizational capabilities so that they can become more data-science-like, so that they can do more self-service, so they can do more things on their own. We teach them how to fish instead of serving them that nice fish sandwich. And in that agile methodology, you have to realize you're not going to stop at any point. As those questions become more prevalent, you answer one and it generates two more questions; you get another answer, and it's two more. That's a cycle, that's a process. And if you don't recognize that, you're not going to build your system to be continually processing that way. And you've got to develop the expertise of the people on the other side and show them how to do that as well. Right. I'm not going to answer one question. I'm going to spend the rest of my career answering questions. It's good lifetime employment, but...
And if you do it right, then you can actually build it so that they can do it on their own. That's the real beauty of it. I hope we answered the question, Shannon. It was a really good one. We'll definitely look for an upcoming further deep dive into that question. I don't see anything coming up yet. Do you have any suggestions for metadata standards for unstructured data? It would seem to be an oxymoron just based on your description there, right? With unstructured data, I always call it simple homogenization. Yes, you have a lot of documents, but all the documents are going to have the same general characteristics. What's the title? Where is it located, so that you can go retrieve it? And before you know it, as you take a look at general articles: what's the timeframe it was pulled in? What's the timeframe it's relative to? What are the topics it discusses? These generally become a little more obvious as you do it a few thousand times. And when you structure it according to that, I think it becomes a little more manageable, but it's never going to be one-size-fits-all. It's not a turning-water-into-wine kind of thing. That just doesn't happen, but we can put labels on the bottles. We can put the bottles on shelves. We can keep the shelves at the right cooling temperature so that it'll be at the right... Sorry, we were talking about wine at three o'clock in the afternoon here. As I keep telling my wife, the more I drink wine, the more I tell myself I should drink more wine. There you go. That's all the questions I see coming in currently. I'll give everyone just a minute here. Just a reminder: I'll send a follow-up email by end of day Thursday with links to the slides and links to the recording of this presentation. Oh, we do have another question coming in here. I would like to take the last statement from that answer. Can you answer a follow-up question: if it was done right, how do you do it right in Agile?
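Tom's list of "same general characteristics" for every document wrapper can be captured as a small record type. The field names and the sample values below are hypothetical, chosen only to mirror the questions he lists (title, location, capture time, relevant timeframe, topics):

```python
from dataclasses import dataclass, field

@dataclass
class DocumentMetadata:
    """Minimal metadata wrapper for one unstructured document."""
    title: str                   # what is it?
    location: str                # where can we go retrieve it?
    captured_on: str             # when was it pulled in?
    relevant_period: str         # what timeframe is it about?
    topics: list = field(default_factory=list)  # what does it discuss?

# A hypothetical archived article (path and values are illustrative).
doc = DocumentMetadata(
    title="Umbrella sales and rainfall",
    location="archive/articles/umbrellas.html",
    captured_on="2017-11-14",
    relevant_period="2016",
    topics=["retail", "weather"],
)
print(doc.topics)  # ['retail', 'weather']
```

As Tom says, the schema is never one-size-fits-all; the value is that every document in the collection answers the same handful of questions, so retrieval can work against those answers.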
Maybe we need a deep dive to talk about specific backlog items and sprint planning. So, a great question. There are a number of books. I think I wrote the foreword to one of Larissa Moss's books on agile data warehousing methods, and it talks about this in more detail. I think the way Tom was describing it, though, really the key words were iteration and a lot of checking. Yeah. And if my dentist told me I had a one-in-three chance of him pulling my tooth out successfully, I would go find another dentist. For some reason, we allow those standards in IT. One of the things about Agile that I have always liked is that it talks about the minimum viable product. It is that one thing that you need to produce that the customers want, but that's literally all it is. It is the simplest of simple applications, or in this case the simplest of simple data warehouses. You need to have some data there that they can pull out, but it is extremely simple data. And it's something you can arrive at fairly quickly. And if it turns out you put in data that nobody wanted, well, then you weren't really checking to make sure you were headed in the right direction, were you? So as you continually build, you check in with the customers, keep checking in, keep checking in. And as they begin to understand it, they will obviously start to use it. And you mentioned the one key driver is that you've got a business user on the other end who's looking for some specific answers. Yes, it could be very niche. It could be very simple. It may be something that is not widely used, but at least it's something that is being used. The key for all of this is to learn to exercise those muscles and this capability that we don't often get to use. How many times in your career do you get to design a really significant database? Well, a typical IT worker might get two or three. I'm going to guess you probably do two or three a year. If I'm lucky, yeah. Exactly.
So your level of expertise at doing this is clearly an order of magnitude over the standard person, even one who's actually paid attention and tried to do a good job at it. Yeah. The old days of working from one set of requirements, going off and doing all the development, and coming back with the end product, that just doesn't work anymore. That's what leads to only one in three projects being successful. I was going to say, are you nostalgic? Do you want to go back? No, I would not like to go back. We had our problems back then, too. We'll tell you about it someday, Shannon. I love it. All right. Well, that is all the questions that we have coming in for today. Peter, thank you so much as always. Tom, thanks for joining us. Great pleasure to have you on board with us this month. And thanks to all of our attendees for being so engaged in everything we do. We just love all the questions that come in, all the great engagement with these webinars. I'll put a note in there to let people know, Peter, we're looking at 2018. There was a big thumbs up on a session for normalization. And if anybody has additional suggestions, you can just reply to the follow-up email or email me at Shannon at Dativersity.net. I hope everyone has a great day. Thanks to all. All right. Shannon, thank you.