Hello and welcome. My name is Shannon Kemper and I'm the Chief Digital Manager for DATAVERSITY. We'd like to thank you for joining today's DATAVERSITY webinar, "Data Modeling Is Fundamental." It is the latest installment in a monthly series called Data-Ed Online with Dr. Peter Aiken, brought to you in partnership with Data Blueprint. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A in the bottom right-hand corner of your screen. Or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag #dataed. If you'd like to chat with us or with each other, we certainly encourage you to do so; just click the chat icon in the bottom middle for that feature. And to continue the conversation and networking after the webinar, just go to community.dataversity.net. To answer the most commonly asked questions: as always, we will send a follow-up email to all registrants within two business days containing links to the slides. And yes, we are recording, and we will likewise send a link to the recording of the session as well as any additional information requested throughout the webinar. Now, let me introduce our speaker for today. Peter Aiken is an internationally recognized data management thought leader. Many of you already know him or have seen him at conferences worldwide. He has more than 30 years of experience and has received many awards for his outstanding contributions to the profession. Peter is also the founding director of Data Blueprint. He has written dozens of articles and 11 books, the most recent on data strategy. Peter has experience with more than 500 data management practices in 20 countries and is consistently named a top data management expert. Some of the most important and largest organizations in the world have sought out his and Data Blueprint's expertise.
Peter has spent multi-year immersions with groups as diverse as the U.S. Department of Defense, Deutsche Bank, Nokia, Wells Fargo, the Commonwealth of Virginia, and Walmart. And with that, let me turn everything over to Peter to get today's webinar started.

Hi, Shannon, and welcome to everybody. Shannon, thanks as always for hosting us this week; I look forward to working with you all the way around. Let's just jump right in. The title of this program is Data Modeling Is Fundamental, and that is really the contextual key to this. I used to call this Data Modeling Fundamentals, but people weren't quite getting it, so we try to improve these each time and get a little better with it. I'm going to start out today with a bit of contextual overview, just to tell you about the newness of the profession and how it is relatively misunderstood. We'll talk about some motivations, and in particular the idea that most of IT is integration of systems and various system components, and that the data portions of this are simply not well understood. Then we'll dive in to look specifically at what data modeling is and why we do it, understanding that data modeling presents the shared understanding of our fundamental, foundational system characteristics. There's stuff that's really core and there's stuff that's peripheral, and data is core to this. These data models are shared between systems and humans. Then we'll talk about some very specific concepts about how some of this data modeling can be done even better. In particular, we'll look at the power of the purpose statement. We'll try to understand a little bit more about what it means to be data thinking, thinking centrally about our data. We'll look at how data modeling complements the architecture and engineering techniques. And finally, the challenges that exist beyond the data modeling.
We'll finish as always with some takeaways, references, and, as Shannon mentioned, the Q&A, which we'll get started right at about the one-hour point. So let's jump in and look at the contextual overview.

First of all, what is the world's oldest profession? The answer is accounting. I can make that joke all the time, and it's true, because I'm married to an accountant. Now, the reason I say that is because accounting has literally had 8,000 years to get its act together, and it has its act together pretty well. There are things out there called generally accepted accounting principles, and if you're following GAAP, you know you're following a certain way of doing things. About 20% of the companies in the United States do not follow GAAP, and the first question anybody asks when they find out about this is: why not? It's a reasonable question, because if 80% of the companies are using GAAP and 20% are not, there may be something up there.

Our profession only dates itself back to really the beginning of the 19th century. Our heroine, if you will, is Ada King, the Countess of Lovelace, and there's a running joke that she really invented the Gartner hype cycle curve. The point I like to make is that it's appropriate for us to acknowledge that we're not as mature in our discipline as accountants are in theirs. But it's not okay for us to stay that way, and this is really the purpose behind these educational offerings that we do out of the university here.

So let's dive in and start talking about: if data modeling is a part of data management, what's data management? Over the years, we've generally defined data management as what happens between when we get data from a source and when it's used; anything in between is data management. Now, we've gotten a little bit better about our differentiation here.
We've said that there are sort of three things, or three areas, that go into it: data engineering, data storage, and data delivery. And around that we need to put in some data governance to make sure that we actually have some guidance over the process, because, of course, if it's done without guidance, there can never be a wrong answer, and we know that there are definitely some wrong answers out there. It also means that we need some specialized team skills, and that you need to be professionally certified in some of these areas, because some of them are taught incorrectly and some of them are simply not taught at all.

Now, even this diagram is insufficient, and I say that fully criticizing one of my own diagrams, because it doesn't really depict what happens with data under the best circumstances, which is not simply that data is used. This looks like a production function. Some people look at this and say, ah, this is why we should call data the new oil. I don't like that term at all, and the reason is that it doesn't well depict data reuse. The diagram I'm showing you still does not well depict data reuse. So I tried another version here by putting the data in the middle of it, again with the specialized team skills. What it really says is that everybody in your organization is doing data to some extent or another, and if we don't understand how that data is being used, we have no ability to improve the overall process. So this is really what we'd like to talk about in terms of data management.

Now, let's go a little further. Another contextually important thing from a data management perspective is that data is a lot like Maslow's hierarchy of needs. Everybody in the world apparently learned about this in high school, and that's great. Maslow made a very good case that if you had food, clothing, and shelter issues, you were never going to be safe.
And if you were never going to be safe, you would never be part of something larger than yourself. And if you're not able to differentiate yourself from the larger whole, then you will never have self-esteem, and you'll never get to self-actualization. So this is a tree of necessary but insufficient conditions for moving your way up the ladder. Data is an awful lot like that, because everything we see advertised in data is in this golden triangle here: master data management and the like. If I were going to update this chart, which I probably won't bother to because it's so much fun as it is, you would end up with things like blockchain and other things in here as well. Everybody wants to try those things without understanding that they're just the tip of the iceberg, and that there are a lot of foundational practices that need to occur underneath, starting with the idea that you have a strategy that shows you what governance, quality, platform, architecture, and operational activities you need to have. Notice also that the things in the gold area are technology focused, whereas the things that are foundational are capabilities of your organization. Too many organizations that we see try to always start with the technology. We argue instead that technology should be the last thing that you do, and I'll show you some things about data modeling that help you out in that area quite a bit.

Now, the last thing that people get from this diagram is: oh, okay, this is great, and Peter, I know you've heard me say this before, but can you do it all faster for me? And the answer is, of course, yes, we could absolutely do data faster. But if you do it faster, it'll take longer, it'll cost more, it'll deliver less, and it'll present greater risk than if instead you do it in a crawl, walk, run type of approach.
Something else that's even a little more scary: I saw this on LinkedIn just the other day, posted as a recent technology revelation, which was a bit scary in itself. The revelation I'm showing you here on the screen is that if you put chocolate ice cream into anything awesome, you're still going to get chocolate ice cream out the other side of it. The interesting piece on LinkedIn was that the individual said, well, that is true with blockchain and it's true without blockchain. And my only question was: this is a recent revelation? I'm just assuming the person who wrote that is very young, because if they didn't understand garbage in, garbage out before, they certainly are going to understand it well after they put in a really good, awesome blockchain and then load it full of really, really bad data. Garbage in, garbage out is one of the most important concepts in this area. It doesn't matter what you've got in the middle: it could be a perfect model, a data warehouse, machine learning, business intelligence, blockchain, AI, MDM, governance, analytics. It doesn't matter what you call it. If you've got garbage data, you're going to have garbage results, and the only way you can improve that is by improving the quality of your data.

Notice that there's another piece that happens here as well. If you don't do this type of process first, you lose the opportunity to share data, and to learn that two of those rings I've circled at the top and the bottom there could be the same data chain. We only have to maintain one data chain in order to feed both our intelligence and blockchain pieces and our AI and technology pieces. So we've simplified that piece overall, and we're really trying, of course, to just come up with better results. Well, again, lots and lots of words around all that.
But what it really comes down to is understanding, from the Data Management Body of Knowledge, which has only been out since 2009, that architecture and modeling are critical pieces. Now, as Shannon mentioned in that little prelude, we have an online architecture conference next week that should be a lot of fun. Today we're going to talk about data modeling, which is that data development area. This is the roadmap, if you will, of the activities and things we're going to be talking about today. I do not expect you to read this, but do hang on to it; it's a very good piece, even though it doesn't actually exist anymore as a separate component.

So data modeling is really pretty fundamental, and what we're trying to do is understand it from an architecture perspective. First of all, architecture is there whether you'd like to have it or not. Anything that is reasonably complex is going to have an architecture. The question is: do you understand it, and do you have it documented? Because if it's not understood and it's not documented, it cannot be useful to you, and that is a hugely important piece. Data architecture operates at higher levels of abstraction; it is further up the food chain, and it is an integration-focused activity. The data models are more downward facing; they are detail or implementation focused. And the models are literally, literally, the translation between people and systems.

So let's take a look at how that works. If we're trying to express components as an architecture, the details are organized into larger components. We don't look at a door and say, that's a doorknob and a hinge and a piece of wood; we talk about it as a door. That's details organized into larger components, and there's a lot of intricacy that can occur in this. Those larger components can then be organized into larger models.
The door can be part of a room, part of a house, part of something else, and now you're introducing a concept called dependencies. You can't put a roof on a house if the house doesn't have any walls. And finally, those models are organized into architectures that are comprised of these various architectural components. This is where the word purposefulness comes in: if we don't have a purpose, then any answer is the correct answer. What we're attempting to do, of course, is express these data models as intricate pieces that show the dependencies and optionalities that we need to have and the purposefulness of what we're trying to do.

The small things that we organize are called attributes, and we organize attributes into entities and objects. Attributes describe the characteristics of things that someone cares to keep information about; examples might be color, size, sequence, media code, product description, and so on. The entities are organized into models; they are combinations of attributes and entities that represent information requirements. These entities are roughly equivalent to objects; there's a good equivalency there. They are the things about which information is managed in support of strategy. So if our strategy involves customers, then having customer information would be important. The entities also show how they interact with each other through the relationships that we need to have in here. Poorly structured data constrains organizational data delivery capabilities: if it takes longer to get the information out, people will either not do it or do it poorly. So again, from an entity perspective, these may be persons, places, or things; those are easy ways to think about them. And finally, the models are themselves organized into architectures. When building new systems, architectures are used to plan the development. More often, however, we're not building new systems.
Our worlds are increasingly package oriented, and it's still absolutely critical to understand the architectures of the things that you're buying, because if you don't understand how they work, you can't make use of them to support your strategy.

Let's dive in a little further. Attributes. Here's an attribute: Club ID. Wow, that's interesting. Are we measuring clubs like golf clubs, each with its own identifier? Well, no, this is actually from an entity called Club, so we're talking about a club of people joining together. Notice how definitions would be really helpful even at this point. The existence of this attribute tells us that clubs need to be identified separately from each other. That pound sign is a shorthand modeling notation that tells us that the Club ID is a key. Club-specific information is also likely maintained. So just by looking at one attribute, we can start to learn things about it, and we can also tell that there is some concept, probably an organization, that exists above the club level. There's probably other information that we'll learn from this as well. Let's add a little more definition: current promotion, years of obligation. These are all attributes that describe instances of business things; in this case, we're trying to find out what members are part of what clubs.

I mentioned before that attributes and entities are organized into models. Here's one organization: customers are clearly related to orders, and orders are clearly related to products. Sounds great, we're data modeling, right? Well, almost. The next concept that we need is data architectures that are comprised of models. Again, I do not expect anybody to be reading this, but notice that there are color-coded sections; those may pertain to things like subject areas or entity types. And notice what's happened here: we've gotten very complicated very, very quickly.
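To make the attribute, entity, and key ideas above concrete, here is a minimal sketch in Python. The attribute names (club_id, name, current_promotion) are illustrative inventions echoing the slide, not a real schema.

```python
from dataclasses import dataclass

# A minimal sketch of the Club entity described above. The club_id
# plays the role of the "#" key attribute from the modeling notation.
@dataclass(frozen=True)
class Club:
    club_id: int               # key: uniquely identifies a club
    name: str                  # club-specific information
    current_promotion: str = ""  # another descriptive attribute

# Two instances represent different clubs because their keys differ.
sailing = Club(club_id=1, name="Sailing Club")
chess = Club(club_id=2, name="Chess Club",
             current_promotion="First month free")
```

Even this tiny structure already encodes two of the facts the speaker reads off the model: clubs are identified separately from each other, and club-specific information is maintained alongside the key.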
Unfortunately, we've done a very poor job in college and university of teaching people about this. In college and university, you get one course, and it's how to build a new database. And if there's a skill we do not need any more of on planet Earth, it is how to build more databases, for two reasons. One, it teaches people that's the only thing they need to know about data; some of them do learn some data modeling in there, but only a very small smidgen of it. And two, it teaches them that the only time they need data skills is when building new databases. Consequently, if we're not building a new database, management's thinking is: I don't need data people; that's for IT professionals. It's even worse when we look at it from a knowledge worker perspective. My definition of a knowledge worker is somebody who works with data. And yet, what do we teach them about it? Very little. And what percentage of them deal with it daily? 100%. So your knowledge workers actually need to know some of this stuff too.

Let's take a look at specifically why. Here are some numbers from various organizations that I've worked with. Just take the bottom one there: almost 30 billion queries a day. Now, one of those queries might look something like this. That's pretty complicated; it's not showing you anything specific other than the fact that there's a lot of complexity. And this organization that we were working with simply didn't understand that you could optimize your queries. Now, it may not seem like a lot to go from this complicated query to this slightly less complicated query. But if I run that query 30 billion times a day, those little bits and pieces add up to what we call death by a thousand cuts; in this case, it's death by 30 billion cuts. So the lack of data coherence is a hidden expense. Badly treated data costs organizations money.
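The slide's actual SQL isn't reproduced in the transcript, but the "death by 30 billion cuts" point can be illustrated with a hypothetical sketch: the same lookup done as a full scan of every record versus against a structure built once for fast access. All names and data here are invented.

```python
# Ten thousand toy records; ids 0..9999, each value is id * 2.
records = [{"id": i, "value": i * 2} for i in range(10_000)]

def lookup_scan(target_id):
    # Unoptimized: examines every record until it finds a match,
    # like the complicated query on the slide.
    for rec in records:
        if rec["id"] == target_id:
            return rec["value"]
    return None

# Optimized: build an index once, then each lookup is constant time.
index = {rec["id"]: rec["value"] for rec in records}

def lookup_indexed(target_id):
    return index.get(target_id)

# Both return the same answer; only the work per call differs.
assert lookup_scan(9_999) == lookup_indexed(9_999) == 19_998
```

A few microseconds saved per call is invisible on one query and decisive on 30 billion of them, which is exactly the hidden expense the speaker is pointing at.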
If your systems were not designed explicitly to work together, what are the chances that they will just happen to do so? The answer is nil, because data is the level at which they must be designed to work together. Organizations are spending 20 to 40 percent of their IT budgets evolving data: migrating it, converting it, or improving it. And as a topic, data is, quite frankly, complex and detailed, and almost anybody outside of this webinar does not want to hear you talk about it; it's just not something people want to do. Most of them, quite frankly, are unqualified; they don't have the requisite architecture and engineering backgrounds. I've already mentioned that it's taught pretty inconsistently; most of the focus is on technology first rather than the business impact, and it's not really well understood because it has to be relearned by everybody in your individual group.

I'm going to show you just a quick little video here, one of my favorite sort of things. He's playing the Star-Spangled Banner behind it, which is appropriate because it's the 4th of July. I'm going to talk over him, though, because he's really good. I want you to imagine every knowledge worker in your group learning how to do data individually, on their own. Some of them would probably come up with something as absurd as this, and that is a problem. We don't have standards for teaching people how to work with data. I like to call it making a better data sandwich. Data literacy is uneven in our organizations, the data supply is also uneven, and our use of data standards is uneven. Only by trying to engineer and architect these things together can we get them to work well. Now, this type of integration cannot happen without architecture and engineering, and quality engineering products don't happen accidentally.
Interestingly, I found that particular Deming quote at a tea farm I was at in India last year, and of course we added the word data to it; it works out very, very well. Now, I mentioned the engineering and architecture pieces. The key for this is one of my favorite objects in the world. It's on a ship out in San Diego Harbor; it's taller than I am, it has a clutch, it was built in 1942, it's cemented to the floor, and it's still in regular use today. So going back to 1942: we were losing World War II very badly, and we put 4,000 warfighters on this ship and sent them off and said, go save the world. These folks needed something every morning: breakfast. They also needed lunch and dinner, and this machine is the answer to how you make sure that 4,000 people can be fed every day for an indeterminate amount of time. I don't care how many KitchenAid mixers you have; they are wonderful things in themselves, but I'm pretty sure you cannot use one KitchenAid to make breakfast for 4,000 people. The point is that you can't architect after implementation. All of these buildings that I'm showing you here from an old BMW commercial had to have a plan in order to be done correctly in the first place. For example, if I show you this diagram of a building, most people say, oh goodness, that's not a good building. Well, it depends. If all you're going to do is add a satellite dish to the roof, this building, this architecture, this set of components may be perfectly suitable. However, if I show you this building, you can see that it has a cracked foundation and is clearly unsuitable for further investment. And what we have is the situation I call the bad data decision spiral, where business decision makers and technical decision makers are not data knowledgeable.
They make bad data decisions, which turn out to result in poor treatment of data assets and poor quality data, which leads to poor outcomes, and we've got to get out of this circle. So people don't get data right, data is not really well understood, and if we're only teaching a certain section of the people, then people don't think they need to do data modeling unless something happens.

So let's talk about how data modeling actually fits into this. Modeling is both the analysis, the process of doing the thinking, and the design method, the process of showing it, used to define and analyze data requirements. You design data structures to support your data requirements. The model is then a set of data specifications and, of course, the accompanying related diagrams that reflect the requirements and designs. Don't let anybody show you a data model unless they also show you the accompanying definitions, because it's representative of something in your environment. It employs a standard set of symbols so that you can share these models across people with diverse backgrounds, and they can literally read off the same piece of paper and understand it. Data modeling, then, is a complex process involving the interaction between people and technology, and we don't want to compromise the integrity or the security of the data in that process. A good data model accurately expresses and effectively communicates the data requirements, and the modeling approach needs to be guided by two formulas that are never taught in college. The first: the purpose of your analysis plus an understanding of your audience tell you what the deliverables should be. If I'm showing a data model to an executive, I'm going to show a very different layer of abstraction to the executives than I will to the technical people. That's appropriate and should be the way.
One of my biggest eye-opening moments ever was when I was working for a senior executive in the defense department who said, I don't care that you've got lines and drawings on those documents; you need to put some warships on there so the general officers will understand there's something useful in it for them. So we actually had a clip-art-based data model, and it was very, very useful. The second formula, then, is that the deliverables plus the resources and time tell you what the approach is going to be, and we'll get into that in just a little bit. I'm going very fast here; you'll get all these slides, and most of them are worthwhile to come back and take another look at.

So data models facilitate formalization. Once you have a data model of an existing system, it's unlikely to change. I've literally worked with some organizations over a 30-year period and can go back and show them that their data is exactly the same data they were dealing with 30 years ago, and that is very eye-opening to them, because their business changes a lot. Data models also facilitate communication; again, as I mentioned before, they're the bridge between people of different levels and types of experience as well. So you want to organize your data models so that they explain a business area, how the business area interacts with an existing application, or how a modification will or will not improve our existing operations. They are also tremendously useful for training new people. And finally, data models communicate scope, because they tell us how big, how large, what types of things are coming in, and whether something is covered. This is especially useful when you're buying new software packages.

Now, there are three types of data models. They're best articulated if you look up the Wikipedia entry, and I've got the reference here for you.
There's a conceptual model that allows people to think about what's going on in a way that speaks to the users; business users are the audience for this particular piece. There's a logical model, which hides a lot of details from users but starts to get into how things actually work. And finally, there's a physical model that we can put in place: the database as implemented in Oracle, IBM, MongoDB, or whatever else we're using. For example, if we're going to change to a new database technology, the database administrator should be able to change the structure of the database without affecting users, and that's important. It's one of the creative pieces of data modeling: you separate the data layer from the application layer.

There are, unfortunately, several families of data modeling variants. Our good friend Peter Chen came up with the original one. There's a Charlie Bachman style and a James Martin style, if you will, and then we have an information engineering style, and you can have all the holy wars you want over this. Most organizations pick one and move on with it; they all tend to accomplish generally the same thing. In most organizations, information engineering tends to be the prevalent one, but there's nothing wrong with using any of the other styles here; they all work. There are some things you have to watch out for, but just pick one and let's not have an argument about it, and particularly let's not have an argument about it in front of executives. It does not help our cause.

So that's the entities piece. Let's look at the relationships now. A relationship is a natural association between two or more entities. We talk about optionality and cardinality: for example, mandatory versus optional relationships, and the minimum and maximum occurrences from one entity to the other. Here are a couple of examples.
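As a hedged illustration of the logical-to-physical step and of separating the data layer from the application layer, here is a sketch using Python's built-in sqlite3 module. Every table, column, and function name is invented for this example; it is one possible physical realization of a Customer–Order structure, not the one from the slides.

```python
import sqlite3

# A toy physical model: one way logical Customer and Order entities
# might be implemented in an actual database engine.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
    );
""")

def orders_for(customer_id):
    # The application layer sees only this interface; the DBA can
    # restructure the physical tables underneath without affecting
    # callers, which is the separation the speaker describes.
    rows = con.execute(
        "SELECT order_id FROM customer_order WHERE customer_id = ?",
        (customer_id,))
    return [r[0] for r in rows]

con.execute("INSERT INTO customer VALUES (1, 'Acme')")
con.execute("INSERT INTO customer_order VALUES (100, 1)")
assert orders_for(1) == [100]
```

If the physical storage later moved to a different engine or layout, only the body of `orders_for` would change, never its callers.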
In this model, using the information engineering notation, a single customer can place multiple orders. However, a customer can also place zero orders. Why might that be important? Because if you define a business rule as "a customer must place an order," then you will not show up in the customer database until you've actually purchased something, which means you can't pre-market to anybody. Kind of important, kind of subtle, but really, really key. An order contains at least one or more products. A customer may place zero or more orders. A product is contained on zero or more orders. These are the ways in which you express this, and you often end up with an intricate, detailed data model that includes dependencies and purposefulness, and now you start to see how complex this domain really is.

Let's take another question here; I'll give you guys a minute. What would you think would be the relationship between a member and a club? Well, it turns out there are a bunch of different possibilities. You can have exactly one. You can have one or many. You can have one eventually, which puts a time dependency in it: you may start off with zero but eventually become one, which implies a conversion of some sort. You may have zero or many, optionally, and you may have eventually one or many. Those are a fairly limited set. Let me show you how they work. Here's a data model with a fairly rigid data structure. One employee can be associated with one person, but it's a little subtle there, so I'm going to make sure that you see it. Notice in the bottom left, where employee is, I'm going to change that piece very subtly to make it a little more restrictive: one employee can be associated with one person. After all, you wouldn't want two people functioning as one employee. Or maybe you would: what if your organization supports the concept of moonlighting?
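The cardinality rules read off the model above can also be enforced in application code. Here is a minimal sketch under the stated rules ("a customer may place zero or more orders", "an order contains at least one product"); the class and field names are invented for illustration.

```python
class Order:
    def __init__(self, order_id, products):
        # Enforce "an order contains at least one product".
        if not products:
            raise ValueError("an order must contain at least one product")
        self.order_id = order_id
        self.products = list(products)

class Customer:
    def __init__(self, name):
        self.name = name
        # Zero orders allowed: a prospect exists in the customer
        # data before purchasing, so we can pre-market to them.
        self.orders = []

prospect = Customer("Pat")      # valid with zero orders
ok = Order(1, ["widget"])       # valid: contains one product

try:
    Order(2, [])                # violates the cardinality rule
except ValueError:
    pass                        # rejected, as the model requires
```

Writing the rules down this way makes the pre-marketing point concrete: had `Customer` required at least one order, `prospect` could not exist at all.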
In that case, one person can be an employee on Mondays and Fridays, and a different person can be an employee on Tuesdays and Thursdays. Oh wow, that's interesting. Well, it depends on what your organization does. So you can see the little subtleties that occur. Let's take another example of the same thing: one employee can be associated with one position. You're not a floater employee, you're just an employee, and you may or may not occupy that position. But what if I want to do job sharing, where somebody works before lunch and somebody else works after lunch, and the system doesn't support this? If either of these two business rules is violated, it becomes very problematic, and it means somebody is going to have to do something manually, or with another system, to correct the problem that the system doesn't handle in the first place. So I've now changed the data model yet again to make it more flexible; this is back to the original way, where we can have zero or more. Again, just to make sure that you get this: one employee can be associated with one position. Notice the purple on the left-hand side; when I move the rest of this into play, the purple on the left-hand side now changes to the one-eventually-or-multiple notation, and now that's a more flexible data model. These two data models can be placed side by side, but the one on the right, because it is a less flexible data model, has implications for the way you would develop software to process the data contained in it. These data structures must be specified before software development or acquisition, because if I build the software one way and the data model the other way, whatever that happens to be, it's going to be a major issue in the organization. So the key for the data model is that everybody shares the understanding, and from a dependency perspective, this data model must be specified before we write software.
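The rigid-versus-flexible contrast above can be sketched in a few lines of code. This is a hypothetical illustration, not the slide's actual structures: the rigid model allows at most one employee per position, while the flexible model resolves the relationship through an association that carries a schedule, which is what makes job sharing possible.

```python
# Rigid model: a position holds at most one employee.
rigid_position = {"position_id": 7, "employee_id": None}

def assign_rigid(position, employee_id):
    if position["employee_id"] is not None:
        # The structure itself forbids job sharing.
        raise ValueError("position already filled; no job sharing possible")
    position["employee_id"] = employee_id

assign_rigid(rigid_position, "emp-1")

# Flexible model: position and employee are linked through an
# association row that also records when each person works.
assignments = []  # rows of (position_id, employee_id, schedule)
assignments.append((7, "emp-1", "before lunch"))
assignments.append((7, "emp-2", "after lunch"))  # job sharing allowed

holders = [emp for (pos, emp, _sched) in assignments if pos == 7]
assert holders == ["emp-1", "emp-2"]
```

Attempting a second `assign_rigid` on the same position raises an error, which is exactly the situation where, in the speaker's words, somebody ends up fixing the problem manually or with another system.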
If we don't, then we will have to rewrite software, or we'll have to compromise with a less than optimal data model, which may create 30 billion unnecessary queries in a day. Again, our understanding is understanding as an architectural component. The data model is an architectural component. It's a digital blueprint. It illustrates the commonalities and interactions among the architectural components, and most importantly, it's shared by humans and computers. So when we look at what's happening out there in the world where people are doing this type of work, there's a lot that happens in the modeling world, and most people think you just do it. Turns out it's not nearly that simple. In order to understand the process of doing data modeling (remember, the title is Data Modeling is Fundamental), we first of all identify the entities. Now let's presume that we're starting from a blank screen; we have no idea, nothing's going on, right? So we start by identifying the major chunks of business things about which we are creating, reading, updating and potentially deleting information. These entities may not be perfect the first time, but we put them out there and we try to get a good idea of how all of that is going to fit together. The next thing that we do is label each of the entities. We may simply call them the noun of the thing that we're trying to describe. Let me take you on a slight deviation here. One of the reasons data modelers get in trouble is that one of the highest levels of abstraction we deal with when we are data modeling is something called a party. A party could be a person or it could be a company. So an abstraction of both of those terms is a party, and people look at us from the outside and say, oh, the data modelers are partying. Right, well, again, it's going to be kind of hard to get past that. So be careful, be aware of what you're talking about, and do let people understand that you're doing some very serious work.
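The party abstraction just mentioned can be sketched as a supertype that generalizes over people and companies. The class names here are illustrative only, not from the talk.

```python
from dataclasses import dataclass

# Illustrative sketch of the "party" abstraction: a supertype whose
# subtypes are a person and a company. Names are invented.
@dataclass
class Party:
    party_id: int
    name: str

@dataclass
class Person(Party):
    pass

@dataclass
class Organization(Party):
    pass

# Code that only cares about "party" can treat both kinds uniformly.
parties: list[Party] = [Person(1, "Peter"), Organization(2, "Data Blueprint")]
for p in parties:
    print(p.party_id, p.name, type(p).__name__)
```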
Once we've identified each of the entities, then we identify a key for each entity. A key means that we can find an instance of an individual record and be certain that it is the only one of that record that exists. I'll give you a very, very concrete example of this. I have my graduate degree from George Mason University, a very fine university in Fairfax, Virginia. I've been affiliated with it for many, many years and have very fond memories of my time at George Mason. However, in the middle of my degree program, I found out that my brother was actually set up to receive my degree. My brother is two years younger than me, he was going through the program at exactly the same time, in a different program, and his social security number was one off from mine. In addition, even though he's younger than me, his social security number is one larger than mine. So if you find my social security number, the next one up sequentially is my brother's. Now, the reason that's kind of important is because my degree had been assigned to my brother, which meant I was going to get his public policy degree, and that was not the degree I was there to study. These keys are critical, and George Mason had, at the time, a not perfect system for identifying people from the various instances of all the students: which student gets this A, and which student gets this degree. Those are kind of important things, and by the way, universities do a pretty good job of making sure that happens. We have lots of friends who spend lots of time doing that. There are things that technology could do that would be much more helpful to that, but that's a different issue. So again: identify the entities, make sure that you label them, and then make sure that each one has a key. And by a key I mean some sort of way of identifying individually my record versus my brother's record versus everybody else's records that are in the system.
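The story above boils down to uniqueness enforcement. Here is a minimal sketch (invented schema, fake identifiers, and certainly not real social security numbers) of what a key buys you.

```python
import sqlite3

# Minimal sketch of what a key guarantees: one identifier can never point
# at two different records. Schema and values are invented for illustration.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,  -- the key: uniquely identifies a person
        name       TEXT NOT NULL,
        degree     TEXT NOT NULL
    )""")
con.execute("INSERT INTO student VALUES (1001, 'Peter', 'Information Systems')")
con.execute("INSERT INTO student VALUES (1002, 'Brother', 'Public Policy')")

# Sequential, nearly identical identifiers are fine as long as every lookup
# uses the key; the degree goes to exactly one record.
row = con.execute("SELECT degree FROM student WHERE student_id = ?", (1001,)).fetchone()
print(row[0])

# The database refuses to let the same key identify a second person.
try:
    con.execute("INSERT INTO student VALUES (1001, 'Impostor', 'Public Policy')")
except sqlite3.IntegrityError:
    print("duplicate key rejected")
```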
Once you have the entities and the keys, draw a rough draft of the entity relationship diagram. How is this thing related to that thing related to that thing? The answer may be not at all, but we still need to check it out and see what we can come up with. Then we need to identify the data attributes that are involved and where those data attributes get placed. So we move those data attributes from one entity to the other to figure out where they go, and we map the data attributes to each entity. This gives us our first version of the data model. However, it's likely to be imperfect and incorrect at first, so we do a lot of work that we call validation, and that is the idea that we probably are going to change the way the data model works based on further understanding. Also, we may be improving it in ways that allow us to better understand how the model works. So I did two things in here; again, let me just walk through it: identify the entities, identify the keys, put out the first version of the model as a rough draft, get the attributes populated so we've got them all up there, and map them to each of the various entities so that every one of them has those pieces. And then watch what happens here. It's not a big shift, but it is a little fundamental shift. If I've written lots of programs that access the first version of this and then I change it, I'm going to have to go back and rewrite all those programs. Key dependency in here. And of course, we may discover there are other relationships between these entities that we have yet to specify and that need to be specified as we're moving our way through the process. In the process of doing data modeling, you're going to change some of your activities as you go through the modeling process. The first thing you're going to be doing is identifying, collecting and analyzing evidence.
Your collection activity should be greater at the beginning and your analysis activity should be greater at the end. If that's not the case, then you're still discovering new things at the end and you're simply not ready to finish. So again: more collection at first, more analysis towards the end. Secondly, there's going to be some project coordination required. Now, one of the things I love to do is walk into organizations and talk to people who are kind of skeptical about some of the things that we're doing, and I'll say, well, please write down for me the 10 people in your organization who you absolutely couldn't do without, because they are so invaluable that they're too valuable to be pulled off onto anything else. And then of course you take the list and turn around and say, those are the 10 people I need to have on this data project, because we're going to do a model and it involves their understanding of what's happening with the system, because for some reason the vendor didn't provide the documentation. Oh, I've never heard of that situation occurring. So there is a negotiation that goes on to get these subject matter experts into the room so that we can start to utilize them and make sure that they are able to participate and contribute to the modeling as we're going forward. The next area is what we call target system analysis. Now, target system analysis is key in the sense that we're going to be doing the analysis of what the data model is going to do in the context of the larger system. And again, we should be doing increasing amounts of target system analysis as we go through here, as opposed to decreasing amounts. So these are all things that you can use to see how it works. And finally, the modeling cycle. I'll tell you about this in just a minute, but it pops up here as well. We need to spend some time refining the existing model, but then we also have to come back and do what we call validation.
And any model that has not been validated is in draft status. And yet I see this happen over and over again in organizations, where they will never do the validation piece and never know that they're putting out a data model that simply is not workable. So again, what we're looking at here is that things will change during your data modeling activity: you will have less collection and more analysis, less coordination and more pure analysis time, more focus on how the system works in the context of the larger system that we've put together, and your focus should evolve over time to include validation as well as the original refinement of the model. Something else, too, that I was taught a long time ago (I think it was Graeme Simsion who told us this): don't tell them that you're modeling. Just write some stuff down, then arrange it, and then make some appropriate connections between your objects. There's nothing at all wrong with doing that. You can come back and give them a proper model later on, but just capture some stuff. Many people get scared when you're doing this. It's not something that should be scary, but it is something that we should be paying attention to. So let's move to the third section here, which is the fundamentals. Fundamentals are really kind of key. Clive Finkelstein taught me this. We tend to teach in college and university that everything in the data model should be defined. That is insufficient. I'll show you why. Each modeling activity has to have a purpose, or it's very, very difficult to tell whether you've got the right answer or not. So when we talk about the purpose of a model: data models are developed in response to organizational needs. I may need a new database. Yes, it does happen occasionally, and we're going to create one. By the way, the most common use for data models in today's environment is business intelligence and analytics. You still need a damn data model underneath it though.
Many people try to do it without one, and they fail miserably, and they wonder why they're spending so much money on their BI. Those organizational data needs become instantiated and integrated into existing and new data models, and they are called data blueprints. Haha, see, a little plug there. Those data blueprints then authorize and articulate specific information systems requirements. If we do not follow that path, we have no documentation on the data requirements of our systems. If we don't understand those requirements, we have no ability to make use of them later in our process, when we are trying to enhance, modify, change and adapt our systems as they evolve over time. Of course, we don't know any of this unless we have a feedback loop, so we need to have some very specific feedback loops that say yes, you got what you were trying to do, or no, you didn't. And the real problem with most of this is that people think, oh good, and then you're done. Now let me tell you a quick little story here. Deutsche Bank has been in the news recently, and we've done a lot of work with Deutsche Bank over the years. Sorry, we can't get you any of the President's tax returns or anything like that, but we do have some good friends at Deutsche Bank over a many-decade association with them. And the reason I'm telling you about this is because Deutsche Bank did a lot of data modeling. They had an enormous system that they used as their equity trading system, called DB Trader. And they had the system that ran, but they didn't have the data models. So one of the first things that I did for them, as a freshly minted graduate, was come out and help them reverse engineer and understand the existing system that they had there. That system, those data models, told us what the organizational needs were that were being satisfied, and helped them to redesign that system, and it was one of the best systems in the world.
I'm really curious to see who ends up with that system if Deutsche Bank is going to get rid of it. But of course, the other part of it is we're never done. These are going to change and evolve over time. And these data models will continue to evolve as these systems evolve, as our organizational needs evolve, and as our understanding and our specific requirements evolve around this. So data modeling is a practice that starts and generally doesn't stop until you no longer need an HR function in your organization. That's a separate conversation entirely, but let's get back to what I was really talking about, which is the difference between definitions and purpose statements. So say I'm in a CASE tool. This is, by the way, a new concept for millennials; I know they've not been introduced to CASE tools because we don't teach them at schools anymore, but they still exist and are still useful. We may have a concept of a bed, and somebody defines bed as something you sleep in. Okay, that's interesting. Well, the problem is it's a definition, and it doesn't really help. So one of the reasons we try to work on these things is because we need to understand specifically how we're going to use the bed. And in this case, this was a system we were designing for the Veterans Administration. These Veterans Administration systems were also going to use the bed to keep track of the patients, because, believe it or not, hospitals lose patients on a regular basis. I know you don't want to hear that, but remember, the third leading cause of death is going into the hospital, so be aware of all of these things. When you say it's something you're going to sleep in, that's a definition. But if we have a purpose, in this case, for the bed (and I'll show you that in just a second), it shows a little bit more. So the purpose statements keep us focused.
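To make the definition-versus-purpose contrast concrete, here is one hedged way to imagine it as CASE-tool-style metadata. The field names and wording are mine; only the bed example itself comes from the talk's Veterans Administration story.

```python
from dataclasses import dataclass

# Hedged sketch: a definition says what the thing is; a purpose statement
# says how the model will actually use it. Structure and wording invented.
@dataclass
class EntityDoc:
    name: str
    definition: str
    purpose: str

bed = EntityDoc(
    name="Bed",
    definition="Something a patient sleeps in.",  # true, but not actionable
    purpose=("Substructure within rooms; records which beds are in which "
             "rooms so that patients can be located and tracked."),
)

# Only the purpose statement tells a developer what the entity must support.
print(bed.purpose)
```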
Now let's look specifically at the context here between a soda and a customer. The soda is given to the customer; we understand the formal relationship between the soda and the customer. The customer is not given to the soda; the soda is given to the customer. So we can walk out the door in this case. The customer may select the soda, but it's given to... oh yes, and that's right, we've got to pay for it as well. So we understand the different characteristics here. If we go back to the bed, interestingly, the purpose statement here is that beds are a substructure within a room, a substructure of the facility. It contains information about beds within rooms. So it's not just the thing people sleep in; it also tells us what beds are in what rooms. So we walk out the door and we identify three top attributes that describe what's happening in there. And in this case, they were looking specifically at how they keep the beds tracked, and that had to do with an RFID type of situation. Or we go back to that other data model I had out there and ask: is job sharing permitted? Well, not if there is an exactly-one relationship between employee and position, because the position can get filled by zero or one but not by multiples. And that's a major, major type of issue in here. Again, this changes our perspective about bed: wow, we're not just looking at places for people to sleep, but this is the primary means we are going to use to track our patients. So here's a different definition of bed, right? A substructure within rooms, sources of information about it, a listing of the attributes that are here, in association with other data items, read in this case as: a room contains zero or many beds.
And oh, by the way, this particular model has been validated, which means we now know that this model has passed the vetting test and is in fact correct, as opposed to just an idea. Here are another couple of quick examples of how these purpose statements look. Again, this is a data modeling example from the DMBOK; it talks specifically about accounts, subscribers, charges and bills. Okay, so you can see how those relationships are set, and what this model does is set up a controlled vocabulary, so that when other people in the organization are talking about these things, they know exactly what they're talking about. You go into most organizations and they're like, well, that's production's definition of this; we're going to use a different definition. These are horrible, horrible problems for organizations. Here's another quick example of all this. Again, we have an official purpose statement, so the model codifies the official vocabulary when describing fuel, customers, and auto rental agreements. These are all the things that are up here on this. The interpretations are pretty straightforward: it's a car rental company, the rental agreement is central, and there's no direct connection between the customer and the contact. A contract must have a customer. There's nothing structural that prevents an auto from being rented to multiple customers. That was actually the problem we were solving for them here. And phone units are tied to rentals, so if we don't have a rental, we can't be borrowing a phone. Again, these are all business rules and concepts that are understood here if people understand the model, but that are not typically understood without the documentation. Here's another example: salespeople, invoices, orders, line items. Notice I'm showing again a data model with some slightly different language.
What you can tell from here, though, is that there's sales-commission-based pricing information, and that it's difficult to change a customer address, because the customer address appears on the orders, as opposed to a central place where the customer address might be changed in only one spot. The price is not included in the catalog, which means we're dealing with every customer individually. It is easy to implement variable pricing and difficult to implement standard pricing. Salesperson information is not tied directly to the order, and the salespeople sell things that are shipped quickly so they get their commission faster. Yeah, that's something that happens. Nothing prohibits a sale from having multiple salespersons in this model, and multiple invoices can be allowed for one order. Wow, what a complex situation. We can also ship things partially. So now you don't have a nice, easy way of describing what actually happens in the sale, and the database can't tell what part of the order the invoice pertains to. Again, these are problems all the way around. Here's another example: a data map for a disposition concept. Okay. And again, here's the official statement. It has some official diagnoses, and this one was kind of an interesting one. I'm just going to focus on one part of this: the rule that every admission must have a corresponding discharge. Okay. That's a hospital situation, a healthcare situation. Let's just take a look at how that works. Here's a bunch of business rules that go with it. I won't go through each of them. But an admission is associated with one and only one discharge, which means when we look at admission and we look at discharge, one of the discharges has to be death. I'm picking on hospitals today, aren't I? Yes. Death is a disposition code. Believe it or not, this engendered a discussion with the director of the hospital system, who said, I don't want death to be a disposition code.
How did we get rid of this patient? In theory, most of them are healthy and they walk out. But some of them die, and he didn't want that information in there. Well, this has implications for the way in which your data is structured in the hospital so that you can do reporting: how many people were discharged from the hospital because they died? Apparently the answer was too many. So let's look at a couple more things here as we're getting close to the top of the hour. One of them is that most organizations go about this process incorrectly. The idea is that they start out with some sort of organizational strategy, they build some IT stuff, and then they try to fix the data in there. It simply doesn't work. What we need to do is flip that, and all I've done between these two diagrams is flip IT projects and data and information in terms of where they're implemented. It may seem like a simple thing, but man, it is a bear to do this from a cultural perspective. An absolute bear. The things that you have to understand in order to do this are: data has to precede software, data structures have to precede code, shared data has to precede completed software, and data reuse has to precede reusable code. If we don't do things like this, then we end up having to do rework, scrap, and throwing things away, which is why so many IT projects fail. I've done analysis of well over 100 major IT failures, and the root cause of each and every one of them has been a data problem, a data model problem, all the way around, because data, programs, data structures, shared data and data reuse are things that are all covered by data models.
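A toy illustration of why data structures have to precede code: programs bind to the shape of the data, so restructuring the data afterwards means rewriting the programs. The record shapes here are invented for the sketch.

```python
# Toy illustration: code written against one data structure breaks the
# moment the structure changes underneath it. Shapes are invented.
record_v1 = {"employee_id": 7, "position_id": 100}

def describe(rec: dict) -> str:
    # Silently assumes the v1 shape: exactly one position per employee.
    return f"employee {rec['employee_id']} holds position {rec['position_id']}"

print(describe(record_v1))

# Later the model is made more flexible: many positions per employee...
record_v2 = {"employee_id": 7, "position_ids": [100, 101]}
try:
    describe(record_v2)  # ...and every program like this one must be rewritten
except KeyError as missing:
    print("rewrite needed; code still expects", missing)
```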
See, every organization manages different types of architectures to different degrees: your process architecture, your system architecture, your technical architecture, your data architecture. All of these things are really, really problematic, but if management thinks that all you're doing is having a bunch of meetings, it's not very interesting to management; they don't think the work that you're doing is very important. Let me continue here by showing you a particular context model for all of this. What we typically teach students is that there are three types of assets: there are your requirements assets and there are your design assets, so you do some requirements, which are conceptual in nature, and you do some design that corresponds to the logical model typically, and then you do the as-is, the physical model that we talked about earlier. Those are reasonable, and that's the way we build systems. The problem is we don't tend to do much of that anymore, and so students don't have a good understanding of how important all of these dependencies and intricacies that we've talked about are. Instead, we need to look now more at reverse engineering, where we say we don't have documentation, so we may need to recreate the data, or we may need to recreate the design, or even the requirements. As we go through, we may need to reverse engineer, going from the physical as-is to the logical as-is, and from the logical as-is to the conceptual as-is. We may then look at our new system: if we're not going to change the requirements, I can move directly to design; if I'm going to change the requirements, I have to go back and re-engineer the requirements and then redesign the data, and only then do I get around to discussing what should be the new model for the new system. If I don't follow that path, and simply fork the data from the existing as-is implementation to the new to-be implementation, which happens over and over again, you end up with the wrong fields and the wrong data, and people simply talk about
data problems or say the system's not working, but that's usually what's happened. All of this, of course, involves high degrees of metadata management. I'm going to list all of those bits and pieces there for you, just so that you know how to go back and unpack that diagram if you ever want to take a look at it again. Take a look at another concept here that I think is absolutely key and fundamental. It's almost our last one, but we're getting close. All data modeling takes place within this framework of boxes: as-is and to-be. As-is is what exists; to-be is what we hope exists, right? And conceptual, logical and physical models; we talked about those as well. And again, validated or unvalidated: if they're not validated, we can't use them. What we're trying to do is map this transformation into some part of this framework. It's just that simple. We're taking some piece of a model (it's either conceptual, logical or physical) and we're going to transform it into something else, and there's a very nice way of going through that process. As we're doing that, we lose one aspect of this that, again, colleges and universities and textbooks do an absolutely terrible job of describing. So what we're talking about here is: I've taken that previous diagram, and the as-is is now the bottom half of this technology-dependent layer, and the logical layer is above it, at a layer of abstraction. And we tend to tell people what you need to do, which is what I just told you a minute ago: you start off with your physical as-is, you reverse engineer that to your logical as-is; if you don't need to change your requirements, that becomes your logical to-be, and then you go back to your physical to-be. And that sounds reasonable, and even then it's hard to get people to follow those pieces. As I mentioned before, most of the time they just go from physical as-is straight to physical to-be, but that's not the way it works. Same diagram; let's look at it with a slightly different spin on this. We move this up to the abstract layer here, and at
that point, the only reason we're doing this is to bring in other logical as-is data architecture components, because we are moving or changing the system. Notice the green box now changes to a blend of orange and green, and only when we've done that do we move it over and go through it. I have looked at dozens and dozens of textbooks; none of them show these examples. Let's look at how data models can be used to support strategy here now, in the last three minutes. First of all, if we do this, we can create more flexible data structures. There are data structures that are objectively more flexible than other types of data structures, and if you don't know how to do it, or the difference between them, you are unlikely to end up with flexible data structures. Flexible data structures lead to cleaner, less complex code, which ensures that we can now measure effectiveness more correctly, we can build in future capabilities, and we can understand better about merger and acquisition strategies and concepts. Here is a data model of one that I used in many forms in the old days, and they told us that eventually they wanted every employee to also be a manager, but they didn't have any way of supporting that concept at the manager level because they had a badly designed data model. So I was always getting beaten up with, oh, you are not doing enough selling, and I'd say, I can't even have a sales ID. It didn't make any sense to them at all. Again, if your systems weren't designed to integrate, what's the likelihood that they will happen to do this? So data models help you by achieving effectiveness and efficiency goals, and they also help you (sorry, I went too fast on that) not just with effectiveness and efficiency, but also by providing the organizational dexterity that you need in order to implement this. If I am looking at one software engineering effort, the data modeling effort is going to be broader than a typical program or typical piece of software that
I am purchasing, and we need to make sure that that actually occurs across multiple parts of our data set, so that data flows throughout the entire model, allowing us to address decision-making, analysis and scoping problems that are charged with data interchange or effectiveness, and the goals that we are connecting to are strategic as well as operational. So while I am getting ready for some questions here at the top of the hour, I want to go through just one more thing with you, which is how you use models. You use models to store and formalize information, to filter out extraneous details, to define what is called the essential set of information, to help understand complex behavior, and to gain information from the process of developing and interacting with the model and evaluating various scenarios. And when you look at it from a business model perspective, the goal must be shared understanding. We can't have disagreements, or we will have insufficient communication. It's largely automated, and therefore it's got to be perfect when we automate it. Data modeling characteristics are going to change over the course of the analysis time. You're going to incorporate these purpose statements, rather than just definitions, because the use of the modeling is much more important than the selection of a specific modeling method. The models are living documents, and they allow you to go through the entire process. And with that, we're back at the top of the hour, and I will turn it over to Shannon, and we'll see if you guys have questions for us. Peter, thank you so much for another fantastic presentation. Of course, there are a lot of great questions coming in. If you have questions, feel free to submit them in the Q&A section in the bottom right-hand corner of your screen. And just a reminder, to answer the most commonly asked questions: I will send a follow-up email to all registrants by end of day Thursday with links to the slides and links to the recording. So,
Peter, diving in here: how is data modeling different in the big data world? That's a really good question. One of the complaints that people have had about data modeling is that it takes time. There's a statistic that has no basis in fact, but we've not found a reason to disagree with it, which is that it takes about an hour per attribute to do a model. So if you're looking at a model (and I showed you a couple of complex models in here) and somebody says, you know, how long is it going to take and why is it taking so long, the answer is, well, you need to allocate enough time so that you can actually understand the model at a very detailed level. And so, given that, let's look at how these models are created and used. This is an actual model for some system that we've worked with at some point; obviously I've neutralized it so nobody can see what the actual model is. People say that takes too long: I don't want to spend that much time doing it, I want my data faster. Well, the promise of what we've been calling big data (and again, don't get me started, because it's a terrible name; we might as well have called it polka-dotted data, it would be just as useful) is that it, and some of the modeling technologies and particularly the IT technologies that we're using, allow us to do what's called understanding the data on read. So in this case, for this model, we have to prespecify the data model, which becomes the schema for the database, and then we can actually get some functionality out of it. In the big data world, there is oftentimes utility that can be gained by looking at the data that's coming in, as it's coming in, before it's been formally interpreted. Now let me give you an analogy for how this is playing out in the real world. Big data is kind of like the way snakes use infrared sensors. Snakes, when they're crawling through the grass... some of them have infrared sensors on the sides of their heads. Now, if you're crawling through the grass and you're a fairly small animal like
a snake, and you sense that something is rustling next to you, you might put your head up. It might be a rat, and you might say, I'm a snake and I eat rats, and therefore no problem, I'm now less hungry than I was. However, it could also be a mongoose, and mongooses eat snakes, so when you put your head up, that's not a good thing. But your infrared sensors can help you understand whether it's a big warm thing or a rat-sized warm thing. With the rat-sized warm thing, you may say, I'll take a chance to see if I can eat it, whereas with the big warm thing, I'd take a chance it's a mongoose; I'll keep my head down and slither on out of the way. My point here is that modeling is not unimportant when you're doing big data, but the reason it's sold to most organizations, and to executives that don't understand this, is that it allows you to get access to some of the value of your data somewhat quicker. But you pay a price to do that; the price you pay is efficiency. There's always a trade-off in this. So in a big data world, people will say to you, you don't need a data model. You still need a data model to understand it, but you may not need it right away. So, long answer to the question: in the big data world, they're trying to bypass some of the formality and rigor that I'm talking about has to be part of this process, and in some instances it can be done well. However, if you've seen the thing on LinkedIn, there's a wonderful thread going around right now that says the era of the data lake has come to an end. This was going to be a wonderful thing, where we were going to put together a big, big data lake and everything was going to be accessible to us. It turns out most of them are turning into data swamps. There are certain types of data it will work well for, and certain business situations where it will work. Your job as a valued member of the organization is to figure out when it's appropriate to use technique A,
I hope that answered the question, because I think it's a really good one.

It was a great question and a great answer, always thorough. I love it. So Peter, can you explain identifying versus non-identifying relationships, and when is it best to use one or the other?

That's a nuanced question we didn't really cover here, but let's go back and do a little bit with it. I'll pull this model up. I'm not really showing you much on this model, but an identifying relationship is one where we can identify something on the other end of the relationship. For example, I mentioned the data analysis piece we do; our formal name for that is normalization. In normalization, as I mentioned before, we'd probably want to make sure that if I'm going to charge a customer, I charge exactly one customer for exactly one product that was ordered. You can use an identifying relationship there to identify a product code, and the product code can throw you into a taxonomy that says these are the types of products. For example, it might be a hazardous-material product, and if you identify it as hazardous, you know you can't just pick it up and put it in the back of your car; it needs the special handling and special delivery capabilities associated with it, as opposed to what you get from a strictly non-identifying relationship. Each of these has different ways of doing this, and there's a lot more to data modeling besides what I'm showing you in this one hour. What I hope you're getting is that if this is an important part of what you're doing, you'd better know the difference between those two relationship types, or you're likely to make a mistake in the way you develop your systems. Worse still, and we've seen this happen a lot, many of the software vendors do a terrible job of implementing data in their software, so you end up with problems in the actual software; we've done a good job of helping people debug some of those issues. Again, a great question, pretty detailed, but I hope that provided an answer.

I love that you mentioned organizational needs should drive the IS requirements, with data models in between. The question is: how do you integrate data modeling with use case analysis and design?

Great question. Again, what we teach in school is that you should be able to do a use case against a particular model. It turns out it's the same data model; let me pop this one up. This data model could have been developed in response to some specific user requirements. One thing we're really good at in IT is labeling things poorly, and I don't like the term "use case," but everybody seems to know and understand it. This use case might say that a customer needs to be able to place orders in our system, and those orders should contain products in our system. So this data model, even though I'm showing you just a tiny snippet of it, could have come from a use case. If you tried to develop software without developing the data model, you might have ended up in a different situation. Here I've got a customer, and an order is placed by one and only one customer. What if I've got two customers who want to place one order together? If that wasn't covered by the use case, I can't implement it in the data model, and the system will say, "We can't do that here," whereas of course we know the customer is always right, and we never want to tell the customer, "You can't do that"; we almost always want to find a way to do it. So data modeling sits downstream from requirements but upstream from coding; notice the sequencing. Get the requirements understood, understand them from the data perspective, and once we understand the data requirements, then and only then should we start the process of developing code.
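The one-customer-per-order rule Peter describes is something the physical schema either enforces or relaxes, and the two-customers-on-one-order requirement forces a structurally different model. A minimal SQLite sketch, with invented table and column names: the foreign key on the order row pins it to exactly one customer, while an associative (junction) table with a composite key lets several customers share an order.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")

con.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY)")

# One and only one customer per order: the FK lives on the order row itself.
con.execute("""
    CREATE TABLE cust_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
    )""")

# If the requirement becomes 'several customers may share an order', the
# model changes shape: an associative table whose composite primary key
# identifies the relationship itself.
con.execute("""
    CREATE TABLE order_party (
        order_id    INTEGER NOT NULL REFERENCES cust_order(order_id),
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        PRIMARY KEY (order_id, customer_id)
    )""")

con.executemany("INSERT INTO customer VALUES (?)", [(1,), (2,)])
con.execute("INSERT INTO cust_order VALUES (100, 1)")
con.executemany("INSERT INTO order_party VALUES (?, ?)", [(100, 1), (100, 2)])

# Two customers now share order 100, only because the model allows it.
n = con.execute("SELECT COUNT(*) FROM order_party WHERE order_id = 100").fetchone()[0]
print(n)  # 2
```

The point of the sketch is the sequencing Peter insists on: which of these two shapes you code depends entirely on a requirement that has to be understood before the schema exists.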
Now, one of the things that happens here is that people say, "Well, we're not developing much code these days." Good, no problem; you can still use these techniques. You can use these data models to evaluate software packages, and it is now considered best practice to ask every vendor that proposes to sell you software for a data model of how they handle data internal to their software. Half of those vendors will not have the slightest idea what you're talking about, and you should immediately exclude them from any further consideration. The other half will say, "Ah, wonderful, a smart customer," and send you a logical data model or a physical as-is data model of their system, because they know you'll look it over and see whether your system is compatible with theirs, and the data model is the only thing that's going to tell you that. Two systems might both say, "I've got a customer number," but if one customer number handles only six digits and the existing one needs to handle 24 digits, you have a problem. I'll just mention a certain large packaging company that's having to expand its customer number; its remediation process now covers well over 10,000 systems that will have to change, because the data model was wrong in the first place. So again: requirements, then this model I'm showing you on the screen, then coding. If you do it the other way around, you run the risk of creating a seriously inferior product, or of having to rework and redo your data implementation and the code built on it, because you didn't fully understand the relationships between these entities.

What steps are taken in validating a model?

Great question. One of the first things we have to do is make sure we have access to people who understand how the business works and how they use the data from the business perspective.
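The customer-number anecdote, a six-digit field meeting a 24-digit reality, is exactly the kind of incompatibility you can check mechanically once a vendor hands over their physical model. A toy comparison; the field names and widths below are invented for illustration.

```python
# Declared widths from our as-is model and from a vendor's physical model.
ours = {"customer_number": 24, "postal_code": 10}
vendor = {"customer_number": 6, "postal_code": 10}

def width_conflicts(ours, theirs):
    """Fields the other schema cannot hold without truncating our data."""
    return [field for field, width in ours.items()
            if field in theirs and theirs[field] < width]

problems = width_conflicts(ours, vendor)
print(problems)  # ['customer_number']
```

A one-line check like this is only possible because both sides produced a data model to compare; without the vendor's model, the mismatch surfaces after purchase, in remediation.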
So the question of validation starts with putting up a data model, and I'll go back to the one I showed you. This, by the way, is a Defense Department model, which is why I can show it to you: it's out in the public domain, so we didn't have any release problems. Give me a second to find it. There we go. Given a data model that looks like this, one of the things you can do is say, "If I'm going to buy a new software package, I want to look at these pieces and see how that works." These data models are living documentation. In most cases you don't know what's there; you have a blank sheet of paper or a blank screen, and that's a terrifying thing for people. So one of the things those of us in the modeling community do is help people develop these models by talking conceptually about the business entities (again, they correspond roughly to objects in an object-oriented context, so there's no problem from that perspective). We put up "person," ask what the relationship is between a person and an employee, and sketch some things. "Is that right?" "No, no, no, that's not really the way it should be; it should be a little different, even from that." Again, I'm just taking the person-to-employee relationship, and notice I'm changing it from zero-to-one to zero-to-many. That small, subtle change can make a huge difference in the way your systems work.

Do you have thoughts on cultural nuances and their impact on data modeling, especially when data is gathered, analyzed, and shared globally?

Wow. Again, great question. One of the more interesting parts of all this is that there are all kinds of people who like to do things different ways. I've actually gotten in trouble in some organizations because they said, "Oh no, you're exposing intellectual property." There are all kinds of nuances you have to be careful of. So don't start off with a formal method. Just write some stuff down; then, after you're out of the room, arrange it and make connections between your objects. Then come back and say, "If I heard you correctly, this is the relationship you were describing," and if they agree, you can take that relationship and transform it into a business model.

By the way, I forgot to say something on the validation part too. Let's go back to the most basic level; I should have started there, I didn't think the answer through, but that's all right. Somebody asks, "How do I validate the data model?" One of the first things you ask is: what is the goal of testing anything? The goal of testing anything is to break it, and that's very different from the way most people think about it. Most people say, "Oh, test it, just make sure it runs okay. Did it run? It was fine." You know: "Hi honey, how was your day?" "Fine." "How was your day?" "Fine." Is there any information transferring there? No, and that's the problem. If instead you say, "Your job as a model validator is to break what I have," and you reward them for breaking it, guess what: you're going to find errors in your data model that you didn't think about, and that is hugely important. Which is why, not just from a cultural perspective, it's really important to say, "Look, I'm going to write some stuff down and go away. I won't do it in front of you, because that's boring; nobody wants to watch you do a data model. I'll come back and say, here's what I heard you say, and your job is to prove it wrong." If they prove it wrong, you know what you say? Thank you. Because if you say thank you, they'll come back expecting there's more you want, and they'll take shared ownership of the concept. But there are a lot of cultural influences.
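Peter's "the goal of testing is to break it" becomes mechanical once the rules live in a schema: feed it rows that should be rejected and reward every rejection. A small sketch with an invented employee table; since SQLite ignores declared VARCHAR lengths, the length rule is written as a CHECK constraint.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE employee (
        emp_id   INTEGER PRIMARY KEY,
        badge_no TEXT NOT NULL CHECK (length(badge_no) = 6)
    )""")

def try_to_break(con, rows):
    """Deliberately feed bad rows; a rule is validated when it rejects them."""
    broken = []
    for row in rows:
        try:
            con.execute("INSERT INTO employee VALUES (?, ?)", row)
        except sqlite3.IntegrityError:
            broken.append(row)   # the model defended itself: reward this
    return broken

attacks = [(1, "ABC123"),        # valid; should get through
           (2, "TOOLONGBADGE"),  # violates the length rule
           (3, None)]            # violates NOT NULL
rejected = try_to_break(con, attacks)
print(len(rejected))  # 2
```

If a row you expected to be rejected slips through, you have found exactly the kind of model error the "break it" reviewer is being rewarded for.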
It's just like in the Middle East: you never take your shoes off, prop your feet up on the table, and point them at your customer. That's about the worst way to make friends in the Middle East, and if you don't know that, there are lots of little subtleties you still have to learn. I think it's a great question.

Are there differences in data modeling practices for conceptual, logical, and physical data models?

Absolutely. Let's go back to our schema, the ANSI/SPARC schema I was talking about earlier, with the three layers of the model. Shannon, it's probably the wrong time to ask if I'm showing the right screen; I assume I must be, or you would have yelled at me long ago. Now where is my ANSI model? That, of course, is the problem with having so many slides. There we go. The question, again a good one, is whether there are different purposes in doing a conceptual model versus a logical model versus a physical model.

Let's start with the physical model. The physical model has to be the model that runs in your system: there should be a one-to-one correspondence between every element in your as-is physical model and every element in your actual database. When you move to a logical model, what you do in a reverse-engineering context is remove anything that is physically dependent on the type of database: Oracle does it this way, another product does it that way; different ways of looking at it. The logical layer is the transformation between the conceptual and the physical. The conceptual model shows only the things you think the business users will care about. A business-user model might look like the model I showed at the very end, where I was redoing the model a couple of times: it just gives you the concepts and says these things are related to those things.

Let me give you a very specific example. We were working with Circuit City, back when Circuit City was around, and somebody was trying to build them a very large customer data warehouse. We pointed out that it was very hard to understand customers from it. They didn't really care what customers' ages were (though that's somewhat important); the most important dimension was behavior, and behavior, as far as Circuit City understood it, was simply what customers bought at Circuit City. So here was a whole data model at the conceptual level with no concept of customer behavior in it. It had all sorts of demographics about customers, where they lived, how much they made, and so on, but it didn't tell them anything about what customers bought. If I've got a wealthy customer who is only buying CDs from Circuit City, that's not a customer Circuit City could base its business on. Oops, did I just tell a secret? Maybe that's one of the reasons Circuit City is no longer part of our industrial fabric here in the United States. There's a great movie on that which can give you a little more, and a guy named John Morgan who will happily tell you the story if you buy him a beer, because he followed the whole thing all the way through.

Again, each of these models serves a different need: the conceptual model is closer to the business users, and the physical model is closer to the technical implementation. One way people think about this is that the layers correspond to different rows in the Zachman Framework. There are good, valid arguments over exactly which row each sits at; just think of the conceptual as the highest level of abstraction and the physical as the lowest. And within your team you need to agree, because if you're not speaking the same language on your team, you run the risk of the team stomping all over each other.
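One lightweight way to keep the three layers distinct on a team is to trace a single business concept down through all of them and check that nothing gets lost on the way to the physical model. The structure below is purely illustrative; every name and type in it is invented.

```python
# Illustrative only: one conceptual entity traced through the three layers.
model = {
    "conceptual": "Customer",               # business concept, no detail
    "logical": {                            # attributes, still DBMS-neutral
        "customer_number": "identifier",
        "customer_name":   "text",
    },
    "physical": {                           # one possible implementation
        "table": "CUST_MASTER",
        "columns": {
            "customer_number": "CHAR(24)",
            "customer_name":   "VARCHAR(100)",
        },
    },
}

# Quick consistency check: every logical attribute must land somewhere physical.
missing = [a for a in model["logical"] if a not in model["physical"]["columns"]]
print(missing)  # []
```

Swapping in a different DBMS changes only the "physical" entry; the conceptual and logical layers, being implementation-independent, stay put, which is the whole point of the separation.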
Somebody like David Hay can give you a lot more detail on that, and I know he's listening on the call, so David, maybe you'll want to chime in at some point. Anyway, great question; hopefully that answers it.

Yes, indeed. So: selling the value of data architecture is a challenge, and it's becoming more challenging in a DevOps and agile world. What are the value propositions of data architecture in a DevOps and agile world?

The real key is this: if, in the middle of an agile sprint, you realize you have misinterpreted, not fully explained, or simply do not understand a data requirement, you need to pull the ripcord on that sprint and stop it, because any more work you do in that area will be absolutely unproductive. That gives you the opportunity to set up specific prerequisites to agile development or DevOps (they work the same way conceptually at this point) and to use the data model as a gating function. If we have not fully understood the data requirements, then, excuse me, we are not ready to start coding. And you can't have an understood data requirement without a data model. So if you don't have a data model, you should not be doing any DevOps work; the model is a necessary but insufficient prerequisite to success with these technologies.

And is it possible to produce good data models that stand the test of time and interoperate when you are working in two-week agile sprints, usually without the benefit of knowing the big picture?

That's a really good question, and the short answer is no, not by themselves. Working in two-week sprint increments is a very good way of refining your conceptual user requirements during the sprint, but there has to be a program (not a project, but a data program) running in the organization alongside them. If you do not have a data program, the only thing the agile sprint will do is help speed up the coding.
And you'll still end up with embedded data-model errors that can be very substantive. In fact, I can give you a couple of instances where there were almost bet-the-company moments over issues like that; I think I save those for the data quality talk we do, but there are a couple of banks in Japan that have had some issues, little minor things that occur. The key to making this work well is a well-established, mature data program, and I mean "program" in the PMI sense, the Project Management Institute's definition of a program. If you have a program, then DevOps and agile can work very well. If you don't, if you're trying to do your data modeling at the same time as your software coding, which is what they teach you in school, well, you'll achieve the same results most everybody else has, and those generally have not turned out well. If you're interested in those results, I have a couple of papers that discuss them in very specific detail, and Dave McComb has done a fabulous job of writing about the problems in the big software and application development efforts. So again: if you're happy with the current situation, don't change anything. But if you want to do better, yes, get a data program in there to complement what you're doing in DevOps; otherwise it's just throwing money down the drain.

And Peter, do you find developing a canonical data model useful as a reference point for shared data definitions across different systems, to drive a move toward consistency across multiple data models?

Let's unpack the question a little. It was about a canonical, or reference, model. I'll use Deutsche Bank as an example, since it's one I've published on, and I'm certainly not telling anything bad about Deutsche Bank. Deutsche Bank had a series of back-office trading systems they had used for years and years that were not documented; that was one of the ways I got involved with them, helping to understand what documentation was there. A wonderful colleague of mine named Diana Elman actually did all the work, and I remember very clearly her CIO telling me and others several times: "I don't understand what it is that Diana does. But I know that when I go to my office in Hong Kong and see her data models up on the walls, people are not writing luncheon menus and recipes on those data models. They are using them to uncover parts of my business that I don't understand. So while I don't understand what her model does, I understand its utility and how it is used in the larger sense." Some organizations understand that and do a great job with it; others do not, and it can be very, very difficult to work in an organization that doesn't understand what you're doing with these models and why. As for "seeing the value of data architecture is a challenge, and it's becoming more challenging": we already covered that with agile, so let me move on.

And we have a whole conference coming up in the fall, too, Diana, I know. Okay, that question's answered; we've got it. So: can you reiterate the importance of starting with the conceptual and logical models, and not going straight to where the developers want to go, which is the physical?

Sure. The physical is the tangible part, so it's natural to understand why people would want to start there. But let me put the framework up so everybody understands. Again, modeling happens in various contexts, and what I showed is that you've got three stages (conceptual, logical, and physical) and then you have your as-is and your to-be. If you're building a new system, that's forward engineering; if you're working from an existing one, that's reverse engineering. Okay. Now, the question was: why can't you just go from the upper-right-hand corner?
That is, directly from the existing as-is physical model to the new to-be physical model; from O4, if you will, to O9 on that diagram, or O1 to O9, those two quadrants. The reason that's a bad idea is that the new system may use data differently than the old system does. That's why I say it's a best practice to get a copy of the internal data model a vendor uses in their software before you purchase it, and to use it as a purchase discriminator; if you don't, you run the risk of something bad happening.

Let me give a silly example, which should be easy enough to follow, though it turns out not to be in practice. Pretend the upper-right-hand corner, the as-is physical model, contains your name as one field: my name, Peter Aiken. Now suppose the new system uses separate fields. What the consultants will do is come to the people who use the old system, hand them a spreadsheet, and ask, "What goes in that first piece?" They'll say "name." "Great, name over here is this column, so we'll move everything in this column of the existing database to this column in the new database." And I end up with things like American Airlines calling me Peter Haynes, which is my middle name, instead of my last name. Can you detect a data quality problem there? They did exactly what I just told you they shouldn't do; of course, it's exactly what they did. And that's a minor thing. What if you've got connections attached to that record? Not airline connections, but say I had my million miles (I don't have a million miles, but say I did) and it got disconnected. If I was a million-mile customer, I'm going to be pretty upset, and I'll probably switch airlines if you drop my million miles. You can drop entire series of transactions if you drop a field that no longer works: it was a key in the old system, it's no longer a key, and you need a new key in the new system.
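The airline name mix-up is a reproducible column-mapping bug. The sketch below uses Peter's own example name; both functions are invented illustrations of the pitfall, not anyone's actual migration code.

```python
# Old system: one free-text name field. New system: separate first / last.
old_row = {"name": "Peter Haynes Aiken"}

def naive_migrate(row):
    """What the spreadsheet mapping amounts to: split and keep two pieces."""
    parts = row["name"].split()
    return {"first_name": parts[0],
            "last_name": parts[1]}   # grabs the MIDDLE name as the surname

def careful_migrate(row):
    """First token is the given name; the final token is the surname."""
    first, _, rest = row["name"].partition(" ")
    return {"first_name": first,
            "last_name": rest.split()[-1] if rest else ""}

print(naive_migrate(old_row)["last_name"])    # Haynes  <- the data quality bug
print(careful_migrate(old_row)["last_name"])  # Aiken
```

Even the "careful" version is only right for names that fit its assumption; deciding how the old field actually maps onto the new ones is precisely the as-is-to-logical analysis the answer goes on to describe, not something a column-to-column spreadsheet can settle.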
So the importance of doing exactly what the questioner asks: go from your physical as-is to your logical as-is, and maybe to your conceptual as-is; go across the top row. If you're not going to change the requirements, you can simply go from physical as-is to logical as-is, and from logical as-is to logical to-be, because that's where the business rules come into play. Business people are not going to be able to understand the data models on the right-hand side of this screen, because those deal with Oracle and Hadoop and internal components and techie stuff they don't understand. But at the logical level, the models contain business concepts: they say things like sales and receipts and products. That process has to occur. Sorry to sound arrogant here, but I have made literally millions of dollars over the last 30 years undoing the mistakes of people who went in the other direction, and more importantly, it has an impact on things like national security and corporate integrity. So I hope that answered the question. It's absolutely the key takeaway: if you got nothing else from this, do not let anybody go directly from physical as-is to physical to-be. It is one of the worst mistakes you can make.

We have a lot of great questions still coming in, Peter, but we only have two more minutes, so let me see if I can throw one more in. How do we enable data models to represent many types of physical data stores, such as Kafka, Elasticsearch, and Aerospike, with a lot of unstructured and semi-structured data?

Great question, and again, this is where a conceptual model can help: you model a class of stores, a class of modeling types. When you look at the ANSI/SPARC stack here, the multiple user views may also represent different physical representations. So you can have a community view of the database at the conceptual level, while at the physical level some of the data is stored over here in a document store, some over there in a non-tabular database, and some over there in a tabular database, and you pull all of those together. I know that's a real short answer, but we're short on time, and that, of course, is what the community is for: to come back and talk about these things more offline as we go a little further.

It is indeed, and the community Peter is referring to is community.dataversity.net, where you can continue the conversation; there are lots of great forums and conversations going on there. Peter, thank you so much for another fantastic presentation. As you mentioned, that is all the time we have for today. I love all the engagement: so many great questions have come in and are still coming in. Just a reminder, I will send a follow-up email by end of Thursday with links to the slides, a link to the recording of this session, and all the other information. And as Peter is showing there, August is coming up: the next webinar in the series is Data Management versus Data Strategy, and we hope you all can join us for that as well. Peter, thank you so much.

Thank you, everybody. I hope you all have a great day. Have a good one. Thank you, Shannon, as always.