Hello and welcome. My name is Shannon Kemp and I'm the Executive Editor for DATAVERSITY. We would like to thank you for joining today's DATAVERSITY webinar, Trends in Data Modeling. This is the latest edition of a monthly series called Data-Ed Online with Dr. Peter Aiken, brought to you in partnership with Data Blueprint. So, without further ado, let me give the floor to Megan Jacobs, the webinar organizer from Data Blueprint, to introduce the speakers for today's webinar. And now we can see the presentation. Perfect. Megan, hello and welcome. Thank you. Hello everyone and welcome. My name is Megan Jacobs and I'm the Webinar Coordinator here at Data Blueprint. We are pleased that you found the time to join us for today's webinar on Trends in Data Modeling. As always, a big thank you goes out to Shannon and DATAVERSITY for hosting us. We'll get started in just a few moments, after letting you know about some housekeeping items and introducing your speakers for today. The format is a one-hour presentation followed by 30 minutes of Q&A. We'll try to answer as many questions as time allows, but feel free to submit questions as they come up throughout the session. To answer the two most commonly asked questions: yes, you will receive an email within the next two business days with links to download today's materials and any other information requested during the session. You can find us on Twitter, Facebook and LinkedIn. We set up the hashtag DataEd on Twitter, so if you're logged on, feel free to use it in your tweets and submit your questions and comments that way. We'll keep an eye on the Twitter feed, and we'll include answers to those questions in our post-session email. Now, let me introduce you to our presenter. Peter Aiken is an internationally recognized data management thought leader. Many of you already know him or have seen him at conferences worldwide. He has more than 30 years of experience and has received many awards for his outstanding contributions to the profession. Peter is also the Founding Director of Data Blueprint. He has written dozens of articles and eight books; the most recent is Monetizing Data Management. Peter, I'll let you introduce our guest speaker for today. Thanks, Megan. So we're pleased to have with us Stephen McLaughlin. Stephen has had a very, very interesting career. He's been the head of marketing and PR for a game development firm. He spent a lot of time in application development. And then, like many of us, he came to the realization that if we could get the data part of the application stuff right, things would be easier, not easy, but easier. And that's, of course, all we're trying to do: help people get closer to easy on this. Stephen, like all of the Data Blueprint associates, is a certified data management professional. He has a computer science degree from VCU. And he's done a number of different things, including some activities on the outside, where he has his own podcast. We forgot to put the link in there; we'll include it in the follow-up. But tell people what that podcast is about. So that podcast is about the broadly appealing topic of historical miniature wargaming. I'm sure almost everyone out there knows what that is. Right? Okay, so nobody knows. That's right. They're basically little toy soldiers. It's a very niche little industry, and it's a lot of fun.
I have a series of websites that deal with that, mostly historical games, but also other things like board games and some of the newer stuff you guys might be familiar with as board gaming becomes a bigger trend. So, pretty awesome, very cool stuff. Again, well-rounded individuals we have here. That's one way to put it. So Stephen, terrific to have you with us today. We're going to talk about trends in data modeling, and what we wanted to do was bring all of you into what we're seeing in the marketplace. While there's a lot of talk at various events and the like, Stephen is on the front lines, seeing how this stuff actually gets implemented. Now, just to start out, for everybody's background: we commonly believe here at Data Blueprint, and I think DATAVERSITY can be included in that group as well, that data is the most powerful yet underutilized and poorly managed organizational asset. Assets are resources controlled by the organization that are expected to provide some benefit in the future. It's a very simple proposition, but harder for a lot of people to actually put their finger on. Data really is your sole non-depletable, non-degrading, durable, strategic asset. And when you compare and contrast it with the other types of assets we have: some people say it's the new oil; some people say it's the new soil, because you can plant things in it and they grow; we've even got one group that says it's the new bacon. I would take that a step further and say it's the new mayonnaise. There we go. All right. When you compare data assets to financial assets, real estate assets, and inventory assets, we manage those other assets in a very different manner than the way most organizations manage their data. So our goal here is to help you all unlock specific business value in your organizations by strengthening your data management capabilities, by providing tailored solutions to the specific business problems you're having, and by building lasting partnerships with you overall. So let's dive into the presentation. Stephen, walk us through what we're going to talk about today. So, you'll notice there's a bit of a focus on NoSQL towards the middle there. Those are some emerging trends that I think might be kind of interesting. We're taking a high-level approach today; I'm not going to get too far under the hood, but it'll be interesting to approach those from a broad spectrum. Starting from the beginning, we're going to go back over what a data model really is. We hope that most of you joining us are already familiar with those concepts, but just in case, we're going to walk through what conceptual, logical, and physical data models are, and specifically we're going to tie that in by addressing what issues poor data modeling can introduce. And hopefully you all are familiar with this; there are a number of issues it can introduce. We're going to go into different models and different uses. The focus is really going to be the right tool for the right job, as opposed to one model to rule them all. So we're going to talk about, again, like I said, some of the NoSQL architectures that are coming to the forefront now. And then we're going to tie it off by talking about how all of this is changing. We're going to discuss patterns and reuse, the growing abstraction for application and data sharing; we call that the data-sharing world, right?
These APIs that are going to make data available to everyone and anyone who wants it. And then we're going to end with scaling out, not up, and a few thoughts about things like sharding, and making sure you shard yourself. We knew that was coming. Yeah, I'm one of you guys. Great. Well, let's dive in. What is a data model, as far as that goes? Right. Okay. So a data model, as you all are familiar with, basically organizes your data into elements and standardizes how those data elements relate to one another. Pretty straightforward. You've probably seen them all throughout your daily life. In Data Modeling Made Simple, Steve Hoberman said a data model is a wayfinding tool for both business and IT professionals, which uses a set of symbols and text to precisely explain a subset of real information to improve communication within the organization and thereby lead to a more flexible and stable application environment. That's a bit of a mouthful, but it really is a pretty exhaustive explanation right there. Another way to think of it, too, and this is the definition Steve Hoberman concentrates on, which, again, is what you do in practice, is the idea that we all really need to be on literally the same sheet of paper. One of the other things you may not know about Steve is that he's also a musician; he had a band in the past and did well over a thousand shows, living out of the back of a van. It's just not as glamorous as it sounds. But as a musician, you can appreciate the fact that we need to be on the same sheet of paper. Absolutely. It's a common language for you to speak. Precisely. And that's exactly what we're trying to do with the models here: provide that same kind of commonality. Right. Now, data models are expressed as architectures. The main idea is that attributes are organized into entities and objects, and that's typically represented graphically. Uh-oh, I lost my screen for a second. Entities, or objects, are things whose information is managed in support of strategy. Now, let's look at the next one; I'll show a couple of examples. Yeah, there are a bunch of examples here of attributes organized into entities, and entities organized into models. We'll get to architecture, though it's a little bit harder to show examples at that level, and that's the reason we don't here. Combinations of attributes and entities are structured to represent information requirements. That's pretty straightforward: you're thinking of a customer, an address, a shipment, a product. And poorly structured data constrains organizational information delivery capabilities. Right? If you have poorly defined customers and you're trying to ask questions about your customers, that's going to add all kinds of complication on top, as I'm sure most of you have come to be familiar with. And many of you have heard me make this analogy as well. One of the things we absolutely try to get the clients we work with to do is to not label things at the wrong level of abstraction. For example, customers are almost always the wrong level of abstraction.
A very simple example: if you say "mail a coupon to our customers," and your customers are current customers, you're telling them that the thing they bought yesterday is now cheaper as a result of this coupon, and they get a very bad taste in their mouths. Also, we almost always need to take our customers and divide them up into different subtypes; current customers versus potential customers is one way to think about it. And I think you'll see a common theme in what Peter's talking about here: a lot of thought needs to go in at the front end if you really want to get the most you can on the back end. So we've looked at going from more granular to more abstract, and now the last part of it. That's right. The models are organized into broader architectures, and when building new systems, architectures are used to plan development. More often than not, data managers do not know what their existing architectures are, and therefore they can't make use of them in support of strategy implementation. We also see siloed systems popping up whose owners don't even realize that the next building, the next office, maybe the next cubicle over is already doing something that they might do as well. The reason we don't have examples at this level is that architecture is a much broader topic, and it takes a lot longer to walk through. So, once again, you all in our audience have your own examples of architectures, and you can practice the process of explaining to people how useful or not useful a particular architecture is. And that gets to another component, something we also like to preach to everybody: don't put a model together unless you have a purpose statement for that model. We're doing this model in order to achieve X, Y, and Z, whatever X, Y, and Z are. If not, it just becomes an exercise: nice practice, but probably not business value. And of course that's where we really want to concentrate. That's absolutely right. So I'm going to review what will be basic knowledge for some of you, but this may be new information for a few of you listeners out there. We mentioned earlier that there are basically three levels: the conceptual, the logical, and the physical. The conceptual data model represents entities and relationships. It should identify the domain and scope of the data; this is exactly what Peter was just talking about. At this level, it really should be easily understood by business users, in order to communicate core data concepts and gather application requirements. It's exactly what Peter was talking about with music: everyone's looking at the same sheet, without going too far into the theory behind what makes that music enjoyable. This is very high level. You're saying this is a customer, and he lives at this address, just like in our example there. What isn't necessarily fleshed out at this level, although it can be, is, for example, that a customer may have many addresses, or many customers may share one address. That's pretty straightforward. That will absolutely be represented at the logical level; it's somewhat optional at the conceptual level, though I would argue that it should be there. Let's look at a specific example around this. Here's an example from the Veterans Hospital system that we were working on in the early 90s. This is an example of how data modeling was done in the past.
One of the reasons we're doing this particular webinar is that we wanted to bring you up to date with some more modern techniques, but this is a very good traditional use of the data model. You'll notice that I've circled the relationship between admission and discharge. A question occurred about that at one point in time. We can look and see that an admission is associated with one and only one discharge, and we can also then look at the data model and see what the difference is between an admission, in this case, and an actual discharge. When you look at these two here, and the circles are slightly off on your screen there, we can look at the precise definitions. One of the things this gives us is the understanding that every admission must have a discharge. Therefore, one of the things we brought to the attention of the group we were working with at the time was that clearly one of the reasons for being discharged was being dead. It's not a very happy subject to talk about, but if this is the type of thing we're doing, we have to be comprehensive about it, and clearly when you leave the hospital, you're either alive or not. Again, the model here allowed us to zero in on that particular topic and allowed people to see how that actually worked. We can take it a step further, diving deeper into the process, and look specifically at a particular entity. You'll see this entity called BED had four attributes: BED.description, BED.status, BED.sex-to-be-assigned, and BED.reservation-reason. Notice also, below it, it has an association with a concept called a room. A room can have zero, one, or more beds in it. This is another area where we discovered there was a little problem with the way the system was being specified originally. One of the things going on in the early 90s was that people were discovering this new technology called RFID. And RFID, while it was really neat, they thought would help with the problem of losing people in hospitals. I don't want to scare you, but it does happen occasionally that somebody gets lost in a hospital. And we pointed out to them that if a room could contain zero or more beds, and we were going to use the RFID to track which room each bed was in, then everything else had to be a room as well. And it turned out that meant a hallway had to be a room. Or an elevator. And an elevator, exactly right. And of course, that was what really blew their minds, because every room had to be on a floor, unless it was an elevator, in which case that didn't work. So, not that this is bad or good, but it caused some questions to be asked. And it's much easier to ask those questions when you're at the conceptual stage, as opposed to "I've already built it, and now I have to go back and undo what I did and come at it the other way." So that's conceptual modeling at a high level. Now we're going to move to logical modeling. That's right. The logical model is really just the next level down. It should be very close to the conceptual data model, but we're going to be a bit more thorough. In this case, we're going to include attributes. We're going to name entities more clearly. We're going to explicitly identify the relationships and any other metadata that we might want to put around this.
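Before we go deeper into the logical level, that admission/discharge rule is worth making concrete. Here is a minimal sketch of how "every admission has one and only one discharge" might eventually be enforced at the physical level, using SQLite from Python; the table and column names here are hypothetical, not the actual VA design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE admission (
    admission_id INTEGER PRIMARY KEY,
    admitted_on  TEXT NOT NULL
);
-- The UNIQUE foreign key means an admission cannot have two
-- discharges, and a simple query can find admissions that are
-- still missing their one required discharge.
CREATE TABLE discharge (
    discharge_id  INTEGER PRIMARY KEY,
    admission_id  INTEGER NOT NULL UNIQUE
                  REFERENCES admission(admission_id),
    discharged_on TEXT NOT NULL,
    reason        TEXT NOT NULL  -- must cover the unhappy cases too
);
""")

# Open admissions, i.e. rows that do not yet satisfy "must have a discharge":
open_admissions = conn.execute("""
    SELECT a.admission_id FROM admission a
    LEFT JOIN discharge d ON d.admission_id = a.admission_id
    WHERE d.discharge_id IS NULL
""").fetchall()
```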
This level right here, I think, is what most business users will still understand, and certainly all IT workers should understand, so this is really your common ground, in my opinion. This will be developed using a data modeling notation, UML for example, although I've seen lots of other things used, from scrawling on napkins on up, which is not the best practice. So in this case, you now see that a customer has an address. And the inverse relationship is that an address could have many customers, so this is a many-to-many relationship between the two. You can certainly think of plenty of real-world examples of that. But we've now gone beyond what a customer is by explaining exactly what attributes make up a customer. Clearly this is not exhaustive, being just an example. But in this case, for this business purpose, we've identified a customer as having a customer number, which is probably the key, possibly a surrogate ID. We're going to have a social security number, first name, last name, salutation, and a phone number. Right now, I'd actually argue the phone number should possibly be pulled out of there, but that might be over-normalizing. And then the address is going to have a street, city, state, zip, country. Pretty straightforward, but those are the things that identify each entity and how the two relate to each other. And notice we don't have anything about the customer's dog in that particular place, which is an explicit business decision at this point to say we're not going to talk about dogs. Not that we have anything against dogs; it's just not part of the business problem at this point. So it's what you include and, by definition, what you exclude as well. At the conceptual level, we weren't looking at that specificity; now we are. Right. And without going too far down that rabbit hole: how much redundancy do we see out there? How much data should have been excluded in the first place? But we don't need to go down that right now. Well, it is interesting, and Lewis and I argue about this all the time. I think that about 80% of the corporate data out there falls into the category of redundant, obsolete, or trivial. He thinks it's closer to 95%. Yeah. Either way, neither answer is good. Right. So, conceptual, logical, and of course the next step is physical. That's right. In the physical data model, we're going to describe the specific database implementation of the data. I've seen physical data models vary pretty wildly between organizations, and even a couple of different standards organizations have somewhat different definitions. I'm going with the fairly middle-of-the-road idea that it describes the data within the specific technology you're using, something like Oracle or SQL Server, and you're going to be using that notation. Consequently, the attributes will now be named according to either your business or your technology naming conventions. You're going to see your data types. You're going to have the actual table names. Your relationships are going to be built out further into the structure as it's actually going to be physically represented. And so you can see the slight variations here between this and the logical: we now see the naming conventions, the keys, and so on and so forth. Yeah.
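To make the conceptual-to-logical-to-physical progression concrete, here is a minimal sketch of one possible physical form of that customer/address example, again in SQLite from Python. The abbreviated names are hypothetical naming conventions, and the many-to-many is resolved with an associative table, which is typically a physical-level decision.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Physical sketch: the logical many-to-many between CUSTOMER and
-- ADDRESS resolved into an associative table, with data types and
-- (hypothetical) physical naming conventions applied.
CREATE TABLE cust (
    cust_nbr   INTEGER PRIMARY KEY,  -- surrogate key
    ssn        TEXT,
    first_nm   TEXT NOT NULL,
    last_nm    TEXT NOT NULL,
    salutation TEXT,
    phone_nbr  TEXT
);
CREATE TABLE addr (
    addr_id  INTEGER PRIMARY KEY,
    street   TEXT NOT NULL,
    city     TEXT NOT NULL,
    state_cd TEXT NOT NULL,
    zip_cd   TEXT NOT NULL,
    country  TEXT NOT NULL
);
CREATE TABLE cust_addr (             -- resolves the many-to-many
    cust_nbr INTEGER REFERENCES cust(cust_nbr),
    addr_id  INTEGER REFERENCES addr(addr_id),
    PRIMARY KEY (cust_nbr, addr_id)
);
""")
```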
The logical has higher levels of communication with the business, whereas this is clearly a more technically focused piece. And before we move on, let me bring everybody up to date on one thing that has happened. Stephen mentioned UML in the last slide. In the data modeling community, there used to be a bit of tension around people who used UML, because UML originally had some challenges in that it wasn't really representing the full robustness of what we do in data modeling. The people at OMG reached out to DAMA and put together a group; David Hay was instrumental in working that particular piece. And we now have an improvement in UML that allows us to do robust data modeling. So that's the first thing. Twenty years ago, people would go, oh, I'm not going to go with UML, because I've got to get really hairy with my normalization, and if I do that, I'm not going to be able to represent it in UML. The answer is that now we, in fact, can. That's right. So now, as we mentioned earlier, we're going to talk a bit about some of the consequences of poor data modeling. And they can be quite vast, actually, if you think about it. Right out of the gate, poor data modeling up front can cause data quality issues downstream. We all probably realize that if the model isn't a true representation of the business concept, it's going to impact confidence in the data. And we see that quite a bit, right? If a worker says, "I don't trust that data," well, then that data is as good as useless to them. There's also the potential for poor database application performance for reads and writes. You can normalize, and we're going to talk a little more about that, but if it's taking ages for your beautiful data model to do what it needs to do, if it takes five minutes to write one customer record, well, then you'll probably rethink your data model, or your technology and architecture around it. The data model is a good place to start looking. Yeah, we would often get the question: can you come help tune our Oracle system? And we'd figure out that it's not the Oracle system that's the problem; it's the stuff you're putting into the Oracle system. That's right, and that's a good example of it. And I think this is a big one we run into: lack of flexibility in your data model can cause difficulty aligning with evolving business requirements. You constrained yourself within the data model; you didn't think about the future and staying flexible. And so what happens is you wind up overloading fields. I mean, how many times have we seen an email field holding something like, I don't know, a date of birth, or just something that really doesn't belong there? And that causes problems for the people it hasn't been communicated to. I can stuff that in that field without asking IT, and without filling out a bunch of forms in triplicate and all the rest of it. And of course, that's where you end up with real big problems. Okay, really quickly, just because it's fresh in our minds, I'm going to respond to one of the questions, which asks whether reference tables make sense here. And I absolutely agree with the premise: a state as a free-text field, without a reference table, absolutely does not make sense. And I'm a big fan of those tables.
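A quick sketch of that reference-table point, with hypothetical names: constrain the state to a lookup table instead of free text, and bad values get rejected at write time. Note that SQLite needs foreign-key enforcement switched on explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite: FKs are off by default
conn.executescript("""
CREATE TABLE state_ref (
    state_cd TEXT PRIMARY KEY,   -- 'VA', 'CA', ...
    state_nm TEXT NOT NULL
);
CREATE TABLE addr (
    addr_id  INTEGER PRIMARY KEY,
    street   TEXT NOT NULL,
    state_cd TEXT NOT NULL REFERENCES state_ref(state_cd)
);
""")
conn.execute("INSERT INTO state_ref VALUES ('VA', 'Virginia')")
conn.execute("INSERT INTO addr VALUES (1, '123 Main St', 'VA')")  # accepted
# A misspelled state now fails instead of silently polluting the data:
# conn.execute("INSERT INTO addr VALUES (2, '9 Elm St', 'VQ')")  # IntegrityError
```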
And especially nowadays, with so many data services out there, you can actually get USPS information for addresses and make sure they're validated. So, just to answer that question quickly: yeah, I think that example is certainly something we could scrutinize over and over, because I think states should be in a reference table, and really, nowadays, address should be a validated field. But I thought that was pertinent right here. Absolutely. And then what are we going to do when we have the 51st, 52nd, 53rd states? That's exactly right. Yep. Everybody want to go back and reprogram that? That was a good question. Next: difficulty integrating data in the future. I'm going to be honest with you, this is probably the biggest issue we face day in and day out: integrating data. I think it's a bit of a recursive problem. What happens is someone doesn't trust the data, so they build their own separate system to serve their own needs, because they don't trust the other data. And then that happens again in the next office building, and again in the next line of business. And then some executive says, hey, I want to put these things together. And someone says, ooh, you know what? Not real sure how to do that. Who are you going to call? So that's a big deal. And especially nowadays, we very much believe that not having a grasp on your data is going to have a major impact on your future. So, you know, it's a big one out there. We've kind of talked about the next one already: it constrains your business agility. I won't really revisit that; I think you understand that it's basically tying you to a model if you're not careful early on. And again, if it's a model that wasn't really well thought out, then you're left with nothing but workarounds in the future. That's right. The real problem with software packages is that they tend to have data models that are not as well thought out as a data modeler might make them. That's right. And speaking as a developer, I've never, ever, ever done a workaround. Right? Okay, fingers crossed on that one, guys. And again, it can create operational inefficiency. I think that just feeds into what we've been saying. Now, this is an important one too: it limits workflow transparency. It can be difficult to tie everything together when you've got multiple systems or multiple silos, because you had to expand that way. So, I think that's pretty good. Do you have any thoughts on that? Because I know that's something we've run into time and again. The real key there: I heard silos described differently at a conference last week. Right. They're considered to be cylinders of excellence. That's beautiful. But you're right. In the sense of tying them together, we've got a lot of different parts, and probably many of your organizations have the same situation, where you have data in separate places and somebody sort of has chocolate and somebody else has peanut butter. Unless they happen to round the corner and bump into each other while they're both holding them, they never really know that was the case. That's right. So you have to engender situations where people have regular meetings. This is a part of data governance, a different webinar topic; we actually did that one last month.
But unless you have a mechanism for saying, come to the table and share what data you have with other people, it becomes very, very difficult to understand these things. And then people find out by happenstance, when we'd rather have them find out by design. That's exactly right. I think this next slide supports what we're saying. We just know that it's impacting actual business processes. It's impacting your insight. If you can't get a full view of your customers, that's a problem. Everyone wants to jump to the new trends; they want big data, they want predictive analytics. Well, guess what? You can't get to that if you can't answer a simple question like: who are my customers? That sounds dead simple, doesn't it? To say, I'm an executive here, and I want to know: how many customers do we have? That's not an easy question to answer when customers are spread across three different systems with three different definitions. This person hasn't bought anything from me in 24 months; are they still a customer? We're getting a bit into business rules here, but I think it's very pertinent to the data model. That reminds me of something that happened on one of the projects Lewis and I were on many, many moons ago. We were implementing a data model that was going to go into production, and we discovered an error in the data model. We told everybody, sorry, we just discovered this. It was a late addition, a late requirement, and when we incorporated the requirement, we could tell that the data model we were about to field was inadequate. The management said, that doesn't matter; it's going out Monday no matter what. And we were able to add up the amount of overtime it took the staff of 300 programmers over the next six weeks to go back and fix it. And the management came back and said, well, we should have listened to you guys, because clearly we know exactly how much overtime we just paid over the last six weeks, and it would have been a lot cheaper to delay the project by a couple of days and get that particular piece right. That's exactly right. Yeah. So let's move on a little bit. Okay, I'm going to dive in a little now to normalization. I'll let you take this one away. Yeah, we're not going to walk through all of the wonderful normal forms and the whole normalization process. I will say, though, that as data modelers, it's not the conversation we want to have with executives per se, but it is important to be able to explain the business consequences of not normalizing. Because it's taught so unevenly in most educational circumstances, even good, smart IT people didn't necessarily get a solid education in it, and that makes it harder for everybody to have this kind of shared understanding. So the basics are really what we're trying to set up here: we would love to have the vast majority of data models normalized to third normal form. That's right. And this is the part most people don't get around to understanding: you then denormalize, of course, for production constraints, as you alluded to earlier, Stephen. That's right. So what we teach in college and university is a model that looks kind of like this. I call it CM2, my abbreviation for common metadata modeling, which is really what we're talking about in terms of everything here.
If we start at the lower left-hand corner, we have our physical as-is. We are then taught to move to the logical as-is. Then, incorporating any changed requirements, we move to the logical to-be; that's quadrant three up there. And then finally, the physical to-be model. Now, we know as data modelers that most applications people do not understand the need for that, and they go strictly from technology-dependent physical to technology-dependent physical. That's the problem. And if we can't explain the reasons why that doesn't work, in terms that mean something to executives, which usually means explaining it in terms of money or bodies one way or the other, it is going to be a problem for us. But it's a little more interesting, too. Even if we do understand this model, let me take you through that same model done slightly differently. So, again, we start out here. This little green blob that you're seeing on your screen goes from your physical as-is to your logical as-is, your logical as-is to your logical to-be, and then from your logical to-be to your physical to-be. All right, you get that; that's the way we tend to teach it. But what's really happening here is that we're not doing this unless there's a change. Somebody said, I need to have some sort of a change. So if we go up to this logical as-is situation, something else happens here. We have other logical as-is data architectural components, other models that are out there, that need to be integrated. And you mentioned before how difficult it is to integrate things if you don't have transparency into them. Profiling can help give you some insight there, so profiling is a tremendous adjunct to the models. But if you look here, I've got some green and some orange requirements. So what we really need to do is take that logical as-is model and transform it to where it incorporates these new capabilities. I've done something clever here with PowerPoint and just sort of made it look like it's green and orange. We then take that new revised model, move it over to our logical to-be, and then drop it into the implementation context. And if we step back from this a little, we can see that anything we do in model transformation is going to be constrained within this particular framework: again, as-is and to-be across the top, and conceptual, logical, physical along the bottom. I've done one more thing in this as well, and that's that I validate the models as we're going through. That's the process of asking the users to try to prove us wrong. Now, of course, if you ask them "how is it?" and they say "fine," no information is transferred under those circumstances. But if I ask "what's wrong with this model?", everybody loves to be a critic, and they can dive in. So this space here that I'm showing on the screen conceptually represents the world in which we operate. So let's go back again and look at third normal form, and specifically talk about where things sit within that context. That's right. We'll start with what Peter touched on, what you all are probably familiar with seeing in the wild; it's certainly probably 90% of what we deal with. This is just standard third normal form, our relational system, where each attribute in the relation is dependent on the key, and this is just like the examples we were showing earlier. This is a highly normalized structure.
This is the structure that, for many people out there, just because it's been around for ages, makes sense; it meshes in your head. Sometimes I find that after I've done some intense data modeling, I see everything in my life in third normal form. I'm sure you guys have done that before. No? Okay. So, typical use cases: transactional systems, operational data stores; lots and lots of applications out there are in third normal form. I've given a quick example here of a record label, and I think you'll even see from some of the questions in the Q&A that this is not fully normalized: we're seeing address information in the same table as the record label. And in fact, what if a record label has multiple addresses? So it's not a perfect model, but I think you get the idea as an example. And what we're talking about with third normal form here is really a class of modeling for problem solving. That's right. Third normal form is a way of fully understanding the model in its most flexible, most adaptable form. If we go back a couple of slides, you saw it had fourth and fifth normal forms up there; those have some additional capabilities that allow us to draw out specific business rules. But third normal form is generally a good place to stop, take a pause, and see whether there are conditions that mean the model needs to go further or not. That's right. So from a traditional perspective, it's where most people are taught, it's what most people know how to use, and it's a good solution for many types of problems. That's right. And if you hit the next slide there, we do have some pros and cons. As a pro, it's easily understood by business users, and again, this is where you're going to find most of the expertise out there today. There's reduced data redundancy, which is handy; theoretically, it takes up less space on disk. You've got nicely enforced referential integrity, so you can do things like the lookup tables, the reference tables someone mentioned in the Q&A. It's flexible, and that should be at the top of the list, in my opinion. And I think the biggest strength is that the technologies supporting this provide for some very rich querying: you can take a relational system, ask a lot of questions about your data, and get a lot of ad hoc answers. Moving on to some of the cons: joins can get quite expensive. We mentioned that earlier; we have this example here. Neo4j has it as an example on their website; they call it the join bomb. And you can just see how ugly that is. Consequently, it doesn't scale well. If it's a single-user, single-tenant application, it's probably going to be fine. But if you're talking about something like Facebook, something huge, then it's probably not going to do the job; it's going to choke. So when we're talking about modeling to third normal form, one of the things, again, that's important is to make people understand that you have to have some time in order to do this. You can't just put the model out there; going from your logical model to this, you really do need to analyze it.
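To put a small, runnable face on the join-cost point, here is a sketch of a normalized, record-label-flavored schema where one business question already takes three joins; both the schema and the question are made up for illustration.

```python
import sqlite3

# Sketch: in a fully normalized (3NF) schema, one business question can
# fan out into many joins, which is the "join bomb" idea above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE label  (label_id  INTEGER PRIMARY KEY, label_nm  TEXT);
CREATE TABLE artist (artist_id INTEGER PRIMARY KEY, artist_nm TEXT,
                     label_id  INTEGER REFERENCES label(label_id));
CREATE TABLE album  (album_id  INTEGER PRIMARY KEY, title     TEXT,
                     artist_id INTEGER REFERENCES artist(artist_id));
CREATE TABLE track  (track_id  INTEGER PRIMARY KEY, track_nm  TEXT,
                     album_id  INTEGER REFERENCES album(album_id));
""")

# "Which labels released a track called 'Join Bomb'?" Three joins before
# we can answer anything; at web scale this is where 3NF starts to hurt.
rows = conn.execute("""
    SELECT DISTINCT l.label_nm
      FROM label  l
      JOIN artist a  ON a.label_id   = l.label_id
      JOIN album  al ON al.artist_id = a.artist_id
      JOIN track  t  ON t.album_id   = al.album_id
     WHERE t.track_nm = 'Join Bomb'
""").fetchall()
print(rows)
```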
And again, if you've got a business example, as with the company Lewis and I worked for that I described a few minutes ago, of fielding the wrong model: they understood that six weeks of overtime, with everybody working weekends, was really a bad outcome for the organization, all to correct a problem that was easily definable. In fact, we had defined it up front. It's a little hard, because when we start talking about third normal form, you're putting more boxes on your chart, and it looks like more data. And you've got to be really, really clear to make sure people understand that more boxes doesn't mean more data. The boxes represent classes of items, and we're breaking things down to a different level of granularity, which gives us the ability to end up with less data overall. Each table we have out there in the system should represent one fact, and one fact only. Consequently, there's not a lot of redundancy, and there are plenty of pros to this, absolutely. However, as businesses evolve and as the web continues to grow, we are seeing that it doesn't necessarily meet every business need the way it used to, or at least used to come close to. That's right. So we'll go ahead and move on to the next one, which is probably something else you're familiar with; maybe not everybody, but certainly anyone working in the industry or doing any kind of analytics. It's the concept of a star schema. This is comprised of fact tables, which centralize quantitative data, surrounded by any number of adjoining dimension tables. Those dimensions change rarely, and they can be sliced and diced on when you're doing analytics. As I mentioned, the star schema is absolutely optimized for business reporting. So, a couple of use cases: online analytical processing, business intelligence. Anything you want to add to this one? Really, just that when people look at star schemas, we try to speak to them about n-dimensional problems. You're showing a three-dimensional problem here; in many instances, it's 12 dimensions that are required. But then, as you start to add that complexity, you get back into your join bomb. Again, I like that one particularly; you're probably going to write a song about that. It's funny. But this, too, is the important part: if you let the users specify what they want, they'll say, wow, I'd like one of these and one of these and one of these. You've had to deal with those requests before. The answer is yes, we can build it, but it will cost you something. So you could end up with a billion joins going on out there; by the way, star schemas tend to avoid a lot of that. What you're trying to do is pre-structure and anticipate what the queries are going to be, but you can also prioritize them, so that while everybody may say, oh, I've got to have all these dimensions, if you actually build the model and watch it for 12 weeks, you may discover they're only using a little bit of the functionality. That's right. Then you can relax some of those constraints as you prototype. That's right. And that's exactly why data marts came about. These are marts specifically built to answer specific questions. So if we go to the next page, we'll talk about some of the pros and cons. The pro is it's a fairly simple design. Now, I should have put an asterisk on that.
I do find that sometimes people have a little trouble wrapping their heads around a star schema, but typically it's fairly well understood. Very fast queries. Most major database management systems are optimized for star schema design, so you're seeing a lot of support out of the box in the DBMSs people already have in place. And it's a well-established pattern, with a lot of people familiar with it, which is very helpful. If you throw something brand new at somebody, it's going to take them a little bit of time to become comfortable with it. That's absolutely right. Some of the cons: the questions must be built into the design. And again, I wanted to put an asterisk on this one; you can also see it as a pro if you really want to contain and know exactly what your data is being used for. But it's a bit of a lack of flexibility; the questions are, by design, built in. Data marts are often centered on one fact table. Again, that's kind of a weak con, but if you want to combine them, that's a very different type of design, and it can come with some integration issues, although you can just build another star schema for that. But again, it's a bit of a lack of flexibility; these are solving very specific problems. So those are the first two. What's the need for the third method coming up? That's right. So this is one we're actually pretty fond of at Data Blueprint; the entire team is certified in data vault methodologies. That's right. The data vault is really designed to facilitate long-term historical storage, with a focus on ease of implementation, and we'll talk a little more about that. The idea is that these can be rapidly stood up, because they don't really care if the data structure changes much, and I'll explain that in a second. They retain data lineage information. That sounds like nirvana right there. Yeah, I know. It's not all milk and honey, but it's pretty good. The concept behind it is "all the data, all the time." And it's a bit of a hybrid of the Inmon and Kimball approaches. I believe Dan Linstedt sometimes, maybe jokingly or not so jokingly, refers to it as third-and-a-half normal form. So it's a little differently normalized. And here are the basics of it, and again, we're only touching the surface here, so if this is something new for you, please don't hesitate to reach out to me, or we can put another webinar together on this topic, because data vault is pretty interesting. The main concept is that it's comprised of these hubs. Hubs contain a list of business keys that do not change very often; you see here the example of a customer and an order. Those are hubs. Then you have your links, which are associations between the hubs. You're seeing right there that the customer has an order; that's what's relating them. And then you're seeing satellites, and those satellites are just descriptive attributes associated with hubs and links. The satellites are interesting because they can basically change as much as you want, since they always relate back to a hub: you can add new fields, you can remove old ones, and you'll always have the history. I'll answer this question really quickly, because I did say it: you can reach out to us. You can reach me on Twitter at sjmclaughlin; my name's hard to spell, but you can find it at the beginning of the presentation. You can reach me by email at smclaughlin, no J on that one,
at datablueprint.com. Yeah, that's it. No, I'm not going to make an alias for my beautiful last name; it would be too long. So, some pros and cons of the data vault. It was a little hard for me to find the cons, just because I'm really smitten by data vault. But in fairness: it's got simple integration. Because of the all-the-data-all-the-time approach, it's really quite simple to integrate. You can house an immense amount of data with excellent performance. That's a little under the hood, a little inside baseball. And it's got your full data lineage, which can be really important for auditability and things like that. One con is that, because it's so simple to integrate, the complication is really pushed to the back end, so it is going to require some ETL work to get what you want back out of it. It can also be difficult to set up; there just isn't a lot of expertise out there among data workers yet. And I think that's a con that holds through the rest of these examples, because we're going to talk at a high level about some NoSQL solutions. And there's not a lot of widespread support from ETL tools. That's really changing now; I think we're seeing newer and newer ETL tools supporting this out of the box. But I think it was worth mentioning at least. One of the things I like to think about is the use case for data vaults: an organization that is subject to a series of regulatory requirements. Regulations change, and you don't want to go back and remodel the data warehouse to come up to speed with all the new regulations. So what you can essentially do is freeze time at certain points, and you won't have to go back, but you can still compare apples to apples. That's right. The modeling method works along those lines. So let's move on to NoSQL. We'll talk a little about some of the hype around it. And again, one of the things I like to introduce to groups, if you haven't seen it, is Gartner's five-phase hype cycle methodology. It starts off in the lower left-hand corner there, with the little green ball, at the technology trigger. Cool! Somebody says, you know, some of this data is more important than other data, right? So we get this great idea, and of course that gets us up to the peak of inflated expectations, where the hype outruns the actual business value, which means the next thing that happens after launch is that you drop into the trough of disillusionment. Oh, it sucks, right? Well, it neither solves everything nor sucks; we really need to find out where it actually lands. We do that by climbing the slope of enlightenment and eventually landing on the plateau of productivity. So again, a nice little description. I just wanted to show you that because you have to see where big data falls on that curve. Now, this is as of a couple of years back, but it's still fairly relevant. Text analytics, for example, has nowhere to go but up. If you're doing text analytics, that's a great place to be, because it's already been through the "wow, it's going to solve everything" and the "it really sucks" phases; now we're getting back to what it really can do. Similarly, if you're in social network analysis, the implication here is that you have a rough ride ahead of you. On the other hand, predictive analytics and web analytics are well along, so those pieces are relatively mature.
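Before we leave the modeling methods entirely, here is a minimal sketch of the data vault shapes described above: hubs for stable business keys, a link for the association, and a satellite for the changeable descriptive attributes. The names are hypothetical; real data vault designs also carry a load timestamp and record source on every row, which is where the lineage comes from.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hubs: stable business keys only.
CREATE TABLE hub_customer (customer_key TEXT PRIMARY KEY,
                           load_ts TEXT, record_src TEXT);
CREATE TABLE hub_order    (order_key TEXT PRIMARY KEY,
                           load_ts TEXT, record_src TEXT);

-- Link: the association between hubs (customer placed order).
CREATE TABLE lnk_customer_order (
    customer_key TEXT REFERENCES hub_customer(customer_key),
    order_key    TEXT REFERENCES hub_order(order_key),
    load_ts TEXT, record_src TEXT,
    PRIMARY KEY (customer_key, order_key)
);

-- Satellite: descriptive attributes, versioned by load timestamp, so
-- attributes can come and go without restructuring hubs or links.
CREATE TABLE sat_customer (
    customer_key TEXT REFERENCES hub_customer(customer_key),
    load_ts      TEXT,
    first_nm     TEXT, last_nm TEXT, phone_nbr TEXT,
    record_src   TEXT,
    PRIMARY KEY (customer_key, load_ts)
);
""")
```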
Now, as you look at that, one of the things to keep in mind is that Gartner reminds everybody that focusing on the use of big data techniques is not a substitute for the fundamentals of information management. I love that they include that. Yes, that's right. Similarly, if we look at July of 2012 and what's going on there, we can see that big data was two to five years away from the plateau as of July 2012. Interestingly, just 12 months later, in July of 2013, big data was five to ten years away from the plateau. This is not to say that this stuff doesn't work. There are plenty of companies achieving very good business value by incorporating big data technologies into their architectural platforms. But how you're going to do that, and how you're going to cut through the hype cycle, is something you all are best placed to work out. You know your business. You know these techniques. You can look at this and say, that tool will solve this problem, and that tool will solve that particular problem. So that's a specific example of this. I just want to quickly introduce NoSQL, and if any of you out there are unfamiliar with it, I do encourage you to take away from this a desire to read more about it. There was a great opportunity recently at the NoSQL Now conference in San Jose, which I got the chance to attend; it was really quite interesting to see how this technology is maturing. Now, some of you may have heard the name NoSQL and thought, what does that really mean? It's a bit of a misnomer that has been retroactively changed to mean "not only SQL," since a lot of these systems do now support the SQL query language. In fact, it's a catch-all term for a lot of fairly different yet similar technologies that all share something: they're mostly non-relational, although even that is a bit of a misnomer. It's muddy water to navigate, and again, feel free to reach out to me if you're interested, or Google is certainly your friend on this; any very technical definition tends to get kind of mushy, right? In some cases, I would say so, yeah. Okay, so we're going to talk about document and key-value stores here, and while they are different, they're similar enough that I thought it was worth explaining them together. The key concept behind both is that they're scalable, thanks to distributed hash tables. With a relational system, because you've got your data stored in all these different tables, it's very difficult to do clustering. Now, I should add a quick caveat here: almost every single technology vendor is working to address these cons. So when I say this, I'm speaking very much in theory, because it will probably have changed by the end of the webinar. Exactly, right. Many NoSQL providers are attempting to account for these cons, and for some of the cons you'll see me mention, there are specific technologies that work to address them. I'm talking very conceptually here. The idea is that the data can be distributed nicely, especially at web scale, for massive multi-user applications with many thousands of users, something like Facebook. You can have your data housed across different servers, different geographical locations, whatever you may need. Now, the other key feature is the flexible, largely schema-less design.
Again, that's a bit of a misnomer, where really much of the schema is handled by your application logic; the data doesn't necessarily have to rigidly adhere to something you've defined up front. So let's see this example on the left there. That's really what a document might look like. That's JSON; I've heard it pronounced a couple of ways, and I'm not sure which is correct. It's a JSON document that basically has a key, and then a blob of document information. It looks a lot like XML, if you're unfamiliar with JSON. And you basically have all of these attributes grouped together now. So you might see Peter, right? He might have his address as part of his own document; it's a very de-normalized structure. What that means is very rapid access, and it allows us to store documents in different locations because of that distributed hash table. So you're seeing the document concept on the left, and on the right we have key-value, where the value might be something much smaller than a document. I really think they're very similar in concept, but typically a key-value pair will have something smaller as the value. It's basically just a two-column table, right? Your key could be something like username_firstname, username_lastname. We might have peter_aiken_firstname, and that brings up "Peter," and then peter_aiken_lastname brings up "Aiken," right? So I hope that makes sense. I realize that for some of you this may be very new information, and there's a lot out there and we have a lot to cover, so I'm going to move on. But quickly, the use cases, as we did talk about: applications with many users and many writes. Very good for agile development, things like games or mobile apps, because they're very empowering to the developer; with these schema-less designs, the schema can basically grow as needed. Okay, so the document and key-value pros and cons. Again, empowering the developer, and I put an asterisk on that one, because that's not always a good thing. No offense to developers out there. Very scalable. Well, a good developer can achieve miracles. That's right, yeah, exactly. It's got high availability, and it's economically viable because you're scaling out: you can use a lot of commodity processors, as opposed to scaling up, where you're buying the bigger and better box, and most people can't keep doing that. You're talking specifically about parallelism. Absolutely. Yep, and sharding and auto-sharding and all those fun things. As an example, we have a Hadoop cluster here in the office that runs on a bunch of old laptops we scrounged from around the building. Pretty interesting, really, and it's got quite good performance as well. The cons: it's got some poor ad hoc query and analysis capabilities. It's very difficult, when you look at that document structure, to say "find all the people who live in California." Yeah. It's true you can index a particular segment of the document, but it can really be very, very complicated. There's a bit of a lack of maturity, though we're seeing that change; every day that goes by, more people are gaining expertise and businesses are gaining confidence in it. And in many cases, it's eventually consistent. Without going too far into that concept: basically, if I write something and Peter immediately reads what I just wrote, he may not see my change.
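To ground the document and key-value shapes, and that last point about consistency, here is a tiny sketch using plain Python structures, with no particular vendor assumed.

```python
import json

# Document store sketch: one key maps to a self-contained, denormalized
# blob, so everything about Peter travels together.
documents = {
    "peter": json.dumps({
        "first_name": "Peter",
        "last_name": "Aiken",
        "addresses": [{"street": "123 Main St", "city": "Richmond",
                       "state": "VA"}],
    })
}

# Key-value sketch: the same facts flattened into a two-column shape,
# with compound keys doing the work the schema used to do.
kv = {
    "peter_aiken_firstname": "Peter",
    "peter_aiken_lastname": "Aiken",
}

print(json.loads(documents["peter"])["first_name"])  # 'Peter'
print(kv["peter_aiken_lastname"])                    # 'Aiken'

# With replicas of these stores spread across servers, a read that hits
# a replica before a write has propagated returns the old value; that is
# eventual consistency.
```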
And a great example of that might be a Facebook comment, right? It doesn't need to be immediate. If I just posted on Megan's Facebook wall and she doesn't see it for five or six seconds, that's okay; it's not mission critical. See, eventual consistency gets into the difference between ACID and BASE. That's exactly right. That's beyond what we're going to do here, but those are some keywords to look up. Yep. I started to cover that in this presentation, but I took it out just for time. So here are a couple of other NoSQL solutions, and again, I'm only skimming here. We have an RDF triple store. These are purpose-built to store triples, and I put, as an example, something like "Bob likes football." There's a query language specifically built for this called SPARQL, which is really fantastic; it's got some amazing capabilities. It's one of the pillars of the semantic web, right? The idea that the web is this massive pile of unstructured data, and if we could harness it and query it really intelligently, that's unlimited power at our fingertips. Read the right books, and I think it's there already. Exactly, right. The key is to apply some of these techniques to it. Right. Then we have the graph. It's a structure composed of nodes, edges, and properties, and it's actually pretty similar to a triple store. It's focused on the interconnections between entities and fast queries across associated data. This is something you might see in a social network, right? Relationships between people, and how you can traverse those relationships. So you might see: Stephen is a friend of Peter, Peter is a friend of Meg; find all the friends of Meg, then find their friends of friends, right? It's easy to traverse that with this data structure. With something like a relational system, that might be quite complicated in terms of pulling all the joins together, especially if you have millions of records. Whoops, sorry, wrong slide. So next, we're going to talk about column family, and for those of you who are most familiar with relational, third normal form: the best way I've had column family described to me, when I was having a lot of trouble wrapping my head around it, is to think of it like a view, where you've taken multiple tables and combined them together. Basically, you can have unlimited columns on a row. Columns are stored individually, but they're collected by family. So as an example of that, you might see Peter again (I like to use you as an example), and then you might see a column family of address, with the columns city, state, and zip, and a column family of car, with the attributes for your car under that. What that allows us to do is have unlimited columns for any one record, while you can query just a specific column family and still get pretty rich queries while excluding the other data. So we can have nearly unlimited numbers of columns without causing those expensive queries. Pretty interesting stuff. And I just wanted to show you some examples here. There's an RDF triple store over there: subject, predicate, object; Bob likes hamburgers. And you've got a graph, like what I was talking about: that dude is friends with that other dude, and that dude likes sushi, and that place serves sushi.
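Here is a minimal sketch of that friends-of-friends traversal over a plain Python adjacency structure; the data is hypothetical, and a real graph database such as Neo4j does this kind of hop natively.

```python
# Tiny property-graph sketch: people as nodes, friendships as edges,
# "likes" as node properties.
friends = {
    "stephen": ["peter", "meg"],
    "peter":   ["stephen", "meg"],
    "meg":     ["stephen", "peter", "lewis"],
    "lewis":   ["meg"],
}
likes = {"stephen": ["sushi"], "lewis": ["sushi"]}

def friends_of_friends(person):
    """Everyone exactly two hops away: not the person, not a direct friend."""
    direct = set(friends.get(person, []))
    result = set()
    for friend in direct:
        for fof in friends.get(friend, []):
            if fof != person and fof not in direct:
                result.add(fof)
    return result

# Which friends-of-friends of Stephen like sushi? One traversal, no joins.
print({p for p in friends_of_friends("stephen")
       if "sushi" in likes.get(p, [])})   # -> {'lewis'}
```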
So, just five minutes left — let's talk about how all this is changing. Absolutely. The key here is that we want you to understand there are lots and lots of starting points to work from. We use design patterns as a way of helping people understand this. When people ask, "why should I invest in these patterns?" you can ask them: have you ever noticed that in large office buildings, the restrooms are generally in the same place in every building? It's a visual clue — people know where to go — but in addition to that, it's also cheaper, because if we put the infrastructure in one corner of the building, we don't need to run it everywhere. Certainly it would be convenient to have a loo in everybody's office, but probably a little bit impractical. And of course we don't just stop with the loo; we repeat that for electrical, wiring, floor plans, et cetera. Well, the same kinds of architectural patterns exist for data models. And I just wanted to point out here that David Marco, myself, David Hay, and a lot of other people have written on these patterns. Some go quite far — David Marco's book includes a disc with the patterns on it, so you can load them directly into Erwin. And you've come up with some rules of thumb here that I found very, very useful. Yeah, that's right — I found these on some blog ages ago, I'm sorry, I can't remember where, but they really stuck with me. And they also ring true in what you see. That's right, absolutely. So we find that roughly a third of a data model contains fields common to all businesses, a third contains fields common to that industry, and a third is really specific to the organization. I think that's pretty interesting to absorb. We'd love to get some empirical data on that — that would require a different webinar. So patterns should theoretically provide an organization with a baseline to quickly develop data infrastructure. Are we seeing that in practice? Sometimes — there are a lot of off-the-shelf ERP solutions that certainly do meet that. And theoretically it should be easy by now: so many people have built so many solutions out there, if only we could harness that. But off-the-shelf solutions may require in-depth customization, and in many cases we see that off-the-shelf winds up being just as costly, in time or money, as building something from the ground up. And that's really a bit of a shame. It's kind of like the outsourcing thing: everybody thought outsourcing was going to solve all of our problems, and we discovered it wasn't going to solve all of them, so now we're back to doing our own research on what's right for us. So that's what we want you to take away from this. Yes. Now we'll talk about data as a service.
This is based on the concept that data can be provided on demand to any user, regardless of geographical or organizational separation. And I think a good example is what I was talking about earlier with the USPS data, right? Theoretically you'd have address validation available right where the mail is addressed. That's not a perfect example, but basically you have a service at your fingertips that provides exactly the data you're looking for. Consequently — especially with schemaless data — you can enforce what I've been calling a "post schema" on your data: you can shape the API, shape the way the data is served or the way it's subscribed to, to make it fit the goals you need it to fit. It's as flexible or inflexible as necessary. And when you come up with a data logistics network — that's actually a topic where we could do another entire webinar, just on being able to add schemas to things after piles of data have already been created. That's right. Adding structured information allows us to obtain exactly what we want, when we want it, and APIs allow applications to serve updates to external sources in a structured way. Again, this term: post schema.
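"Post schema" is the speakers' own coinage, so here is one way to picture it — a minimal sketch, entirely ours, in which a schema is imposed at the API layer when records are served rather than when they are stored; the field names, records, and defaults are invented:

```python
# Schemaless records as they might sit in a document store: shapes vary.
raw_records = [
    {"name": "Peter", "city": "Richmond", "state": "VA"},
    {"name": "Megan", "city": "Richmond"},                # missing state
    {"name": "Stephen", "state": "VA", "shoe_size": 11},  # extra field
]

# The "post schema": applied when data is served through the API,
# not enforced when data is written.
POST_SCHEMA = {"name": str, "city": str, "state": str}

def serve(record):
    """Shape a raw record to the schema: keep known fields, coerce types,
    and fill gaps with None so every consumer sees the same structure."""
    return {field: ftype(record[field]) if field in record else None
            for field, ftype in POST_SCHEMA.items()}

for r in raw_records:
    print(serve(r))  # every row comes out with the same three fields
```

Writes stay schemaless; reads come out uniform — which is the "flexible and inflexible as necessary" idea in miniature.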
We did talk quickly about scaling out, not up. I'm not going to cover it again in depth, but the idea is that by having this distributed hash table, we can have our data housed in multiple disparate locations, which means you can scale out rather than up. I encourage you to read more about it, because we are just about out of time here. And we've got that term auto-sharding, which — I'll be honest, I'm a grown man, but I always grin when I hear it. That's okay, right? So we're right around the top of the hour here. Again, what we've looked at is really the relationship between these things — the different models and the different uses — and the fact that all of this is changing. We've got a couple of conclusions we want to finish with before we get to the Q&A section. That's right. The first, which you've heard us hammer home, is that it's very important to get this modeling right. But you can't leave the explanation there — you've got to say what the impact on the business is, so that people understand. Again, in the example we went into a little in depth: they understood how much overtime they paid to staff that was working very hard, and that's fine, but we never actually got to what the impact on the business was. We were able to make a very clear local case. For you all, you've got to be able to say to somebody — think of it like an elevator speech — "that problem last year happened because the data model was wrong," or whatever it is. Right. And getting it right is hugely dependent on the business case, the maturity of the organization, the flexibility for future growth, and so much more. One thing that I really wanted to touch on, to drive this NoSQL stuff home: a lot of it is very application-centric. It's really letting the application drive the data model, for better or for worse. Clearly, there are so many technologies and ideas out there now to solve a number of problems — they're really solving very specific problems, and combining them in very clever ways can solve a huge variety of business problems. And your organizations are dependent on your expertise to help them solve those specific problems. The last thing, of course: we don't want you to model without understanding what the architecture is conceptually. If you don't have that conceptual framework, modeling can become an academic exercise, or you can achieve a correct solution to an incorrect problem definition. That's right. So we're at the top of the hour here — Megan, we'll turn it back over to you, and let's see what sort of questions we have. We've got a bunch of them out there. Thanks, guys — that was an awesome presentation. Now it's time for Q&A, time for you all to ask your questions. Just click on the Q&A window at the top of your screen, and you should be able to submit your questions through it. I'll give it a few more seconds to give you a little bit of time. To get started, the first question is: how do you respond to developers who do not want to reuse corporate reference data codes and say it's easy to create an XML mapping file? That's a good question. I guess you can't whip them, right? No, I'm just kidding. That's a tough one, and I'm going to cop out a bit and say it really depends on the individual application. But if you're really holding your organization to informational excellence, I see no reason why they should not conform to what the rest of the organization is doing. If there's a compelling business reason why they should use XML mappings and that kind of thing themselves, then great; but if you've got great DBAs and data stewards and people like that, working around them can cause complications. I realize it's a very generic response, but... I'm willing to bet that most organizations, particularly as they get to be larger, are going to discover that economies of scale become more and more important. A local work group may be able to make a case for having its own specific set of things, but when you look globally across the enterprise — even if you don't currently have a need for sharing — it's probably pretty easy to forecast that there will eventually be one of those needs. That's right. That's one of the things Stephen and I see going into organizations: they'll be thinking, "here's our three-year plan, this is where we're going," and we'll be able to look ten years out and say, yep, after your three-year plan you'll need another three-year plan, and by that point you're actually starting to approach some of these enterprise applications. Now, the most important thing, though, is: let's not devolve this question into a war about which side is right. Let's get the facts on the table. How many XML transformations are we talking about? How often is this going to run? When you start to put those numbers on it, you may realize that this query is going to run eight billion times a day. Right — okay, that's something we probably don't want to have running that often if we don't need it to. Exactly right. Yeah, so again, I think the honest response is: it really depends. Right. But again, you all are the people closest to the ground, who have access to the facts and figures. One of the more enjoyable parts of what we do with all this is that sometimes you all will follow up with specific questions, and we'll actually poke and prod and find out that the answer might not be the intuitive one.
Maybe it turns out it's only going to happen four or five times, and the rest of it is all packages where we don't have access to a lot of the internals — so, yeah, let's make it a high-level XML transformation under those circumstances. That seems to be the best answer for that group. That's right. Absolutely. But again, let's do it based on facts, not on a quasi-religion — though that's probably not the right analogy for it. I mean, I come from a development background — again, no offense to developers out there — but I'm loath to let developers make business decisions. They don't let me make development decisions, that's for sure. All right, great. The next question is: difficulty in integrating data is the big challenge now in the biomedical field. How do we address this problem at the data modeling stage, when designing a database, when the owner of the database doesn't realize the future integration problem? That's a good question, and it's actually something we see quite a bit. Just last week I was at a biomedical lunch and talked with someone extensively about exactly these kinds of issues. What you're basically saying is that you're not getting buy-in on getting the model designed correctly up front — is that right? Can you read the last half of the question? Sure, let me get it back up. Sorry about that. No worries. "How can we address this problem at the data modeling stage when the owner of the database doesn't realize the future integration problem?" Yeah, okay — I heard it right. Unfortunately, that sounds like it's going to require some deft, maybe political, maneuvering. It's really a business issue before it's even a data issue, I think — or it needs to be made a business issue. The powers that be need to understand why this is important. And the key is to make sure you present it to them in meaningful ways. That's right. When I say meaningful, I mean something that matters to the people in the corner office. If the corner office is concerned with sales, then show them that integrating this data will result in more sales, and that the lack of integration will be a barrier to improving the sales numbers. Now, you may not get "sales" out of biomedical — although there are certainly some parts of it — maybe instead it's matches in diseases or matches in gene sequences. Lots of different things happen there, but it's incumbent on all of you to say this is important, and here is the business reason it's important. All right. The next question is: given new platforms, what are the skill sets that a data modeler really needs today? And Audrey — haven't seen you out there for a while. That's a great question too. At this point, right, we know the old standbys: fully understanding your standard relational systems is always going to do you right, and I think there are plenty of systems that are just never going to be replaced and never not going to be relational. As for these NoSQL technologies and the emerging trends in data modeling — they're so new that, at this point, I've kind of made the decision for myself to take a breadth-first approach.
I'm really trying to just keep informed about all the differences without going too far down the rabbit hole on any one of them, and see where that lands. I'm thinking in five or six years we'll maybe see three or four major ones really come to the front, and that's when it's time to attack it with more purpose. So, one of the things I do when I'm looking at technology forecasting — which is part of what Audrey's question relates to here — is look back over the past five years in data modeling: we've probably seen more changes in the past five years than we saw in the twenty years before that. That's exactly right. So we are increasing the options, increasing the techniques, increasing the types of problems that we're adapting to. And that says we need people who can scale into these things, because the college and university system is not putting out well-qualified data modelers who are immediately useful to the business. I'm not saying anything bad about what's going on in colleges and universities, but when you're dealing with toy problems, as we do in the university, it's very different from walking in and working with something that has to do with bone marrow sequencing. That's right. There's just no experience you're going to get in college that's going to prepare you for that — or, for example, for an organization with the kind of volume some of these organizations are dealing with. When you run a query literally billions of times, even a fraction of a second makes a lot of difference to the organization as it adds up. Especially for the business space, right? And part of what I'm saying is that it's important to know about all the different tools, so that you know you're using the right tool for the right job rather than reaching for one of your old standbys. I've really been relating this to the coding revolution, right? Twenty years ago, there weren't that many programming languages, and certainly not that many paradigm shifts — there were, of course, a couple of examples — but now there are so many different languages and programming paradigms out there that it has led to increased specialization and a little bit of fragmentation. I think we're going to see a bit of the same in data modeling. Very nice. Good to know. The next one is: could you go over the difference between a data warehouse and a data vault? Yes — briefly, right? A data warehouse is sort of purpose-built to have a single version of the truth, I think, whereas I would say a data vault is more about capturing all the data, all the time, and being able to get to the truth later. That's a high-level look — I'm not sure if you're looking for something more technical than that, but that's the way I always frame it: your data warehouse is going to be feeding your individual data marts, and those are going to be asking very specific questions, so your data warehouse is going to be gathering very specific data. Your data vault is more about "just give it all to me and we'll figure it out later" — give me all the data now and we'll let God sort it out later. A high-level description of those two. Good question — thank you for that. How do you address the performance issue where there are many-to-many relationships between entities? Well, that's kind of exactly what NoSQL is all about, I think, in many cases. But here are two approaches. The first is a structured approach, where you model it: you come up with the intersecting entities, build a set of structures, and everything is just right there. That presumes you know the most about your data — if you know exactly what your data looks like, you can build a solution relatively easily that way. That's right. The second comes up because, more often today, the data is not as well known, and that's pushing people into other techniques. And as I mentioned earlier, plenty of technologies are springing forth to try to address these needs — even things like materialized views and ideas like that can help with this. Certainly it's a problem people deal with every single day. You can denormalize in the cases where it makes sense to denormalize, or you can beef up your infrastructure, which is a bitter pill to swallow. Hardware vendors like it, though. Yeah, that's true, that's true.
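To illustrate the first, structured approach, here is a minimal sqlite3 sketch — our own example, not the presenters' — of a many-to-many relationship resolved through an intersecting (junction) entity; all table and column names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person  (person_id  INTEGER PRIMARY KEY, name  TEXT);
    CREATE TABLE project (project_id INTEGER PRIMARY KEY, title TEXT);

    -- The intersecting entity: each row is one (person, project) pairing,
    -- turning one many-to-many into two one-to-many relationships.
    CREATE TABLE person_project (
        person_id  INTEGER REFERENCES person(person_id),
        project_id INTEGER REFERENCES project(project_id),
        PRIMARY KEY (person_id, project_id)
    );

    INSERT INTO person  VALUES (1, 'Peter'), (2, 'Stephen');
    INSERT INTO project VALUES (10, 'Data Strategy'), (20, 'Modeling Webinar');
    INSERT INTO person_project VALUES (1, 10), (1, 20), (2, 20);
""")

# "Which projects is Peter on?" -- two joins through the junction table.
rows = conn.execute("""
    SELECT pr.title
    FROM person p
    JOIN person_project pp ON pp.person_id  = p.person_id
    JOIN project pr        ON pr.project_id = pp.project_id
    WHERE p.name = 'Peter'
    ORDER BY pr.title
""").fetchall()
print(rows)  # [('Data Strategy',), ('Modeling Webinar',)]
```

The junction table is clean and enforceable, but every question has to walk those extra joins — which, at the millions-of-records scale mentioned above, is exactly the expense driving people toward the other techniques.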
All right. The next question says: the schema of a document or key-value data store is handled in the application, so where does the logical modeling happen? That's a fantastic question. You know, I'm not sure I have a great answer for that. Especially in a packaged application environment, it will probably live at the application level. Yeah — ideally it should be handled in some metadata surrounding the application, right? There should be metadata around it so that there's some transparency and buy-in. Realistically, though, it really is happening at the application level. And the problem with that is that when you're in a largely packaged environment, your data model is the one thing that ties all the application packages together. So I'm switching to a different presentation on the fly right now, just to show you a slide I developed that addresses some of this. The key here is that the reason vendors can get these software packages to work on a variety of databases is that they're using the database for just table handling, which means important functions like referential integrity get implemented in the software. First of all, that means you're underutilizing your database. And secondly, the people who develop the software implementations of those database functions are software developers — they tend not to be data people, so they do a good job, not a great job. And of course, then you ask for the package to be customized, and the person who does that work is a consultant who is going to leave the organization after they're done, and you're left without a lot of the context that went into it. So — a short piece on that; we'll remember to include that one slide with the rest of the presentation so everybody can get to it. Okay. Next question: do you think data modeling will be used more in data virtualization than in physical integration scenarios going forward? That's a great question. I wish I had a crystal ball on that. I feel like — and maybe this is purely anecdotal, and I realize that offering anecdotes to a bunch of data nerds is probably not the best approach — Oh, they love it. — I used to hear about data virtualization all the time, and I personally haven't come across it much recently in my professional career. So I'm not sure that I can really answer that one.
I tend to say yes, though, because the one thing that is happening out there is that data storage technologies have now been put on Moore's law as well, which means the price of these things is going to halve every two years while the capacity doubles. And if we can do that, it means we can do more with virtualization, so I'd expect to see more growth in that area just based on that particular economics. All right, the next question: most GIS data models are in third normal form, but the same data is used for viewing at large scale by a viewing application. I guess there's more — a follow-up, maybe. Yeah. I assume that's causing performance issues — is that what's being asked? Possibly. That could be it. Yeah, so I can see that: you're taking this massive amount of normalized data and needing to join it all back together, which is very expensive, and it probably has some performance issues. I hope that's what you were asking. Hardware is a good lever there, right? Right. Yes. The next question: how do you deal with DBAs — I'm going to guess focused on performance — who are resistant to normalized design, the use of cascade constraints, foreign keys, cascade deletes, et cetera, because they think it's going to degrade performance? Is performance the only concern? I think the answer to that is no. Performance is probably not the only concern, unless it's the defining concern of that business case — which leads us to a webinar we did a couple of months back on data strategy. The key piece of that is: what does your data strategy involve? If your data strategy is focused on performance, as Stephen said, then absolutely everything else focuses on performance. We know of a couple of organizations where that really is paramount, and we're really glad those organizations work that way — but it's probably not true for the majority of organizations. That's what I'm thinking, too. The answer depends on a variety of factors, so it's hard to give a general response; it's very specific to your individual issues and the business case for them. The DBA's job may be to argue for performance, but there needs to be something from the other end saying it's not always about performance. It's great that you're there and that you're an advocate for performance — that is important — but at the cost of what? At the cost of having duplicate data, or of compromising the integrity of important data? That can create work for a ton of their buddies later on in some data cleanup operation that becomes highly profitable. If saving a couple of milliseconds on your writes or your reads downstream causes a lot of trouble integrating data to get business insights, then it's definitely not worth it. On the other hand, if it's the difference between your customers using your application or not, then yeah, it's probably worth it.
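For readers who haven't met them, here is a minimal sqlite3 sketch — our illustration, not from the webinar — of the kind of constraint under debate: a foreign key with a cascading delete. Note that SQLite only enforces foreign keys when the pragma is switched on; the table names and data are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
                    REFERENCES customer(customer_id) ON DELETE CASCADE,
        total       REAL
    );
    INSERT INTO customer VALUES (1, 'Acme');
    INSERT INTO orders   VALUES (100, 1, 49.95), (101, 1, 12.50);
""")

# Deleting the parent row cascades to the children: integrity is kept
# by the database itself, at the cost of extra work on every delete.
conn.execute("DELETE FROM customer WHERE customer_id = 1")
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone())  # (0,)
```

The cascade keeps parent and child rows consistent automatically — the integrity the modeler wants — at the price of extra work on every delete, which is exactly the trade-off the performance-minded DBA is weighing.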
The next question: what is the difference between a data vault and a data lake? So, a data vault, as Stephen described, is a class of modeling techniques you can put in place to deal with rather complicated, evolving systems in circumstances where a data warehouse might not be the best approach. People use the phrase "data lake" to talk about the overall data environment in the organization, and the way we like to think of it, it's typically tied to a data quality kind of process: people think, "I've got all this data in a data lake and I've got to go in and clean it up." Well, if we're going to clean up the water in the lake, we probably ought to find the source of the pollution — because while it's a good idea to clean up the water in the lake, it's also a good idea to find the sources of the pollution and eliminate them, so we don't have to keep cleaning the lake. Otherwise it just becomes guaranteed work for somebody, and that's a wonderful thing for them, but... Yeah. All right, next question: what advice do you have for a recent college grad trying to break into the data modeling field, assuming they have a degree in the IT field with database management classes and data modeling classes? Networking. Yeah — it's all about networking. And at this point, I think you can really differentiate yourself by being able to speak the business's language. If you really focus on the IT side of things, that's great, and there are a lot of people with those skills; but if you can bridge the gap between that IT knowledge and tie it back to a business case, I think that can be a big differentiator. I would also say it's unfortunate that you're already a recent college grad, because I would have had you take your last course in college as an independent study and do a project for a company you might want to go work for — then you might have some opportunities there. Again, there are a number of conferences. One of our favorites is coming up: Enterprise Data World, in Washington, D.C., at the end of March and beginning of April. In addition to the cherry blossoms being wonderful that time of year, we're going to have about 1,000 data modeling people in one place at one point in time. It's a great place to network and to learn in some of the additional sessions there. And if you have trouble getting introductions, let us know — we know a bunch of different people. The next question: imagine that there is an ERP implementation project, and therefore there is no need for a physical model. Do we still need a modeling tool and effort if the schemas and tables will be created by the ERP tools themselves? So that's clearly the cart before the horse, in this case. I'd hope the organization had done it this way: they would have a logical model of the business requirements they wanted the ERP to solve, and they would compare the logical models of the various ERP options against the model they have. Let me give you a very specific example of that, which we ran into in the U.S. military. One of the primary relationships in an HR system is that a person can be related to an employee, and an employee can be related to a position. Most organizations implement person-to-employee as a one-to-one relationship. In the military, though, many of the men and women of the armed forces have to hold multiple jobs, so we have a requirement in the Defense Department that says a person can be multiple employees. Finding an ERP that supports multiple employees per person means we don't have to build the kind of workaround Stephen was describing before — and that's a lower total cost of ownership for everybody. So I want that modeling to be done to help select the package. Once you have the package, ask the vendor for those particular models — in many cases the vendor will give you the models, and you can use them to help implement the integration as well as the data transformation from the old package to the new one. So, the models again: not modeling for modeling's sake — what is the business problem we're trying to solve, and can modeling be useful for it? That's what we're trying to do.
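Here is a minimal sqlite3 sketch of the cardinality point in that military example — one person allowed to hold many employee records. It is our illustration of the modeling choice, not the Defense Department's actual schema; all names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name      TEXT
    );

    -- One person, many employee records: the FK lives on employee, and
    -- nothing limits a person to a single row here. The usual one-to-one
    -- design would instead declare person_id UNIQUE on this table.
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,
        person_id   INTEGER NOT NULL REFERENCES person(person_id),
        position    TEXT
    );

    INSERT INTO person   VALUES (1, 'A. Servicemember');
    INSERT INTO employee VALUES (10, 1, 'Logistics Officer'),
                                (11, 1, 'Training Instructor');
""")

# The same person holds two positions -- legal in this model by design.
print(conn.execute(
    "SELECT COUNT(*) FROM employee WHERE person_id = 1").fetchone())  # (2,)
```

One `UNIQUE` constraint is the entire difference between the two designs, which is why comparing a package's model against your logical requirements up front matters so much.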
And I think, also, just to follow up on that: we so often see people jumping into tools to solve problems, and we think selecting a tool should really come further down the line. That may not necessarily be applicable in this particular case, but it bears repeating: we come into organizations that say, "well, we have all these licenses for this tool and that tool, but we're still having these problems." You have to address root causes before you determine a tool. And technology is only one of the three legs we'd like people to stand on — the others are people and process. That's exactly right. The next question: what's the consensus on product tables? In that address table you showed earlier, you didn't have state as a separate entity; for example, in my experience a product table would have a separate entity for product category. Yeah, I agree with that, and I think I mentioned it mid-presentation — or someone else did. I agree that table wasn't perfectly normalized. I think referential integrity is hugely important, and it's one of the massive benefits of going with a relational system. So I agree with you absolutely: you can have people managing that data, ensuring those reference tables are up to date and accurately reflect the business concepts. In general, I am all for complete normalization, where it makes sense, and for having absolutely everything in a managed reference table — all we were doing today is showing some cases where that's not necessarily the right answer. Sure. Go ahead. Okay — there was also a question about the consensus between ER and triples. The key there, of course, is that if you look at the way open data and linked data are working in the real world, almost everything out there is being stored as a triple — an entity-attribute-value triple. And if you start to structure your data that way in general, you're able to adopt some of these techniques much more quickly; you don't have to restructure your data to get it into that environment. There are a couple of really, really good case studies I could point you to that talk specifically about how much that's worth. One of my favorites is a colleague of mine up in Fredericksburg who looked at a $30 million data warehouse being built for the federal government and showed it could be brought in for $300,000. Those are the kinds of numbers that people pay attention to. Great stuff. The next question: wouldn't it be great to have a registry containing all of the different models and schemas? If we could keep the vendors from suing us for putting their intellectual property out on the web, it would be a great idea. That's right. There are collections of these things, and DAMA does maintain some of them — you'll see a lot of them gathered in the DMBOK, which, again, underlies the certification that everybody on staff here holds. We do have to be careful, though.
You certainly would be able to do that internally, within your own organization, a lot more easily than we can do it externally, because some of these things are, in fact, proprietary to organizations, and they don't want them shared. Now, we're happy to sign NDAs and work within those constructs, but that also means we don't just take whatever they have and put it out on the web. That's just rude behavior. Great. The next question: would you explain the differences between column families and tuples? Sure. Well, they're not hugely dissimilar, other than in how they key and order their data. A column family is really — if you can imagine it like a view, thinking in third-normal-form terms: you're taking all these different tables and combining them into a view, right? That's really the main idea there — it can be a massive amount of data, a million different columns combined. Whereas a tuple — really, at its core, a column-family row is kind of a tuple already, because it's got this key and then the column families and values hanging off it. So I guess the real question is performance-wise, and I tend to think of tuples as, I don't know, really bite-sized, more ordered data. The key with column stores, though, is that they tend to be held in memory. So the performance of column stores tends to come not necessarily from columns being a faster data structure, but from column databases tending to have a lot more RAM behind them than traditional databases. Yeah — that's definitely a point we didn't touch on, but sure. And somebody has been asking about access to the slides after the session: just to let you know, they will be emailed out to you two business days after the webinar, along with a link to the recording and the answers to any questions that were asked. But it looks like that's all the questions we have for today. Thanks for all the wonderful questions, everyone, and thanks for an awesome presentation, Stephen and Peter. Next month we'll have a webinar on metadata strategies, so hopefully you'll be able to join us for that as well. As always, feel free to contact us if you have any questions. And thanks again to Data Diversity and Shannon for hosting us. Thanks, everyone, and have a great day. Thank you.