Hello and welcome. My name is Shannon Kempen. I'm the Chief Digital Manager for DataVersity. We'd like to thank you for joining today's DataVersity webinar, Data Modeling Fundamentals, sponsored today by Couchbase. It is the latest installment in the monthly series called DataEd Online with Dr. Peter Aiken, brought to you in partnership with Data Blueprint. Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. If you'd like to chat with us or with each other, we certainly encourage you to do so; click the chat icon in the upper right-hand corner for that feature. For questions, we will be collecting them via the Q&A in the bottom right-hand corner of your screen. Or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag DataEd. To answer the most commonly asked questions, as always, we will send a follow-up email to all registrants within two business days, containing links to the slides. And yes, we are recording, and we'll likewise send the link to the recording of this session, as well as any information requested throughout the webinar. And now let me turn it over to Anush for a quick word from our sponsor. Anush, hello and welcome.

Thanks, Shannon. I hope you can hear me okay.

You're a little quiet. You were louder earlier, but... Okay. How about now? A little better. Yes.

Well, thank you, Shannon. I'm pretty excited today. First of all, good morning, good afternoon, and, for folks who are joining from different continents, good evening. My name is Anush Sani, and I think the agenda is pretty short for me today, so I will go very quickly and share what Couchbase is. I'll have a few slides on the Couchbase Data Platform, and then, depending upon how much time I have left, I will talk about data modeling in Couchbase with JSON documents.

So just to begin, to set out what Couchbase is and where it fits in: today we have two different options, I would say, when we talk about data repositories. We have relational databases, and I'm sure we're going to have a nice session on those from Peter after mine. And then we have NoSQL databases. A NoSQL database is mostly good for your high-performance data needs. It's a scalable data store. Documents can be stored, and the structure of those documents is pretty flexible. You don't have to worry about normalization of your data. You can just group your information in a single document and then store it in a single read-and-write operation.

And the Couchbase Data Platform is, I would say, a data repository for your modern applications. It gives you more than just a NoSQL data store. It has a built-in cache with persistence and replication, meaning it's a highly available cache. On top of that, you can access your data with simple key-value APIs. But we have also made a lot of enhancements to the platform: we now allow customers to store JSON documents, define secondary indexes, and access those documents using a well-defined query language we call N1QL. But then we understand that today's modern applications need more than just a simple, index-defined query access pattern. We all know that when we go to Amazon-style applications, we would like to do searches so we can find the product categories very fast.
And generally, these problems are solved by having a separate search engine, but we have collaborated, or I would say integrated, that search engine into our data platform. We call it full-text search. All of these are microservices, and I'll share what that means for scaling your cluster in just a minute. On top of full-text search, you can also run analytics on your operational data. And then you can build mobile applications as well, natively, using the Couchbase Data Platform.

So this is how the architecture looks. You can deploy Couchbase Server, which is on the far right, on-premises or on any cloud. In fact, we have AMIs available for three of the most popular public cloud platforms, so you can just go and try it out through the test drive that's available. But the point is you can deploy Couchbase Server and then build your mobile application with Couchbase Lite, which is an embedded database that can store a subset of the data on your local devices. That gives you offline-first experiences, which means that if there's no connectivity, or the connectivity is intermittent, you can still work with your applications, and the data will be synced to the server through Sync Gateway. Sync Gateway will take care of resolving any conflicts, and users get a seamless experience even on edge devices, right? So you can build smart applications on the web as well as on mobile.

So we hear that NoSQL is, I would say, easy to scale on demand, right? Couchbase also delivers that functionality, meaning you can expand your cluster as your business needs grow, but we have gone a step further. We understand that scaling out is very important for today's digital businesses, who are transforming themselves with new technologies. What Couchbase gives you on top of scale-out is the ability to scale individual services, the services I mentioned just a minute ago, right? So if you want to scale your search indexing capability, you can do so. You can deploy heterogeneous servers with different amounts of CPU and RAM available. You can scale all of these services depending upon how much workload you expect in your business, right? So if you are scaling from two to five nodes, not all the services need to be scaled. You can mix and match. And it's a single-node architecture for Couchbase, meaning every node gets all the services by default. All you have to do when you're installing is click a checkbox to decide which of the services need to be enabled on this host. It's as simple as checking a box.

So with that, let's move into how you can think about modeling data in these NoSQL databases. Couchbase, as I mentioned, gives you the ability to store data as key-value, and you can also store documents, right? And the documents are JSON. As you would expect in any real-world application, you have data which is rich in structure. You have attributes and nested structures. Then you have attributes which have some kind of relationship, like a customer has connections, friends, right? So these are one-to-many and many-to-many relationships. And what you expect from the data is that it will evolve. You will be updating and deleting your data on a periodic basis.
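To make that concrete, here is a minimal sketch of the kind of JSON document Anush is describing, written as a N1QL insert. The bucket name, key, and fields are illustrative assumptions, not taken from the webinar:

```sql
-- A customer document with nested structure and embedded one-to-many
-- relationships, stored in a single write. All names are hypothetical.
INSERT INTO customers (KEY, VALUE)
VALUES ("customer::1001", {
  "type": "customer",
  "name": "Jane Doe",
  "billing": { "street": "1 Main St", "city": "Richmond", "zip": "23220" },
  "connections": ["customer::1002", "customer::1007"],
  "orders": [
    { "orderId": "order::501", "total": 42.50 },
    { "orderId": "order::502", "total": 17.25 }
  ]
});
```

Because the orders and connections are embedded rather than normalized into separate tables, the whole customer can be fetched or stored in the single read-and-write operation he mentions.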
But even the schema changes over time, right? Because in a fast-moving world, you can start with one structure, but as we all know, by the time we are done developing our application and ready to deploy, sometimes even the structure has changed, right? And then you have to react to those changes pretty fast.

In the relational world, you would be writing these rich data structures into entities, and these entities would be related by some relationship: customers have some purchase history, contact and billing information, and some connections, right? And the data that you write into these entities would be normalized, because you don't want redundancy, and you would like the data to be merged back together when you write any select queries. You would be using joins to fetch data out of these many different tables to meet the needs of your application.

In the NoSQL world, you would be relying on the JSON structure. As you all know, JSON is a very object-oriented representation of your business data structure in a text format, right? You have a key and a value, and that value could be a number, string, Boolean, or it could also be a JSON object in itself. The benefit of using JSON is that it is self-describing: the schema is embedded in the document itself. So you can write this document and define any key, any attributes, on the fly, right? And you don't have to worry about what would happen if you don't write this field. As long as there's no index defined on it, there's no problem. You can evolve your documents on the fly, and that gives developers the flexibility today to change their application as the business changes.

So this slide is just representing how the data would look in a relational versus non-relational fashion. On the left, you have different entities with the data in a normalized fashion. On the right, you have a JSON structure, and I think it's self-describing.

My last slide is just to give you a quick snapshot of the kinds of functions that we do with a relational database. Many of those things are possible with JSON; there are some minor differences. The relational database you could call a general-purpose database, meaning once you have normalized your data, you can run your queries and slice and dice the data without worrying, okay, if my business needs change, do I need to worry about defining the indexes later? Whereas a JSON or NoSQL database is very much optimized to your access pattern, right? So you have to think a little bit ahead about how you are going to access your data, so you can define the right indexes. You can optimize your performance based on whether you need composite keys or different indexes. But I just want to highlight that in Couchbase, we do have support for N1QL, and it has ANSI joins, meaning not only can you have a nested object stored as a document, but you can also use joins to join multiple documents and still get some of the benefits you expect from a relational database. So with that, I think that's my time. I will hand it over to Peter. And Peter, take it away.

Thank you so much for this great presentation, and Anush will be joining us in the Q&A portion of the webinar as well at the end. And now let me introduce to you our speaker for today, Peter Aiken. Peter is an internationally recognized data management thought leader.
Many of you already know him or have seen him at conferences worldwide. He has more than 30 years of experience and has received many awards for his outstanding contributions to the profession. Peter is also the founding director of Data Blueprint. He has written dozens of articles and 11 books; the most recent is Your Data Strategy. Peter has experience with more than 500 data management practices in 20 countries and is consistently named a top data management expert. Some of the most important and largest organizations in the world have sought out his and Data Blueprint's expertise. Peter has spent multi-year immersions with groups as diverse as the U.S. Department of Defense, Deutsche Bank, Nokia, Wells Fargo, the Commonwealth of Virginia, and Walmart. And with that, let me turn it over to Peter for today's presentation.

Hello and welcome. Thanks, Shannon. And Anush, thanks for a great little presentation there. It will be useful to talk about the differences between the two when we come back at the top of the hour. I'm going to move a little quickly here to make up for some time. The topic today is the fundamentals of data management, and data modeling in particular. I'm going to start off with a quick data management overview and then move into some motivation. We're going to talk about why data is not a well-understood structure; that's, I think, a very good description of it, but we have to figure out what it all means. Then we'll talk about, obviously, what data modeling is and why it's important, and all the rest of the things that go in there. The key is that we're trying to develop a shared understanding between humans and systems, and then we'll get to some of the real fundamentals. There are a couple of them here that I want to highlight up front. One is that there's power in the purpose statement. Another is that much of what we teach students about this is wrong; unfortunately, you can't trust the universities to produce good data modelers out of the classes that they have, and I say that fully as a university professor. We'll talk about how data modeling helps people toward better data-centric thinking, and how data modeling complements other types of engineering and architecture techniques. Then there are some challenges beyond data modeling. And then, as Shannon mentioned at the top of the hour, we'll do about a half hour of Q&A so that you guys can ask whatever questions you're interested in.

So let's get started. Again, the first topic here is what data management is, and I'm just going to fly through this stuff; some of you have heard it before. If it's not clear, please ask us in the Q&A. We're really a very new discipline, and most people have agreed that data management consists of the stuff that goes on between when you acquire the data and when you use the data. And that's pretty much everything. So we've got some definitions that we use. They're not terribly helpful; we haven't agreed on one in the sense that the industry has adopted it. But it does mean we have some components of data engineering, data storage, and data delivery, which means we have to be governed in those processes, and therefore we need some specialized team skills. The only problem with even this good definition that I created a while ago is that it does not really show the power of data, because it does not depict reusing the data well. It's a much more cyclical fashion.
So we've come up with a newer diagram where we show data as the central resource being utilized by all of the knowledge workers in your organization. And each of those knowledge workers can then provide feedback that allows us to come back in and improve the data all the way around. We liken this a lot in the data world to Maslow's hierarchy of needs. Some of you may remember it. Maslow said that if your physiological needs, your food, clothing, and shelter needs, are unmet, you can't be safe. If you aren't safe, you'll never get to the status of being loved or belonging to something that's larger than you. Self-esteem suffers in that general context, and we can't get to self-actualization. All of that, of course, says that data is exactly the same way. There's this wonderful stuff that everybody talks about all the time: data management practices like MDM. If I were going to redo this diagram, I would add the words Bitcoin and blockchain to that yellow diagonal. But that is just the tip of the iceberg, and that's really the key to this. Those things are largely advanced practices; what we need to talk about specifically are the foundational practices, because the things that everybody is advertising these days are technologies as opposed to foundational pieces. What we really want to talk about today are capabilities. Now, everybody asks me all the time, can you do this faster? And the answer is yes; if I work faster, it will take longer, it will cost more, it will deliver less, and it will present greater risk than if we proceed the way we're trying to proceed.

Now, you'll notice those five base foundational pieces flew over into this chart here. What we're looking at now is a good structure where we talk specifically about the ability to manage data as a coherent asset, so that we now have professionals we can call data asset management professionals. We can maintain the fit-for-purpose sense of this, and if we do it with the right technology and the right processes, we'll probably produce some results. There are ways of evaluating each of these; it's a one-to-five scale. This is called the DMM, and it's done by Carnegie Mellon University's CMMI Institute. Our colleague, Melanie Mecca, will be on in April, so next month we'll do this again on the DMM side of things. Again, just a very brief analogy of how it works, though. There is a weak-link-in-the-chain aspect of this: if all of these are performing at level three, which we'll all agree is a level we should aspire to, but you don't have one of these things in place, well, the chain is only as strong as its weakest link. It's a very good metaphor. And what it says is that if you put a million or a billion dollars into data quality, it won't help your data management efforts if you're lacking one other essential piece. I'm just using that as an example. Again, we can come back, and we will talk about that for the full hour next month.

The data management body of knowledge, of course, acknowledges data development as an important aspect right here. We're going to sort of talk about the blending between data architecture management and data development here. The DMBOK does give us one overview here. I'm not going to read you every word on this page, but it's a really good page to have, because the key is there are things that go into it.
There are activities that we do, and there are primary deliverables that come out of this. And these are rather good lists, so we'd like you to make sure that you have them accessible, because architectures and models are here whether you'd like them to be or not. All organizations have these models and the architectures that are represented by the models, but some are better understood and documented than others, and that's the key. Let's put it this way, taking the opposite approach: if they're not understood and therefore not documented, they can't be useful to the organization. So we want to do a little better than that. Architecture and models kind of meet in the middle. Architecture gives you this high level of abstraction where your understanding and the integration are all focused. The models are more downward-facing, talking about specific implementation details, and models are literally the translation between systems and people, as well as between architecture and implementation. So that's a very good spot for all of these things to come together.

Let's dive a little further now, get out of the motivation section, and talk specifically. Models are all about things that somebody cares to keep information about. The first thing that we recognize is entities. Those are the persons, places, or things, the nouns, if you will, in our IT development. Then we also talk about attributes. These are characteristics of an entity. They include descriptive things: color, size, sequence, et cetera, et cetera. And then finally, we get to the relationships between those various things. This is how we do all of the sharing that we're trying to do. Again, I'll give you an example here: an order is placed by one and only one customer, if that's the case. Well, that may be a business rule, but it is enforced by these relationships built on the structure of specific entities and attributes. And the reason we know this is because we've had some good education here, but the vast majority of knowledge workers are taught absolutely nothing about data. And of course, that is the definition of a knowledge worker, somebody who works with this stuff. When we teach IT professionals about it, it's even worse, though. We teach them how to build a brand-new database, and this gives the impression that IT professionals with data experience are needed only when you are building new databases. There are a lot more things that we do. That's one of the things I hope you take away from this particular hour.

The real key for all of this is that, unlike some major failures, and I'm going to show you a major failure in the real world in just a quick second here, data failures are really more problematic. So here's the Tacoma Narrows Bridge, for those of you that are in the Pacific Northwest, a fairly famous bridge. It was opened on July 1st of 1940, and it collapsed on November 7th. So it didn't last very long. It's been called the most dramatic failure in bridge history. And the nice part was the engineers actually did take a look at this and said, wow, we built that bridge incorrectly, unsafely. There's a car in the middle of that bridge as it's wobbling here. This is a windstorm, for those of you that haven't seen the video before. And just so that you know, the video here was taken by the insurance company that was on the hook for paying for the bridge if and when it collapsed. So that is an issue.
Now, the reason I'm showing you this is because, unlike bridge failures, when you have a data failure, it's simply not as well known. They usually blame the system and say, there's a problem. And I contend that data failures cost organizations minimally 20 to 40% of their IT budget. But they do it in a different fashion than a bridge failure. Instead, it's little tiny cuts here, here, here, and here, repeated thousands, millions of times. I have one customer that I've worked with that had a billion queries a day against a database. This is death by a thousand cuts, because maltreated data costs a lot of money. Think about the opposite question: were your systems explicitly designed to be integrated or otherwise work together? Unless they were part of an integrated software package or an ERP, the answer is they probably weren't designed to work together from the bottom up. And consequently, there's a problem. So what's the likelihood that they would happen to work together? Well, good luck on that. So there's our 20 to 40% figure, saying that data conversion, improvement, and evolution are costing lots of organizations lots of money. And this leads to my bad data decision spiral, where we have business decision makers who are not data knowledgeable, both of them making bad decisions, leading to poor treatment of organizational data assets, poor quality data, and of course, poor organizational outcomes. We need to break out of that, and we're going to try to work on that by talking about how data models can contribute.

Again, for what's going on in the world today, one of the things I love to show is this infographic put out by Domo each year, where they talk about what happened every minute of every day for the entirety of 2017. Every minute, we streamed 70,000 hours of Netflix. That's a pretty intense calculation. Again, half a million tweets, a million Tinder swipes, whatever it is you're looking for; there will never be any less data than right now. And consequently, we need to pay attention to it.

But because data is automated, we also need to have engineering in context here. So here's a wonderful piece of equipment that you guys can see if you're going to join us next month at Enterprise Data World in San Diego. It's on the USS Midway. This thing is tall. It has a clutch. It was built in 1942 and kept 4,000 sailors in breakfast, lunch, and dinner for the duration of World War II. They had to build something. It had to work. They weren't going to be able, when they got out in the middle of the ocean, to say, oh, send us another one of those thingies. And I don't care how many of these KitchenAids you put together, they are wonderful machines, but you can't have a duty cycle of 4,000 people using that machine to make pancakes. Again, that's the engineering side of it.

From the architecture side, you can't architect after your implementation. By the way, I get a lot of requests for this particular piece; it's an old BMW commercial. Send me an email and I'll be glad to put it in a link and send it to you. The key is that most people look at this and say, oh my goodness, do we have a good architectural foundation here? And they look at this house and go, eh, I'm not sure. Well, it depends on what you're trying to do. If I'm just trying to add a DISH Network dish to the top of this house, your IT people are pretty good at saying, yes, that will be perfectly okay to do.
However, if they don't have a good foundation, then you shouldn't do anything else, and these things should be declared unsuitable for further investment.

So let's dive in again with the modeling. What is modeling? Well, we do analysis and design to determine and understand the data requirements for whatever our customers are trying to do, and also to design the data structure. So it's the model itself and the modeling process that gives it to us. The model is this set of data specifications. And you can see already that we've started to talk about the aspects of it: again, entities, the attributes that make up an entity, and how those things are grouped or arranged together. If you look on the right-hand side of this diagram, you'll see my logo at Data Blueprint here. We think that's a pretty good one, because that's what we do. So modeling is this process of trying to figure out what we do to make a solution for people that doesn't compromise the integrity or the security of the data. Good data models communicate the requirements and the quality of the solution design. And we're guided by two formulas: purpose plus audience equals deliverables, and deliverables plus resources plus time equals the approach that we take to doing data modeling.

Data models help facilitate formalization. They produce a single, precise definition of the data requirements and the data-related rules that govern them. We tend to call these business rules, but that's only because we tend to be sitting in IT when we're doing this. Again, our guidance here is that these things should be done in the business, and they should not be done in IT. Consequently, you have a big challenge of making sure that you actually talk about the way things are happening instead of this funny language that we use to talk about business rules. We also get specific communication out of this. Again, it bridges the understanding between people and the systems, because literally what you put into the data model becomes the actual database itself. We can also do some things that help organizations understand whether they're going to evolve an existing structure, or run training exercises and things like that; a lot of people are actually picking up these skills in order to do this. Finally, it helps us with scope. If it's not in the data model, it's not going to happen.

Now, I'm not going to spend a fair amount of time talking about the ANSI/SPARC model here, but just know and understand that data models are generally done for one of three basic purposes. The first one is to give somebody a conceptual view of what's going on. The second one is a logical view, which gives us a little more detail and becomes the plan for taking that conceptual thing and implementing it in something physical. The physical model is how it looks in Oracle, DB2, Teradata, or Hadoop, depending on what you're doing with the whole process.

The various notations in data modeling are challenging for us, because we have at least four variants that we use. There's a Chen style, a Bachman style, a Martin style, and the one we use, which is called information engineering. Each of these shows the same basic types of relationships from its own perspective. The key is: do not get yourself involved in an argument over these.
Get one of these styles and move on. It is not worth arguing the fundamental differences. There are some methods that are not capable of supporting certain concepts, but any one of these methods will work just fine. Again, just pick one. We're going to pick information engineering, and let's take a look.

What's a relationship? Well, a relationship is a natural association between two or more entities. I haven't even labeled the entities that are being associated here, so it's a little hard to figure out what's going on, but let's put some things on here. We talk about these as ordinality or cardinality: again, dependencies, and the max and min types of things that are on there. We look at this and we say an order is placed by one and only one customer. That means that if two customers want to place one order together, that requires special processing, and we simply can't handle it with our automation. No problem with that from a business perspective if you only get one of those a month; if you get one an hour, you may want to change your model, and the modeling will allow you to. Here, an order contains at least one or more products; you can't have an order without a product. You could also insist that you can't have a product without an order, but we don't like to do that. A customer can place zero orders or one or more orders. There's our definition of what a customer is: a customer is somebody who could place an order with us, as opposed to somebody who has placed an order with us. Those of you that have been in this business understand these very subtle differences that I'm emphasizing here. A product is contained on zero or more orders. Again, this is just a high-level description of what those relationships are. (There's a SQL sketch of these order rules a little further down.)

Let's take a look at what the proper relationship would be between a member and a club. The answer is, generally, although it depends on the circumstances, that a member can be part of many clubs, and a club can have many members. We tend not to say you have to be exclusive or anything like that. That's the relationships aspect of it. Obviously, the entities are the member and the club: business things about which we're going to create, read, update, or delete information.

Now we're going to talk about what an attribute is. The definition here is that attributes describe an entity and its instances. In our definition of club, we have some sort of ID for the club. We have a current promotion, whatever that happens to be, a period of obligation, a number canceled year-to-date, the number of members, and the total units sold by the club. These don't mean anything outside of the context we're operating in, but again, for this, it's going to be important.

So let's put it all together in a data structure. I'm going to show you these data structures without the attributes on them, just to get us started. We can have a business rule here that says one employee can be associated with one person, and you say to yourself, well, that makes sense. A person is either an employee of our organization or not, and it seems natural. However, if the concept of moonlighting is required, as in somebody may have a second job within the organization, then that might be important. Now, most people say, well, that's no big deal, but when I created this particular data model, I was working for the Defense Department, and 30% of the Defense Department held a second job during those years. So reprocessing 30% of your payroll every week probably did not make a whole lot of sense.
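Here is a hedged sketch of how the order/customer/product rules Peter walked through a moment ago might land in SQL DDL; the table and column names are illustrative, not from the presentation:

```sql
CREATE TABLE customer (
  customer_id INTEGER PRIMARY KEY
);

CREATE TABLE product (
  product_id INTEGER PRIMARY KEY
);

-- "An order is placed by one and only one customer": the NOT NULL
-- foreign key enforces exactly one customer per order. A customer can
-- place zero, one, or many orders with no extra constraint needed.
CREATE TABLE customer_order (
  order_id    INTEGER PRIMARY KEY,
  customer_id INTEGER NOT NULL REFERENCES customer (customer_id)
);

-- "A product is contained on zero or more orders," and an order contains
-- one or more products: the many-to-many resolves into an associative
-- table. (The "at least one product per order" minimum is the one rule
-- plain DDL can't express; it needs a transaction- or application-level
-- check.)
CREATE TABLE order_line (
  order_id   INTEGER NOT NULL REFERENCES customer_order (order_id),
  product_id INTEGER NOT NULL REFERENCES product (product_id),
  quantity   INTEGER NOT NULL,
  PRIMARY KEY (order_id, product_id)
);
```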
So, coming back to that payroll example: we can go to data modeling and say, let's get a better view of this. Let me show you one other example on the same page. One employee can be associated with one position. Notice it's the exact same structure between employee and position as between person and employee. This is the concept of job sharing: if somebody wants to work two days a week and somebody else three, this system cannot support it as it is structured. So let's change the structure just a bit. All I'm doing here is changing the data structure to say a person can be related to zero employees, but a person could also be many employees. Similarly, a position can be filled by one employee, zero employees, or, again, many employees. And if that's the case, that gives us more flexible data structures. So I'm going to put the more flexible one on the left and the less flexible one on the right, and simply say that, look, the one on the right requires two more structural loops if you're going to process all of that data in one fell swoop, leading us to discover that these data structures must be specified prior to software development and acquisition. Unfortunately, we don't teach that, and consequently, most people try to do this in the middle of a project as opposed to prior to developing the software. Again, the understanding that we're trying to get to here is to make sure that everybody understands the models from an architectural perspective. It's a digital blueprint. Again, plug for Data Blueprint there, right? I did name the company, right, didn't I? Again, it gets us that shared understanding of people and employees. (There's a SQL sketch of these two structures below.)

So let's talk a little bit about how you do the models. There's a five-step process that's not horribly difficult. First, we identify the entities. So there they are. I'm not going to label these guys... well, let's put names on them so that they're all different. All right, so I've got some names that are out there. Now let's identify a key for each of those entities. An entity is a set of things about which we create, read, update, and delete information: clubs, members, as we talked about earlier. You need a key to be able to find the exact instance that you're looking for; if you don't have the ability to find the exact instance, it's not very useful. Then we draw a rough draft of the entity-relationship diagram: these things are related to those things in these ways. Again, I'm being very high-level here. Our fourth step is to identify the data attributes, to write down everything that we're supposed to be keeping track of in the organization. Terrific way of doing it. We then need to assign them: these go here, these go here, whatever it is, we map the attributes to each entity.

Now, the model should then evolve. I mean, the idea is that this is the first draft. One of my favorite quotes is one by George Box that says all models are wrong, but some models are useful. If this model provokes additional discussion and now you want to evolve the model a little bit, no problem at all. So I changed the model a little bit. It didn't actually change the structure, but when I moved it a touch, you can see it actually presents a much easier-to-read structure. It's okay to change the models at first, but if your changes are not becoming fewer and farther between, then you're off on the wrong track.
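And here is a hedged SQL sketch of the person/employee/position example above, contrasting the rigid and flexible shapes; all names are illustrative:

```sql
CREATE TABLE person (
  person_id INTEGER PRIMARY KEY
);

CREATE TABLE position (
  position_id INTEGER PRIMARY KEY
);

-- The less flexible structure: the UNIQUE constraints make
-- person-to-employee and employee-to-position one-to-one, so there is
-- no moonlighting and no job sharing.
CREATE TABLE employee_rigid (
  employee_id INTEGER PRIMARY KEY,
  person_id   INTEGER NOT NULL UNIQUE REFERENCES person (person_id),
  position_id INTEGER NOT NULL UNIQUE REFERENCES position (position_id)
);

-- The more flexible structure: dropping UNIQUE (and allowing a NULL
-- position) lets one person hold several employee records (moonlighting)
-- and one position be split across several employees (job sharing).
CREATE TABLE employee_flexible (
  employee_id INTEGER PRIMARY KEY,
  person_id   INTEGER NOT NULL REFERENCES person (person_id),
  position_id INTEGER REFERENCES position (position_id)
);
```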
Again, from an evolutionary perspective, we may also discover another association that we want to add to this model because we've been doing more analysis work on it. The relative amount of time that you spend as you're going through the data modeling should evolve in this fashion. There are a couple of different activities we look at: collecting evidence and analyzing it. We should start off with more collection and eventually shift to much more analysis. The second aspect that changes over time is that the coordination requirements change. We almost always need to go and access a group of people that we call SMEs. It's not a very nice thing to call somebody, but when you understand that a SME is a subject matter expert, they kind of like that. So we need to do a lot of coordination at first, but hopefully that will decline over time, allowing us to put increasing amounts of time into the various analyses that we're doing.

Finally, there's one aspect of modeling that is not taught and, for the most part, not done, and that is what we do between refining the model and validating the model. When we refine the model, we're still thinking about what it should be. When we validate the model, though, we present it to a bunch of people and we pay them to find mistakes. This is a little bit tough, but we want people to find mistakes. This comes out of the field of software testing. If you tell people to test the software and they say, oh, it's fine, they haven't done anything; but if you pay them to find errors, they will find errors. Of course, we want to do this exactly the same way. Now, I'm not saying pay money out of your back pocket, but keep candy on the table, cookies, that sort of thing. On the other hand, most people, when they're invited to a data modeling session, say, oh, I've got to go visit my dentist that day, because I'd rather do that than go do the data modeling. Don't tell people that you're going to do data modeling. Just write some stuff down, then arrange it, and then make some appropriate connections between the things that you've written down.

So that's what we mean by data modeling. Again, a very quick, high-level piece. There are absolutely fundamental pieces of this that you've got to get right, though, and we do teach much of that in colleges and universities. What we don't do is talk about these next aspects that I'm going to dive into. So let's dive into, first of all, the power of the purpose statement. This was something that Clive Finkelstein taught me many, many years ago. Models are developed in response to an organizational need. The organizational need may be: I don't understand what happens in that system. A data model is a great way of explaining why certain things are done in certain fashions. I was with another group recently, and they were working on something, and I apparently opened a 10-year-old wound, because the guy who'd built the database was there with the user. And we said something like, well, wouldn't it be easier to track people this way? And he said, you know, we talked about that 10 years ago. I told you, when I made this model, you could do one or the other, but not both. And everybody went, wow, yeah, you're right. About 10 years later, he got to say, I told you so. But the organizational needs are going to vary, and they're going to be instantiated into some form of a data model.
You may be building a new system, but you may also be trying to understand an existing system. That then allows us to see what was actually implemented by the IT people. And of course, we need a feedback loop in there to make it work all the way around. The problem is that most people think, we're done. We're not. Organizations should be evolving continuously, and their data models are going to continue to evolve with them; it is something that you need to pay attention to.

These definitional pieces are the first mistake most people make when they're doing data models. When you look at a definition, if I come up here and say I want a definition for a bed, well, the definition of a bed is something you sleep in. Now, that's actually a very poor definition, because a sleeping bag would fall into that. Your car would fall into that under certain circumstances. But nevertheless, we'll just assume the definition is there. Now, let's take a look a little further at this. I go in and talk to people, and again, I joke about locking people in a room, but the reason we're locked in this room is to write down a mission statement for the modeling exercise. Absolutely critical. At the top of every page, you want to say: we're trying to understand the formal relationship between a soda and a customer, okay? So that when we walk out the door, we'll have this relationship in mind and be in really good shape. By the way, sodas don't have a one-way relationship with customers. Sodas are given to customers, and customers select sodas. So we may have a good understanding there, but most importantly, when we're trying to understand the relationship between the soda and the customer, coins may or may not play a role. Again, your Apple Pay may or may not play a role, depending on what we're doing. So this gets to the scope issues that we talked about as well. Oh, yeah: a customer selects and pays for a soda. Now we've got it, and we can go from there.

Here's another one. The mission is to understand the characteristics that differ between our hospital beds. So here's a hospital bed, and when we walk out the door after this modeling session, we'll have identified the top three traits that differentiate the brands of hospital beds that we have here. So what we've got in this little example is that the entity bed is a substructure within a room, within the substructure of the facility location. It contains information about beds within rooms. Cool, no problem there.

Let's take a look a little further. Remember the job sharing question? Let's go down a little further on that and ask: could our systems handle the following business rule tomorrow? What if I wanted to start job sharing? Well, if the rule is that an employee has exactly one position, we probably won't be able to do that. Oh, I'm sorry, I'm showing it there with an arrow, so it does actually have multiple positions... I did that badly. Let me just make sure I get it right. This is our understanding of the existing system: an employee has exactly one position, and a position can be filled by zero or one employee. So if we want multiple employees sharing a position, that can't happen with the existing system. All of this leads to an understanding: if you're just doing definitions, you're missing out on the ability to add a lot of contextual information to your models.
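One lightweight way to keep a purpose statement attached to the model itself, rather than only in a document nobody opens, is schema comments. A minimal sketch, assuming a database that supports COMMENT ON (PostgreSQL and Oracle do); the table name and wording are illustrative:

```sql
-- Record the purpose, not just the definition, right on the entity.
COMMENT ON TABLE bed IS
  'Purpose: track which room each bed sits in so that patients can be
   located. A bed is a substructure within a room, within a facility
   location; it is not merely "something you sleep in".';
```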
So here's a real example that we did from the Veterans Administration system. Yes, that big old system that they've got out there. We put in place a purpose statement describing why the organization is maintaining that concept. We've got the sources of information. We've got the partial list of attributes that are in there and the associations with other attributes. Now, I'm going to tell you about this because it was a real, live example. We had a data model that had this data structure in it as they were planning the iteration of the veteran system that's currently in use. And a bed is a substructure within a room, within a facility location. Somebody said, hey, we're going to put RFIDs on the beds so we can track them, because hospitals occasionally lose patients. I know that sounds bizarre, but they do. And that would be a problem. Well, here's the issue. If I put an RFID on the bed, then what does the hallway have to become? And the answer is: a room. If a hallway is the same thing as a room, we have a definitional problem, because when I put somebody in the hallway, I don't know whether they're at this end of the hallway or the other end, or in an elevator, which is another interesting place where they were losing people. Again, don't let them just define it. Give them a purpose statement; it's much more powerful.

These data models, and I'm going to give you a couple of quick examples here now, define the vocabulary. The organization that used this particular model said: this is how we're going to describe account subscribers, charges, and bills. Again, a very, very nice articulation. It's something that everybody had, and they all understood it, so they're all literally speaking the same language.

Here's another model. This one lays out the official vocabulary for fuel, customers, autos, et cetera, et cetera. And it tells us a couple of interesting things. First of all, we can tell just by looking at this model that it's a car rental company and that the rental agreement is clearly central to everything else, not because it's in the center, but because it's connected to more things. There is no direct connection between customers and contacts; that's kind of an interesting observation. A contract must have a customer, and nothing structural prevents an auto from being rented to multiple customers. And the phone units are actually tied to the rental agreements as well as to the repair history. So we can get a fair amount of information off of that chart.

Here's a third example, again with a lot of different things in it. They're clearly doing sales commission-based pricing. It's difficult to change a customer address, because a customer address is part of an order; we're going to have to go back and find all the orders for that customer if we're going to do that. You could implement variable pricing, but it'd be very difficult to implement standard pricing, given that salesperson information is not tied directly to the order. The prices are not included in the catalog. And we might ask a question here: do salespeople sell things that are shipped quickly so they get their commissions quicker? There's nothing that prohibits a sale from having multiple salespeople attached to it. Multiple invoices are allowed for a single order, and partial shipment is allowed. These are all things we can tell from the database. So both forward and reverse engineering are going to be very, very useful here.
One more model here. This is another part of the VA system, and I want to illustrate it because it raises another interesting concept. They're looking, in this case, at a model view, a subsection of the overall database, in this case concerning the disposition of patients. Now, the disposition is related here to admission and discharge, and you can see we have a one-to-one: every admission must have a discharge, and every discharge must have an admission. If you don't have that, it can't work; the database doesn't have the structural integrity that we'd like it to have. Let's put that up here now in the context of the examples. You can see there are some business rules that we've written into play there, and I'm going to just change one of them on you. There's the first one: admission is associated with one and only one discharge. That's the thing I showed you before. And there's the definition of admission: it contains information about the patient episodes. That's the definition, by the way; these are not purpose statements. And discharge is the table of codes holding the disposition types that are available. So one of the things we pointed out to the group that was implementing this was that death must be a disposition code. Oh my goodness, what a dour thing. Here we are on a bright, sunny afternoon, and we're talking about death. Well, we said the same thing to the people who were building this: do you really want the admission to say the discharge cause was death? So they went back, rethought that a little bit, and said maybe that's not really the best way to think about this sort of thing. (There's a SQL sketch of this admission/discharge rule below.)

Einstein, one of my favorite quotes: the significant problems we face cannot be solved at the same level of thinking we were at when we created them. And this tells us a little bit about what's going on. Most organizations start off with strategy, and they say, let's have some IT projects, and let's create some data around them. This is so wrong, I just can't begin to tell you, and again, I won't try, because I've written books on it. But we need to flip it. We need to flip it completely over and say: you need to do the data and information first, and then discover the IT projects after that as a subsequent piece. We've done a little bit of work out there. There's a website called thedatadoctrine.com, and if you look, there are four tenets that we have put up that say what it means to be data-centric first. We did it just like the Agile Manifesto: it's not that we don't like the things on the right, it's just that we value the things on the left a little bit more. Just a tiny quick plug in there, but again, we are looking for people that want to engage in a dialogue to help us codify that a little bit more.

Second Einstein quote: everything should be made as simple as possible, but no simpler. So let's look at what organizations manage from an architectural perspective: process architecture, systems architecture, business, security, technical, and, of course, data and information architecture. We want to have all of those. What we also need to have, though, is management: not thinking you're just sitting around in a committee meeting, eating cookies, and not actually getting any real work done. There are various ways of making sure this works. It is absolutely crucial to make sure that you've got things that are actually working for you. The modeling in these contexts tends to look a lot like this.
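As a quick aside before the engineering discussion continues, here is a hedged SQL sketch of the admission/discharge integrity rule from that VA model; the names are illustrative:

```sql
-- The table of codes holding the available disposition types
-- (death, transfer, routine discharge, and so on).
CREATE TABLE disposition_type (
  code        CHAR(2) PRIMARY KEY,
  description VARCHAR(80) NOT NULL
);

CREATE TABLE admission (
  admission_id INTEGER PRIMARY KEY,
  patient_id   INTEGER NOT NULL
);

-- Every discharge must reference an existing admission, so nobody can be
-- "disposed of" without having been admitted. Making admission_id both
-- the primary key and the foreign key keeps the pairing one-to-one.
-- (The other direction, every admission eventually having a discharge,
-- can't be a plain foreign key, since a patient still in the hospital
-- has no discharge row yet; it has to be checked as a business rule.)
CREATE TABLE discharge (
  admission_id     INTEGER PRIMARY KEY REFERENCES admission (admission_id),
  disposition_code CHAR(2) NOT NULL REFERENCES disposition_type (code)
);
```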
We do forward engineering, where we create a new thing from something that didn't exist: we start off on the left-hand side with our conceptual, our logical, and our physical models as we move forward. That's what we mean by forward engineering, and most students do get that particular piece. However, most people don't realize that since we spend 80% of our IT dollars on fixing things that already exist instead of building new ones, the idea that we're going to be building lots of new things is really, really misleading. We need to understand that these other options here, and I'm going to show all nine of them on this chart, represent different aspects of the ways a data model can be used to provide information about a system. That's a reverse engineering process. It's much more useful than what we typically do in organizations, and certainly much broader than the impression we give students. Again, I'm not going to spend a lot of time on this, but of course all of this involves metadata and the models that are associated with it. I give you these layouts on the side here so that you can take a look at them at your leisure and come back, or ask a question at the break when we get around to it.

All of our models allow us to work within this model evolution framework that we developed a couple of years ago. Again, conceptual, logical, and physical; we also have a division of as-is or to-be. As-is means it already exists. To-be means it's something that we don't yet have; we'd like to get there. There's one more dimension that's typically not incorporated, as I mentioned before: whether the model is validated or not. So we can now say every change we are trying to make in our data modeling environment can be mapped to one of the cells in this particular framework. One other thing that we don't tell people about models is that when we start off with our technology-dependent model and move to our logicals, our reverse engineering, it tends to look like this. I start there, I go up here; we teach them how to do an as-is to a to-be, and then we reconnect. Well, the problem with that is it ignores the main reason we do this. The main reason is that we start off with a physical as-is model, move to a logical as-is model, but then we do something: we change the architecture, we add new components, we integrate it in a different fashion, something along those lines. And I'm going to change that green square to a green-and-orange square, because we've now modified it. And only then do we move forward. This is another whole hour-long lecture; I will not bore you guys with the stuff that goes into it, but if you're interested, there's lots of material online. We can get to that as well.

One last Einstein quote here, just to break things up: concern for man himself and his fate must always form the chief interest of all technical endeavors. Never forget this in the midst of your diagrams and equations. So, with Dr. Einstein in mind, models are used to support strategy. And this is a really important aspect; most organizations don't see that there's a link there. I'll give you a very specific example. Of course, I'm taking it from one of Clive's books here, but it actually happened to me. I was a retail clothing store manager in the 1970s. If any of you remember those days, we had gasoline rationing. And they told all of us managers in the store that we now had to be salespeople, too.
But the interesting thing was, there was no way to do it. As an employee, you were classified as either a salesperson or a manager, but you couldn't be both. And only salespeople had an attribute where they could accumulate sales; there were no provisions to accumulate sales for managers. So again, these are incredibly important ideas that are just not well understood. Remember, if our systems weren't explicitly designed to work together, the chances that they will just happen to work together are slim and none. Data modeling helps to ensure the interoperability of organizations. A software engineering effort is going to be focused on a program. A database effort is going to be focused on a family, if you will, of programs that work together. But who keeps track of what's going on between the yellow database and the orange database? The answer is usually nobody. So when we want to have this greater potential value, it means we have to have organization-wide sets of models, so that the decisions about the range and usage of data are made with appropriate knowledge, so that the analysis is on the use of the data to support a process, so that data interface or exchange problems can be diagnosed easily, and so that the goals connect at both the operational and strategic levels. One data model is the ideal; we're unlikely to get to that particular piece.

So we're getting back towards the top of the hour, and I want to finish with just a couple of ways of describing this. We use models to store and formalize information. We filter out extraneous detail; the detail that we have in these models is very, very complex, and we have models with thousands and thousands of attributes in them. I'm showing the little Easter Island thing on the right here, because that is a really good example. They've been trying to figure out how they moved those things around, and one of the things they did was make a series of models to say, hey, could they have done it this way? I don't know that they have the actual right answer, but they're closer than if they didn't have a model to take a look at.

Models define what we call an essential set of information, and essential is a great word to describe this: if I don't have it, it's not there. I have an exercise I do with my students where I talk about my laser pointer, and one of the essential attributes of my laser pointer is that it also changes slides for me. I know that may seem trivial, but if you do 100 presentations a year, you do not want to have two things in your hands when you could have one. It's essential information. You want to understand the complex behavior of your systems. You want to gain information from the process of developing and interacting with the model. Putting the model out there in the first place by itself doesn't do anything; then we interact with the model, and now we start to learn a little bit more. You can do what-if scenarios when you have these types of models and talk about system responses under various different conditions. For business value, the goal is a shared understanding between IT and the business. If you have no disagreements, you have insufficient communication. Hi, dear, how was your day? Fine, how was yours? There is no information transferring back and forth. However, if I say, hi, dear, how was your day? and she says, okay, you need to sit down for a bit because I had a rough one...
...I'll say, pour me a drink and let's talk about it. Data exchange and sharing: that's where we have to have these things engineered. And if we use the wrong data model, we will do the wrong things faster. It is absolutely critical to have a good understanding of data modeling basics in order to build advanced data technologies. The modeling characteristics change over the course of the analysis. Again, at the beginning, you're going to do one type of activity; at the end, you're doing another type of activity. If you are not incentivizing people to point out the errors, nobody will find any errors to point out to you. Nobody likes to be told their baby is ugly, but sometimes it is actually the case that your data model is wrong. You need to be incorporating these motivational statements, again, the purpose statements, because if we have these purpose statements, they tell us so much more than if we simply have the definition of a bed as a place to sleep in. No, that's not right. The bed is the thing we're going to use to track where the patient is, and it's not going to make any sense to turn all the hallways into rooms and all the elevators into rooms. The use of modeling itself is so much more important than the specific modeling method selected. Again, we tend to standardize on information engineering, but there are a lot of other people that read other types of models, and picking up another notation takes about as much time as it takes to get used to moving from an iPhone 8 to an iPhone X.

These models, however, are living documents. They should easily adapt to change. It's not that they should keep changing once they're in production, but as you're going through the process of understanding all of this, it's very, very critical. They should give you access to metadata about themselves; that's what I mean by search technologies. You don't want a data model full of stuff you can't find. By the way, there are some really incredible ways to put models out there on the web so that other people can read them without actually having to buy the data modeling package. Very, very nice, given that sort of situation. Again, the key is utility. I've had people criticize me because we put clip art and color on our diagrams. No. If it helps users understand what's going on, it makes the model more useful. There's a drinking story I'll tell somebody about that if they want to catch me on the side; just remember to ask me about Bunny Smith.

This is the final slide here as we get towards the top of the hour. Why do modeling in the first place? Would you build a house without an architectural sketch? I hope the answer is no. The model is the sketch of the system to be built by the project. Would you like to have an estimate of how long your new house is going to take or how much it's going to cost? Well, if the model tells you that you're going to build it out of seashells, gosh, that's going to take a little longer, right? Nothing wrong with seashells, but it's going to be a factor. If you hired a set of contractors from all over the world, would you like them to speak the same language? The model is the common language for the team, and the data model is the common language, the common vocabulary, for your project. Would you like to verify the proposals of the construction team before the work gets started? There are tons of hours of implementation work in there to get these done; very, very critical to have that. If it was a great house, would you like to build it again? Oh, yeah, I think I'd like to build it again.
And finally — and this is why we spend so much time doing reverse engineering — if you were going to take a drill to your house, would you just randomly drill into the wall? The answer is no, because what could be behind that wall? It could be the HVAC system, the heating and air conditioning. It could be plumbing, which takes the bad stuff out of the house quickly. It could be an electrical run — electricity and drills are probably not the best combination. Worse still, you could drill into a supporting wall. Now, the supporting wall is an interesting concept, because it gets us to something we call referential integrity in data models. If your data models do not support referential integrity, it's very hard to keep the data organized correctly. If you have a disposition code in a hospital scenario and no corresponding admission code, somebody might wonder why all these people are being disposed of — I guess that's the right word for it. Where are they coming from? How are they getting admitted to the system? We want to be able to ask that question. So, again, the models are what allow everybody on the project to come up to speed and understand this. These data models are so critical, and yet we do such a bad job with them. I don't mean to pick on academics — I am one, in addition to being a consultant and all the rest of the things that I do — but it is absolutely abysmal, because we grade students on whether they've gotten the syntax right rather than on whether they're actually communicating with people and getting things done the way they're supposed to be done.

So we're at the top of the hour now. Let me just blow through a couple of events later this month. Hopefully everybody will be gathering in San Diego. The session I'm running is called The First Year as a Chief Data Officer — and it seems like that's often the only year, so that's where the interesting things happen. I think that's 4 p.m. Eastern, 1 p.m. West Coast time. Then our May webinar, as I mentioned, will be on how to understand data maturity, and I'll be doing that with my colleague Melanie Mecca, who is just a wonderful person. Then in June we'll be at the Data Governance conference, presenting on data governance strategies for keeping the data momentum going in your program. And Shannon, with that, we're at the top of the hour and ready for some questions. Peter, thank you so much for another great presentation. Just a reminder: I will be sending a follow-up email to all registrants by end of day Thursday with links to the slides and links to the recording of the session, plus anything else requested throughout — we did have a request for the BMW ad, several requests actually, so we can get that out to everybody. So, diving right in: this question came in right at the end of Anush's presentation. Anush, maybe you want to take it first. How do you preserve data integrity if the schema can be changed without any data rule enforcement? Yeah, that's a great question — I was trying to type an answer, but the chat button wasn't enabled for me. That's okay.
The way I'd like to answer this question is this. A decade or so back, we had just a few relational database options — Oracle, DB2, and a couple of others. Things have changed: now we have, someone reminded me, maybe more than 100, possibly above 200, NoSQL database options. So you have relational and you have non-relational options. When you are designing an application as a developer, the question you need to ask is: is strict data integrity the point? Am I writing an application that is transactional in nature, or am I building an application that is mostly interactional in nature? I'll give you an example. When you go to, say, Orbitz or Hotwire and try to book flight tickets or a whole package for an upcoming trip, you try out many different options — different dates, different prices, when to fly, which airline to fly with. Those are interactions. One of our customers in the hospitality business gave us a statistic: on average, customers make a thousand interactions or more before they do a transaction, where the transaction is actually buying and booking the ticket. Everything before that is interaction with the system. When you are dealing with that kind of scale and those performance requirements, NoSQL is the right engine for the interaction side. When you're building applications to improve engagement with your customers, what matters most is latency: the interactions should be so fast that customers don't sit waiting for the page to refresh. In those scenarios it is latency, not strict integrity, that matters most. E-commerce hopefully illustrates what I'm trying to say. When we go to Amazon or Overstock or eBay and search for a product, look how much information comes back to make the interaction better: recommendations, different pricing, and at the bottom, "customers also viewed" products like this one. All of that is not coming from a relational database system; it's coming from massively scaled database clusters of the NoSQL type, solving a different set of problems. Data integrity is not the hard constraint in those interaction workloads; latency and performance are. So you just have to think and pick the right technology. The relational database, as I said, keeps integrity in place and is transactional in nature, with its own pros and cons. With NoSQL, yes, you deal with denormalized data, but what you get is scalability and ease of use — and, as I mentioned, Couchbase gives you the whole data platform. You just have to think about and factor in all of those flexibilities.
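A minimal sketch of the interaction-versus-transaction distinction being drawn here, with hypothetical names and structures: the interaction is one denormalized read, while the eventual booking is a small atomic unit of work.

    import sqlite3

    # Interaction: one key-value read of a denormalized document.
    # Everything needed to render the result travels together, so latency stays low.
    flight_doc = {
        "id": "flight::XX1234::2018-04-10",
        "fare": 412.00,
        "seats_left": 23,
        "also_viewed": ["flight::XX1292", "flight::YY0456"],  # recommendations embedded, not joined
    }

    def render(doc):
        return f"{doc['id']}: ${doc['fare']:.2f}, {doc['seats_left']} seats left"

    # Transaction: the one-in-a-thousand booking touches several records and must
    # be atomic, which is where relational guarantees earn their keep.
    def book(conn, flight_id, customer_id):
        with conn:  # BEGIN ... COMMIT, or ROLLBACK on error
            conn.execute("UPDATE flights SET seats_left = seats_left - 1 WHERE id = ?", (flight_id,))
            conn.execute("INSERT INTO orders (flight_id, customer_id) VALUES (?, ?)", (flight_id, customer_id))

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE flights (id TEXT PRIMARY KEY, seats_left INTEGER);
        CREATE TABLE orders (flight_id TEXT, customer_id TEXT);
    """)
    conn.execute("INSERT INTO flights VALUES (?, ?)", (flight_doc["id"], flight_doc["seats_left"]))
    print(render(flight_doc))
    book(conn, flight_doc["id"], "customer::0042")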
And Anush, you've done a great job of describing the two paradigms we're faced with — and as you mentioned, there are hundreds and hundreds of variants of these. The key for you all is to understand that the right tool is going to help you, and I think that's what Anush was working toward. There are certain applications where one tool will be better than another, just the same way as, if I'm going to dig a grave and put a body in it, I probably don't want to do it with a spoon. You're going to think my mind is too much on death today — it just seems to be the metaphor of the day — but you don't dig a grave with a spoon; you dig it with a shovel, in a way that makes perfectly good sense. Each of these database technologies lets you make those trade-offs, and the trade-offs can evolve even during the course of the exercise. You might start off with a NoSQL — "not only SQL" — database for the first part of the exercise, and then firm things up for production by turning it into something more normalized, looking at relational databases or possibly even hierarchical or network databases. The problem is that we don't teach any of this in college and university. We teach people that it's only relational or only NoSQL. There are lots of variants in between, and it's a shame we don't get more people educated on this, because we're teaching the wrong stuff in colleges and universities. End of soapbox. Go ahead, Anush. I was just going to say: the one word that comes to my mind is purpose. We have to think about the purpose of the application we're building, and then pick the right tool. And that leads back into the earlier discussion, because a definition statement gives you the definition of the bed, but if you say the purpose of the bed is to put dogs in it, you know you're not talking about people. There's so much richer information in the purpose — and that's exactly what Anush is guiding you toward as well. Understand the purpose you're pursuing, and that will help eliminate certain classes of tools you don't need to consider. What's left over must be the things you should consider.
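One way to see the definition-versus-purpose difference is to carry both statements as metadata on the entity. A small, entirely hypothetical sketch:

    bed = {
        "entity": "BED",
        "definition": "A piece of furniture to sleep in.",  # true, and nearly useless for design
        "purpose": "Tracks the physical location of an admitted patient; "
                   "one record per staffable, billable bed.",  # this is what drives the rules
    }

    def counts_as_bed(location_type: str) -> bool:
        # The purpose statement, not the dictionary definition, decides this:
        # hallways and elevators hold people, but they are not tracked beds.
        return location_type not in {"hallway", "elevator"}

    print(counts_as_bed("ward room"), counts_as_bed("hallway"))  # True False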
So, moving forward with the questions here: when an org uses software as a service, or COTS, there is no data modeling per se, correct? In that case, how do you govern it? I'm sorry, would you repeat that one? Lots of acronyms here. When an org is using software as a service, there's no data modeling per se, correct? In that case, how do you govern it? Okay. Well, I hope you got from this discussion that the tools Anush and I have been talking about are used in developing the software itself, so it's highly unlikely that a software-as-a-service offering with a database component didn't use a data model to create it. The question was specifically about governance, and I'll get to that in a second, but in order for you to understand what the vendor is doing internally, that should be part of the documentation they provide when you're evaluating the service. We now consider it a best practice to look specifically at the data model of a software package before you purchase it, because if you don't look until after you've purchased, you may discover it doesn't have the right structures in it. Now, that's the database that exists on the other side of the software as a service. Remember, the cloud is a computer somewhere else — there's no real magic to it. Everybody keeps thinking there are all these wonderful things going on; it's just another computer somewhere else. That computer has a database running on it, and that database was very likely designed using one of the techniques we've described here. There are more and more of these offerings out there because people are looking for more flexibility, and there's nothing wrong with that. So, to the governance part, which I think is what the questioner was asking — and please ask a follow-up if we don't get this right: the data governance group should in fact be responsible for determining whether the software being provided — the SaaS application — matches the organizational needs, back to what Anush said about having that purpose statement. If you don't have that documentation, all you have is something mysterious going on somewhere else in the world, and that's probably not what you want to entrust your business to. So, absolutely, we consider it best practice to look at the data model. By the way, another thing we do is ask: how long does it take them to produce the data model? When I was traveling with Larry English a lot, he used to come into places with a stopwatch. He'd say, "Show me your data model," start the stopwatch, and measure. The point is not whether thirty seconds beats a minute; the point is what the answer is measured in. Seconds? You can put your hands right on it. Minutes? You know who has it. Hours? You don't know where it is, but you can go look for it. Days? "Yeah, we couldn't find it — we looked all around and finally found a paper copy in somebody's old desk that went into storage a couple of years ago." That's terrible, right? That's organizational productivity, and more importantly, people are making decisions with incomplete information. Anush, do you want to add anything on the modeling part of it? No, Peter, I think you covered it very well. Well, let's talk together about one more thing, then: the governance, which was the real question. People always ask what data modeling has to do with data governance. My answer is, first of all — and I've already said it too many times in this presentation — colleges and universities are not teaching people to do this well, so we do not have a lot of people in the world with these knowledge, skills, and abilities. You are part of that group. And the governance group should be asking: are you looking at broad enough options? Are you looking at the ways those options can be delivered? We can put it on premises, we can put it on a server, we can put it in the cloud — there are lots of things we can do, and we want people evaluating those decisions. Most people just say, "Well, that's the way it is." No, it's not the way it is; it's the way this vendor wants you to buy it. You are the customers, and I know you do this with your own customer base as well: when they say, "I want it this way," you do what you can to help them get it that way, because they are the ones who best understand the purpose of their activity.
And there are two different things we're talking about here. One is the business requirements side. Since we're talking about SaaS, GDPR comes to my mind — I don't know whether folks on the call have heard of GDPR, the big boogeyman that is now becoming law for companies managing data on behalf of their customers. These are guidelines requiring corporations that build SaaS software to make sure they don't share information that belongs to their customers in a way that harms them, or without their approval. That's the governance aspect of it. The other aspect is the design of the data models themselves. If you're using a SaaS application, how the data modeling is done is more or less abstracted away from the end customer; but the person or org maintaining that application will be interacting with its customers and making sure the business requirements are captured in the data models. So on a day-to-day basis customers won't be wrestling with data modeling challenges — that remains the responsibility of the org maintaining and delivering the SaaS product over the Internet. Great. And let's just note: you probably shouldn't be listening to us right now, because you could be listening to Mr. Zuckerberg deal with exactly these issues in front of Congress as we speak. But of course we don't want you to leave — we think data modeling is much more interesting than whatever is happening to Facebook. It is in the news today, though; a very timely example to bring up. Next question: how is a data model done across the various systems within a single enterprise — is it done for every field? That's a great question. At the Department of Defense we tried to control the vocabulary of the department with about 5,000 words, and we were going to make one data model to rule them all. We've learned over time that that doesn't work. What we can hope for is this: your data structures are subsets of your data model — I showed a couple earlier, and I can put one up on the screen: here is the data structure for the VA system, showing the relationships among seven major entities in the Veterans Administration system. To the degree that it's appropriate, reasonable, efficient, and effective to use standard data structures, do so, because if you don't standardize, you have to translate. You have to maintain the translation rules, everything connects to everything else, and remember my contention: 20 to 40 percent of IT costs are spent migrating, transforming, or improving data. So what we can hope for is that organizations adopt large sections of standard structures and introduce variations only when necessary, as opposed to what everybody tends to do, which is to say, "Well, all the systems are different, so we'll make point-to-point connections between each of them" — which means that one definition of employee may have my accumulated sales in it, and another may keep them somewhere else entirely.
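The arithmetic behind that point is worth a quick sketch: point-to-point translation needs a mapping for every pair of systems, while translating through one shared standard model needs only one mapping per system.

    def point_to_point(n: int) -> int:
        # Every pair of systems gets its own translation: n * (n - 1) / 2
        return n * (n - 1) // 2

    def via_standard(n: int) -> int:
        # Each system maps once, to and from the shared model.
        return n

    for n in (5, 10, 40):
        print(f"{n} systems: {point_to_point(n)} point-to-point mappings vs {via_standard(n)} standard ones")
    # 5 systems: 10 vs 5; 10 systems: 45 vs 10; 40 systems: 780 vs 40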
If I go back to my manager example from earlier — the retail stores and my own personal experience — that's exactly the situation: one version of employee accumulates sales in one place, another version accumulates them somewhere else. We want to eliminate as much of that as possible, so the question becomes: where can I apply standards, and how widely can I apply them? And applying standards means using standard data modeling components. Anush, anything to add there? No, go ahead — I think you've covered it all. All right. Would you please comment on the role of data modeling in the go-go-go world of agile software development — two-week sprints, et cetera? Is there a line of thinking that holds "we'll sketch our data model as we go along, no need for upfront work"? I have a phrase for that. The only way that can possibly work — the only way — is if your data is not shared outside of that project. If you think about it, it's true, and it is so incredibly important. Now, this is a problem, because people think we're bashing agile. We are not bashing agile; agile is a really good way of developing better-quality software faster. But if you're trying to do your data modeling in the middle of an agile sprint, the only possible outcome is another small pile of data somewhere in your organization. And I'll finish by saying: then you'll be a Data Blueprint or a Couchbase customer before too long — so don't listen to us, we're perfectly okay with that, because it means more work for us; if you do listen to us, there's less work for us to do. Agile is fundamentally incompatible with the creation of shared data assets. Yeah, I'm back up on my soapbox, but go ahead and finish — sorry. Yeah, this question applies to what we do as well. I've spent more than a decade at a relational database company, and I have firsthand experience building applications — including SaaS applications — on databases. There is definitely a need: whether you build through agile methodology or waterfall, the best practices of data modeling need to be applied. The only thing I'd add is the question of whether you do that modeling at the schema level, in a relational database, or at the document level — I showed the JSON documents a few minutes ago. The relationships still need to be captured and the best practices still need to be implemented, even when it's a JSON document: you have to ask, is this a one-to-many relationship, and if so, do I create a collection — an array or a map — in the JSON? The only thing that changes between the relational and non-relational worlds is where those practices are enforced: in the document, or at the schema level. If you don't do it right, in my opinion, you're going to fail your SLAs in some fashion — performance, scalability, or data integrity — in the relational world and the non-relational world alike. I'm not sure whether I've confused things or answered the question.
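A minimal sketch of that point about where the rules live, with hypothetical field names: the same one-to-many decision appears as a child table with a foreign key in the relational world, and as an embedded array in a JSON document.

    # Relational: the one-to-many lives in a child table.
    #   CREATE TABLE orders (
    #       order_id    TEXT PRIMARY KEY,
    #       customer_id TEXT REFERENCES customers(customer_id),
    #       total       REAL
    #   );

    # Document: the same relationship becomes a collection embedded in the parent,
    # read and written in a single operation.
    customer_doc = {
        "type": "customer",
        "id": "customer::0042",
        "name": "A. Jones",
        "orders": [  # the one-to-many, modeled as an array
            {"order_id": "o-1001", "total": 59.90},
            {"order_id": "o-1002", "total": 12.50},
        ],
    }

    # The modeling question (is this one-to-many, and where is it enforced?)
    # is identical in both worlds; only where the answer is recorded changes.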
Well, one thing you'll notice about our group here is that they're fearless. If we don't get it right, they'll come back and ask for explanations, so don't worry about that — we've got a great group out there. That's very true. So, moving on: how do you deal with metadata? I have seen documentation in traditional files like Excel or even text files, as well as tools in the market, to manage all the entities, definitions, relationships, attributes, and things like that. Metadata is shorthand for data about data. Just knowing that the example I have on the screen has seven entities — user, admission, discharge, encounter, facility, diagnosis, and provider — is metadata about that model. If you didn't know what those entities were, you wouldn't be able to talk about the rules that have to hold as you implement that bit of data. All good CASE tools — as Couchbase does, and all the rest of them — have a facility for recording the purpose statements I talked about, and the purpose statement, of course, is metadata as well. Metadata is what gives data context. If I'm talking about a user and I've got boxes and squares on the screen, I'm probably talking about an IT thing; if I'm talking about a user on the streets of Richmond, Virginia, I'm probably talking about a drug addict. And I don't mean to say anything bad about my own town here — that's just one of the contrasts we can draw. I'm sure that in your tool you have ways of running reports on metadata and things like that? Yes, definitely, absolutely. When you're working with a relational database, you need some kind of tool, and even in the NoSQL world, for Couchbase we have an enterprise admin console that lets you take a quick snapshot of what every document looks like and what kinds of attributes your documents have, and that helps. We're trying to solve the same problem; only where the information is stored has changed. There are definitely tools and consoles available for developers and for the operations folks who keep the clusters working as expected. Let me add one more piece onto that, Shannon, before we jump to the next one. If metadata is data, then we should use data tools to manage our metadata. The problem with putting it in a spreadsheet is that it's not directly integrated — or integratable — into your systems. The more you can go with tools, like Couchbase and others, that integrate the metadata along with the data processing aspects, the less likely you are to have translation problems and hitches as you implement everything else that's going on out there. So look for those metadata capabilities.
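Since metadata is data, it can live in ordinary tables and be queried like anything else. A minimal, hypothetical sketch of the sort of homegrown practice repository described next — using SQLite here so it runs anywhere, though comparable DDL works in SQL Server:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entity (
            name       TEXT PRIMARY KEY,
            definition TEXT NOT NULL,
            purpose    TEXT NOT NULL      -- the purpose statement, alongside the definition
        );
        CREATE TABLE attribute (
            entity_name TEXT REFERENCES entity(name),
            name        TEXT,
            definition  TEXT
        );
    """)
    conn.execute(
        "INSERT INTO entity VALUES (?, ?, ?)",
        ("ADMISSION",
         "A patient's formal entry into care.",
         "Anchors every encounter; no discharge should exist without a matching admission."),
    )

    # Because the metadata is just data, reporting on it is one query away:
    for name, purpose in conn.execute("SELECT name, purpose FROM entity"):
        print(name, "-", purpose)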
On the other hand, be very, very careful, because many of these metadata repositories will get you into a six- or seven-figure purchase, and most organizations are not really ready for that. Here's what we did with a health organization recently. The organization said they really couldn't keep track of their doctors — I know that sounds a little terrifying from a healthcare perspective, but it was a very reasonable problem: they had a lot of different doctor types with privileges at their various facilities, and they wanted to manage all of that. That would normally be called a master data management activity, and we'll do a webinar on that later in the year — August or September, I think. The problem is that one of the hospital systems had gone out and spent seven figures on a master data management solution — a metadata management solution — in this area, and they wouldn't have had a very good experience: our numbers tell us almost 90 percent of those fail within five years, and that's not a very good record. What we're talking about instead is something you could build in your own organization. Everybody has a copy of SQL Server sitting around, because Microsoft has gotten its way into all organizations, and it is relatively easy to build a metadata-like repository in your SQL Server environment. I'm not saying use that forever; I'm saying use it to practice. Learn a little about the discipline. If you play with this for a year, you'll be in a really good position to have a great conversation with a vendor. But if you go straight to a purchase, it's like handing the keys of your brand-new Tesla to a sixteen-year-old who's never had any driving experience — you just shouldn't expect a good outcome. All right, moving to the next question: do you know any good frameworks for enterprise data models, or good examples of such — that is, finished models? I'd recommend a series of books to look at. The first, by Len Silverston, is the universal data models series, a three-volume set sitting right here on my shelf. More importantly, in addition to describing a whole series of models — I called Len up out of the blue one day and said, "Do you have a medical systems healthcare billing model?" and he said, "Yeah, page 27" — the three-volume set costs less than $150 and comes with a CD. I know most people don't have CD drives anymore, but if you can find one, you can incorporate those models directly. Anush, does Couchbase read ERwin-type data models? I'd be surprised if it didn't, because usually it's just DDL, so you probably do. That's a great place to start. David Marco has a book in that area as well, and David Hay is actually the first author in this space, with his universal data model patterns. Those are the places I'd start with that particular exercise. And the key question is: how many times do we need to build a payroll system from scratch? The answer should be never again. There are lots and lots of these pre-done data models out there that you can gain access to — I've even got banks that cooperate on metadata in the background, because that's not against the law to do. Any comments, briefly, on graph modeling as it may apply here? Do you do any graph modeling, Anush? It's definitely one of the pillars, and there are different products out there. So far we have not ventured into graph modeling ourselves, but you will hear about RDF graphs or property graphs in the market. What a graph captures is the association of one object with another object. It's very prevalent at Facebook — we can just use Facebook as the example. Many use cases are built that way, though not everything, because no organization today uses just one technology; as we were discussing before, for every specific problem they pick the right technology.
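The friend-of-a-friend lookup described next is, at bottom, pointer-chasing over an adjacency structure, which is exactly what graph stores optimize. A minimal sketch with made-up names:

    friends = {
        "ana": {"ben", "chi"},
        "ben": {"ana", "dev"},
        "chi": {"ana"},
        "dev": {"ben"},
    }

    def friends_of_friends(person: str) -> set:
        # Two hops out, excluding the person and their direct friends.
        direct = friends.get(person, set())
        two_hops = set().union(*(friends.get(f, set()) for f in direct))
        return two_hops - direct - {person}

    print(friends_of_friends("ana"))  # {'dev'}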
So what I was going to say is that at Facebook, when you look for friends of a friend, those results come from the graph database; but when you're looking at, say, image feeds or blog feeds, those might come from a NoSQL database. It's a mix of many things. To answer the question: we don't have graph support yet. We're making sure the features we do have are of the highest quality, with the best performance and scalability, and then, depending on the appetite from the user community, we'll see whether graphs make sense. But there are Apache-licensed, open-source graph plugins available for NoSQL databases as well, including some popular ones. Absolutely — I wasn't sure whether you had gone that way or not. It's somewhat of a niche market, but since we're talking about Facebook today, it's a really good one. The best way to think about graph databases is that certain types of data structures work really well using strictly pointers, and that's where the friend-of-a-friend metaphor Anush used comes in. Many open data projects go this route, because the triples used to build open data lend themselves very nicely, in many cases, to graph databases. In that case, you can still use data modeling in the relational mode to get your requirements correct, while a graph database may be a much better implementation in the real world. So I would never suggest skipping the data modeling, particularly if you're looking at a production system — and Anush gave a great example of why you might postpone some of it: put it into the document itself, because you may be doing exploratory work, and that exploratory work is a lot easier in the technologies he was describing. Well, Peter, Anush, thank you so much for these great presentations, and thanks to our attendees for all the great questions and for being engaged in everything we do — we just love it. A reminder: I'll be sending a follow-up email by end of day Thursday with links to the slides and links to the recording of the presentations as well. And I think that's about it. As Peter is putting up on screen, I hope we see everybody in a couple of weeks in San Diego — so much fun — and there are the upcoming webinars; hopefully we see you at least next month. And if you have trouble figuring out which one Shannon is, listen for her voice — she'll be walking around the floor and you'll go, ah, Shannon's here, I can hear her. Look forward to seeing everybody in San Diego next month. Thank you, and thanks to Couchbase for sponsoring today. Anush, thanks so much for joining us — it was great. Bye, all. Thanks. Have a great day. You too.