Hello, and welcome. My name is Shannon Kemp, and I'm the Chief Digital Manager of DATAVERSITY. We'd like to thank you for joining this DATAVERSITY webinar, Mastering Data Modeling for NoSQL Platforms, sponsored today by IDERA. Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A in the bottom right-hand corner of your screen. Or, if you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag DATAVERSITY. As always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and any additional information requested throughout the webinar. Now, let me introduce our speaker for today, Ron Huizenga. Ron has over 30 years of experience as an IT executive and consultant in enterprise data architecture, governance, business process re-engineering and improvement, program and project management, software development, and business management. His experience spans multiple industries, including manufacturing, supply chain, pipelines, natural resources, retail, healthcare, insurance, and transportation. And with that, I will hand it over to Ron to get today's webinar started.

Hello and welcome. Thank you, Shannon, and thank you for having me here today. I'm excited to be talking about this topic because, of course, with the proliferation of different types of data in our organizations, whether it's NoSQL, social media streams, or everything else, we're dealing with more and more data in our organizations all the time. So being able to model for all these different platforms and all these types of data is extremely important for the survival, frankly, of businesses going forward as we cope with all this information. So with that, I'll launch right into it and give you a brief overview of some of the things that we're going to be talking about today. I'm going to set the stage a little bit and talk about data trends and data usage in organizations in general as a backdrop. And that's really going to feed into why we need data models in the first place. We'll then talk about the data model types that we typically use in an organization. We'll talk about capturing important metadata along the way, using extensions to the traditional data structures and everything like that that we might want to define. And then specifically, I'll talk about NoSQL platforms and some of the things that we have to do there. The examples I'll give and cover briefly are things like Hive-based platforms and also JSON-based platforms like MongoDB. Then we'll take a slight departure from that. We'll talk about what kinds of techniques we can use when we're deploying. So for instance, if we're doing some data modeling and we are going to be deploying to a graph database, what are some of the things that we need to keep in mind there, and how do we go about managing our modeling throughout the process? Then I'll wrap that up and then we'll have some Q&A to talk about it. In terms of data itself, when you look at the industry, and these figures are actually from a couple of years ago, so it's even getting worse as we go on here, we're seeing increasing volumes, velocity and variety of enterprise data that we're having to cope with in our organizations.
Simply speaking, we're seeing in any given organization a 30 to 50% year-on-year growth of the data that we have to handle in our particular organization. On top of that, everybody is aware of the increased risk that we have: data breaches, non-compliance, privacy breaches, and just poor quality of data, where we have to clean up the data to be able to consume it and use it in our organizations. It's estimated that the annual cost in the US alone for data cleanup is $600 billion a year, which is not small change; it's a huge amount of money, and it's increasing as we go along. By contrast, while we have these increasing volumes and this increased risk, what we're also seeing is that only 5% of enterprise data is typically fully utilized in any given organization, and we're seeing that on the decrease. That's because as we're grappling and coping with more and more data, there's less and less that we're able to utilize effectively in the organizations themselves. Let's look at one aspect of that: what happens when we make the data available to our business stakeholders. This is a survey from a couple of years ago now, and one of the questions that we asked was, do you suspect that your business stakeholders interpret the data that you're providing incorrectly? And when you look at the graph, it's quite astounding: only 9% could say that their business stakeholders never used or interpreted the data incorrectly. 10% didn't know, 14% said it happened frequently, and 67% said at least occasionally it was being used or interpreted incorrectly. When you look at that, it really means that 91% had a potential to interpret data incorrectly, which is a very large percentage. The flip side of that question was, what about if business stakeholders were actually using the wrong data to make decisions, not just interpreting it incorrectly, but using the wrong data? Luckily the "no" answers went up a little bit, but only to 13%, and when we look across the other answers, that means we still have an 87% potential of people making decisions using the wrong data in their organizations. And of course, as we all know, that can have disastrous consequences. So let's repaint this picture. We've got more and more data coming into our organizations. There's increased risk and fines for non-compliance, and we're using the data less effectively as we move forward. And we still want more. In addition to the traditional data sources that we're seeing used in organizations, which may be our own databases, our own data warehouses, our own decision support systems, we're wanting to harvest more and more data both from inside the organization and outside the organization. And of course, the buzzword used is big data. I look at it, though, as all data that we really need to manage more correctly. And as we see these data streams coming in, whether it be from social media feeds that we're trying to reconcile and analyze, or even internal systems where you're doing, for instance, plant data acquisition or pipeline data acquisition and those types of things from sensors, you get this fire-hose stream of data coming into your organization. And trying to interpret and make decisions with it really is like trying to drink from the fire hose. So we need a way to address this.
One of the ways that this is being addressed in the industry, particularly when we start to bring in data feeds, non-structured data, and those types of things from outside our organization, as well as complementing it with data from inside our organization, is using the concept of a data lake. Now, the interesting thing about this is that it's a different approach than our typical extract, transform, and load processes, taking data that we know in our organizations, staging it and moving it into data warehouses for consumption. We have increased risk in this area because not all this data is sourced from within our organizations. So the quality and the usability of the data can be suspect, and that's why we have teams of data scientists and those types of things trying to go through, do analytics on this data, and make sure that it's actually usable and you can drive decision making with it. One of the things that people don't talk about very often, but that is very important to look at, is that when you're analyzing this type of data that you're sourcing from outside, it can also introduce a bias in the results that you're looking at, and I'll give you an example of that. Let's say I'm a retailer and I'm trying to correlate the success of a product that I'm selling to market feedback. Well, generally speaking, and I think we've all seen this particularly in social media, people are more apt to communicate negatively about something rather than when they're satisfied with products or those types of things. So if we take that data verbatim without really interpreting it correctly, we may launch a great product, but the few detractors out there that may be tweeting about it may be saying negative things, and we may think that we have a negative product experience on our hands when it's limited to a select few. So again, when you're dealing with data that's harvested from outside and put into these data lakes, you need to be very careful in terms of how you interpret that data. Again, the data quality itself can be very suspect. So one of the questions I would ask you is: what is in that data lake of yours? More often than not, it turns into a data swamp. When we deal with data that we source and control internally, it's typically more of an assembly-line approach, to use a manufacturing parallel, where we source the data from our internal databases, we may do some transformations on it, we may stage it and then put it into data warehouses and data marts. When you're dealing with data that's being pulled from all kinds of different sources into your data lake, again, you don't necessarily know what the quality of the data coming into the data lake is. So in contrast to the assembly-line type of approach, you often find yourself building something that's akin to a water treatment plant, to take that swamp water and distill it and make it something that's consumable to your business. It's very important to make sure that you're using the data correctly in your organization. And this means that there's a lot more time and effort spent on analyzing and cleansing the data. And that's why we have many data scientists in organizations today. This also brings into question the value of data, which is extremely important to organizations, but there's now a new life cycle coming into play for how we deal with this data as well.
We still need the key skill sets like data architecture for data design and management. In fact, we need those skill sets more now than ever before. We also still need ETL and software development skills, but we're really complementing that more with data analysis, statistical analysis and that type of thing, to reconcile the inside data with the outside data that we're collecting. And of course, business analysis and just discovery are becoming more and more important. The thing we need to look at is that to deliver value from this data, we need to make sure that we're doing proper validation. We're integrating the data from these disparate data sources that we may have staged in our data lake. We may need to do data enrichment to make it fit for purpose in our organizations. And of course, it's the usability that's very important. It's not just the fact that the data is correct; it also needs to be fit for purpose. And of course, I've already talked about this a little bit: we have many more new data sources bringing more data into our organizations. And so there's a different and higher total cost of ownership associated with dealing with data in our organizations as well. To be able to handle this, we're also outside the typical life cycle where we used to do things like data design on a data model, generate some DDL, deploy it out to a database and then start using the database for the application. Data modeling and analysis and architecture now is much more about discovering these data sources, capturing the data and understanding it through reverse engineering of those data sources into the metadata, so that we can document those databases and data sources and then integrate the usage of that data into our environments. But why do we need data models, you ask? This is a slide that I've used in many presentations, and I usually park it for a while and then come back to it, because it really is a very good paradigm for what we need to deal with in terms of modeling in our organizations as a whole. For those of you not familiar with this, this is the Winchester Mystery House. And looks can be deceiving, because from the outside this looks like a beautiful grand mansion that, you know, looks very expensive. It looks like a wonderful estate that many people might want to live in. It is a real house, the Winchester Mystery House, and it's actually located in San Jose, California. The building of this was commissioned by Sarah Winchester of the Winchester family, as in the Winchester rifles, if you've heard of those. But when we look at this, this is something that's been built up over the years. The evolution of this structure was 38 years of construction. There were 147 different builders. There were no blueprints for this, and there was no planning. It was built ad hoc, added on to, and grown organically, if you will, over the years. That's a very similar paradigm to how we're dealing with data and systems in our organizations. Quite often we start with some legacy data sources. We've added other solutions into play. Now we're starting to harvest data from outside in terms of putting it into our data lakes and everything else. And now we're trying to synthesize this and make it something that's very meaningful to us. But when we look at this Winchester Mystery House, we see some very interesting results.
When we look at this particular mansion that looks so beautiful on the outside, we see several things: seven stories, but there are 65 doors that lead to blank walls, 13 abandoned staircases, 160 rooms with 950 doors. My favorite is 47 fireplaces with 17 chimneys; I'm not sure how that really works. Miles and miles of hallways, secret passages, and 10,000 window panes, and every single bathroom has window panes in it. So again, for the end result, you have to be very careful and you have to plan and design accordingly to get a desirable outcome, which of course didn't really happen in this case. Let's translate that now over to our world of data models. What you see on the right-hand side is the DMBOK wheel, for the Data Management Body of Knowledge, and a number of the different aspects of data governance and the parts that come into play. All of these things are extremely important, and whatever we're doing with data in our organizations, we have to keep data governance at the forefront of everything that we do. So for data models in particular, we can really use them, whether we reverse engineer these databases in or we're designing databases, to really understand what that organizational data is about. What's important? Where is the data? We can have data in many different places that apply to the same underlying concepts. We may have customer data scattered across multiple databases. We may be sourcing data about our customers from social media feeds and other places and bringing that in and trying to correlate it. Where did the data come from? Very important. If you don't know where the data came from, how can you rely on its accuracy and fitness for purpose? What's the chain of custody of that data, not only outside your organization before it got to you, but also how is it used and utilized within your organization? And what are the business rules associated with that data? All of these things we can do based on a blueprint of data models. And then from a governance perspective, how do I identify the private information? Again, we talked about data breaches and data security early on. We need to make sure that we're identifying private information and we're keeping it private. Data retention comes into play. When information comes in, how long should I be retaining it? Some of it's forever. Some of it's for a period of years. It really depends on the different data classifications and how long you need to keep the information. It's also an area that a lot of businesses tend to ignore. The classification and retention policies around data are extremely important. How do I classify this? Master data management: am I dealing with master data? Am I dealing with reference data? Am I dealing with transactional data? The way we analyze and handle it, again, is different. And we can use our models and the underlying metadata to categorize all this for us. And again, fit for purpose: data quality is a huge concern. So we need to make sure that the data we're using is not only accurate, but fit for the purpose that we're using it for. Going back to those earlier slides, misinterpretation or misuse of data: this is a prime factor right here. We need to document, make sure the data is of good quality, and make sure that people understand the purpose of that data. And also audit trails: what changed in the data and why, in its journey through the organization? Well, models and the associated metadata are our means to be able to master this.
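To make that idea concrete, the governance metadata described here, classification, retention, master data class, stewardship, compliance mappings, can be thought of as simple attachments that ride along with each model object. The following is a minimal sketch in Python of that idea only; the field names and values are assumptions for illustration and are not taken from any particular modeling tool.

```python
from dataclasses import dataclass, field

@dataclass
class GovernanceMetadata:
    """Governance attachments that could accompany an entity, table, or attribute."""
    security_classification: str = "Internal"         # e.g. Public, Internal, Confidential/PII
    retention_policy: str = "7 years"                  # how long the data must be kept
    master_data_class: str = "Transactional"           # Master, Reference, or Transactional
    data_steward: str = ""                             # who owns the definition
    compliance_mappings: list = field(default_factory=list)  # e.g. ["GDPR", "HIPAA"]

# Example: attaching governance metadata to a Customer entity in a model.
customer_entity = {
    "name": "Customer",
    "definition": "A person or organization that purchases goods or services.",
    "governance": GovernanceMetadata(
        security_classification="Confidential (contains PII)",
        retention_policy="Retain 10 years after last activity",
        master_data_class="Master",
        data_steward="Customer data steward",
        compliance_mappings=["GDPR"],
    ),
}
```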
Let's talk about the different model types and diagrams that we may use. In terms of data modeling, and I'll talk about this a little bit more on the next slide, we're typically used to conceptual, logical, and physical models, for those of us that have been doing it for quite some time. And I'll go into that on the next slide in a little more detail. On top of that, we have specialized data models for the types of things that we're doing. We have dimensional models for when we're doing data warehouses, data marts, and those types of things. We also have specialized data models for our deployments to some of the NoSQL platforms. The physical implementation characteristics are different than what we're used to in some of our relational models, and in fact, some of the constructs that we need to be able to support are not supported in relational notation or relational databases. And data lineage: knowing where the data has been, the transformations that the data has undergone on the way through, and what the target systems are. And data lineage isn't just for ETL processes. It can also be used to document your data flow across multiple steps throughout the organization. Where you really get context is when you start to take that type of information and then also correlate it to your business process models. So now you'll not only have a statement or a blueprint of what the data is in your organization, you'll also know the lineage of how it's been used and transformed. And if you can associate that with the business process models and the business processes acting on that data, you are now able to tell the full data story in your organization. So back to the data models themselves. Let's talk about the three traditional levels of data model that we use. First, we talk about the conceptual models. And it's important that we remember that conceptual models are technology neutral. They're the high-level layout of entities and their relationships. And they're really used to understand and establish a contextual consensus amongst your business stakeholders. We go one step further as we start going down what we might call an elaboration path to the logical models. What we typically do in our logical models is start to add more and more detail. So where the conceptual may have brought out some business concepts with some partial attribution, we're now adding more attributes and characteristics to our models. We're giving more context around the relationships between those concepts that we're representing, and those are business rules. And we're also documenting it in terms of terms and definitions. Then, when we finally get down to the level of physical models, that is the first and only place that we should really be thinking about the physical implementation and deployment of this data. Again, a physical model is tied to a particular database, and actually I'll say data store implementation, because we're also talking about NoSQL data stores here. And this is where we include the implementation details. So for relational databases, we're talking about things like indexing and federation; for NoSQL platforms, we're talking about basically the file systems and where it's loaded and everything like that. So we're really talking about very concrete implementation characteristics. Now again, our life cycle has changed.
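Before looking at how that life cycle has changed, here is a minimal sketch of how a single concept might be carried from the technology-neutral logical level down to a physical table on one relational platform. The entity, attributes, data types, and rules are assumptions for illustration only, not content from the webinar slides.

```python
import sqlite3

# Logical level: a technology-neutral description of the concept and its business rules.
logical_customer = {
    "entity": "Customer",
    "attributes": ["customer id", "name", "email", "status"],
    "rules": [
        "customer id uniquely identifies a customer",
        "status must be one of: Active, Inactive",
    ],
}

# Physical level: the same concept expressed as implementation DDL for a specific platform,
# where the business rules become keys and value restrictions (check constraints).
physical_ddl = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,                           -- unique identifier rule
    name        TEXT NOT NULL,
    email       TEXT,
    status      TEXT CHECK (status IN ('Active', 'Inactive'))  -- value restriction rule
);
"""

conn = sqlite3.connect(":memory:")   # in-memory database, just to show the DDL deploys
conn.execute(physical_ddl)
```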
So we're not often going through this conceptual, logical, physical layered modeling in all respects when we're consuming data from the outside and trying to take it out of our data lakes. We're actually going the opposite direction. We're doing things like reverse engineering, figuring out what those physical data stores look like. Then we're trying to derive meaning from them. So we're going through an abstraction process where we're now trying to take these physical characteristics, move up into a logical level and actually assign definitions, meaning and business rules to them, and then up to the conceptual level so that we can communicate that out to other areas of our business. Let's look at some of the general types of constructs that we deal with when we're trying to do this. When we look at a data model, it's much more than a picture of boxes of entities and relationships. It really can, especially at the physical level, be the full physical specification of what those data stores represent. So again, for logical and physical, when I'm working at this particular layer, we can also have things like persistence boundaries defined. The diagram you're seeing here not only has the entities and their relationships to each other, but these groupings or boxes around them are what we call business data objects. So it allows us to encapsulate related entities or tables into the business objects that they represent. For instance, if we take something like a sales order, typically a sales order is comprised of an order header, order details, and other things. Data modelers and architects get this, and this is how we design things. Business users, and often developers, think in terms of that higher level of object and how it interacts. So we now have the best of both worlds, where we can actually encapsulate that information and represent the technical implementation of tables and relationships, as well as the business objects that are made up of those tables and those relationships. Descriptive metadata is extremely important: names, definitions that we put into our data dictionary, notes about how we use and implement the information. Implementation characteristics such as data types, keys, indexes, and views are all extremely important. And again, when we look at these data models, a model really is a statement of business rules. All the relationships that are referential constraints in a relational model represent business rules about how those pieces of data interact with one another. We also have value restrictions, which we may see implemented as check constraints and that type of thing. And of course, security classifications and rules. We talked about different security regulations and penalties earlier. Different organizations are subject to different regulations. There are some that are global and apply to all, but you may have things like GDPR, which is a global type of thing that applies to your business. If you're in health care, you're dealing with regulations like HIPAA. So what you really want to do is make sure that you have the governance metadata as well. In addition to security classifications, you also want to make sure that, again, the things I was talking about earlier, like master data management classes, data quality classifications, and retention policies, can all be defined in your metadata and at the model level. So if I come in a little closer here, I'm going to show you one way that we can do this. You'll see this one looks a little bit different.
You see a little bit of a zoom in on the entities here. And we're seeing a number of things. We're seeing not only the entities and their attributes; we're seeing a number of characteristics that are defined in our enterprise data dictionary. If you look down my left pane, they're called attachments, because we attach them to the different objects, and they can apply to entities, attributes, and other things. The things I talked about earlier, like business value, data retention, things like ownership, like who the design stewards are, master data classes, those are all defined in our data dictionary, as well as our compliance mappings and our security classifications. We can document this in the metadata of every individual entity or table in our models. And we can go right down to attributes, domains, however deep you need to go in terms of defining that metadata. You need to define and document that in your models so that it can be consumed properly downstream. This is like the blueprint for your business. And one of the things that I would like to call out here is: let's assume you were building a bridge or a skyscraper. Would you do it without a blueprint for how to construct it? Our data models are no different, and the data architecture in our organizations is no different. We need that blueprint, and we need these models to tie it all together so that we know how our data works in the organization. Let's switch gears for a second here and go beyond the typical relational constructs. Let's talk about a couple of different implementations here in terms of physical deployment. So on our physical models for something like Hive, you can see that the structure is a little bit different. If you look at the right side of the screen, what you're going to see here is HiveQL, which is the conceptual equivalent of the DDL that you would have if you're used to SQL data models and that type of thing. The syntax is very similar. It talks about the fact that there's a key. It talks about whether fields are strings and those types of things for data typing. There are comments. There are things about row formats. It also talks about terminators and delimiters and those types of things. And it also talks about where it's stored in a file system physically. If you look on the left side, there's some other information that you see here. All of these things are generated from the characteristics that we can define in our data model. So if we look at things like the row format: is it delimited? Is it using a serializer/deserializer? All of these metadata characteristics are captured, and that helps you create those structures and deploy them out to a Hive database. We can also reverse engineer and pull this information in. If, say, our data lake was built on Hive structures and that type of thing, it gives us a way of understanding what's out there. Let's switch gears now to a document store like MongoDB, which is based on JSON. When you start reading this, it really is more like reading a programming language or syntax, because what you've got is very similar in character to what you're used to if you've been using XML or those types of things. You really have a lot of tagging, meaning in the syntax, and then the data is embedded within that.
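The slide itself isn't reproduced in this transcript, but based on the walk-through that follows, the documents would look roughly like this. It's a sketch using pymongo, with field values assumed for illustration: a publisher with a single embedded address object, a patron with an array of embedded addresses, and a book that references the publisher by its object ID.

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["publishing"]

# A publisher with a single embedded (nested) address object, shown by curly brackets.
publisher_id = db.publishers.insert_one({
    "name": "O'Reilly Media",
    "address": {
        "street": "123 Example Street",
        "city": "Sebastopol",
        "state": "CA",
    },
}).inserted_id

# A patron with an array of embedded address objects, shown by square brackets.
db.patrons.insert_one({
    "name": "Joe Bookreader",
    "addresses": [
        {"street": "123 Fake Street", "city": "Faketon", "state": "MA"},
        {"street": "1 Some Other Street", "city": "Boston", "state": "MA"},
    ],
})

# A book that references its publisher through the related object ID,
# the document-store analogue of a foreign key relationship.
db.books.insert_one({
    "title": "MongoDB: The Definitive Guide",
    "publisher_id": publisher_id,
})
```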
So if we look at this particular example, what you're seeing here is a small publishing database. We've basically got O'Reilly Media, which is a publisher; that's where we started, with the insert of the publisher at the top of the screen here. O'Reilly Media is who it is, and we see by the curly brackets that there's an embedded, nested object for an address structure here. We can also see a slight difference here; this one looks a little bit different. This is a patron, and what we actually see is that this particular instance of the patron has a structure, and you'll see the square brackets that are there as well. That designates, when we see it, that there's an array of objects here. So what you see now is we have a particular patron, who's named Joe Bookreader, and he has multiple addresses. So there are multiple address objects embedded within this structure. Now, the interesting thing about things like MongoDB or JSON databases is that this can vary with every instance. So we may have some instances that have multiples, we may have other instances that have certain attributes, and in other instances we'll see different attributes, and we don't have to have placeholders like we would in a SQL statement, where we're doing an insert with null placeholders and those types of things for data that we're not inserting. We just basically say this is the data that's going in. It's a variable structure, but that also brings about its own challenges. One more thing as we look at this. Again, we're kind of looking down in the JSON code here, and what we're really looking at is how do we figure out from JSON which things are related as well? We saw the embedded objects and object arrays, but how do we find things like our typical references, or what we typically think of in a relational model as referencing relationships? Well, this is done by the related object IDs. So for instance, we have a title, which is the MongoDB Definitive Guide in this first example, and then we have a publisher ID, which really is an object ID, a serialized object ID of another object in this MongoDB document store. As you're going through this, we're seeing the tags, we're seeing the data, but how do we distill that out and actually make sense of it? Well, there again, the answer to that is a data model, and it's a different type of data model, because now we're dealing with objects. So if you look at the left-hand side of the screen, instead of things like entities and attributes, you're seeing objects; you're still seeing the fact that we detected relationships going on there; and we're also talking about fields that comprise those objects. When you look at the actual diagram, what you're seeing is that those things we talked about, the embedded objects and arrays, are depicted a little bit differently on the diagram. We have what's called "is contained in" notation here, where, for instance, the publisher has one address. So if you look at the relationship line between the publisher and the address, you're going to see that there's a diamond on the one end and a cardinality of one on the address end. That tells me that there's a single address object embedded within the publisher, and you also see at the bottom of the publisher that the address object is indicated there as well. We also had patrons that had addresses, and if you'll remember, they had the array of addresses.
So what we actually see there is the same type of connectivity, but we see an asterisk on the address end rather than a numeric cardinality, and that tells us that it's an array of addresses that are in there. Now, how do we do this? Because what I said earlier was that these can change with every instance of the record. So to be able to turn this type of thing into a data model, we actually need to interrogate the JSON throughout the data store, figure out all the different constructs that are out there and all the different attributes in the different instances of records that have been characterized for these, as well as the embedded objects and the embedded object arrays, rationalize that, and then come back with a picture of what that data model looks like. And a picture really is worth a thousand words here. You can't really figure this out without a picture, just by reading through the JSON code on a very large data store. So now let's talk about modeling and techniques for modeling. Now it's interesting, again, with these big data or NoSQL deployments that, as I alluded to earlier, our enterprise data environments have become increasingly complex. Not only due to the volumes of information, but because we have a proliferation of these different technologies and back-end databases or data stores that we're using. Now, a given organization may have hundreds of different data stores, including multiple relational database platforms in-house, document stores, other NoSQL platforms, basically data streams, all those types of things. What we're really seeing, though, is that the rate of adoption for NoSQL also varies greatly in organizations. Some have had remarkable success on the leading edge, while others are still evaluating the feasibility. And they haven't really jumped into it because, for some organizations, the relational platforms are serving them very well, and they haven't necessarily had a need to jump into NoSQL. Again, you don't want to jump into a technology for technology's sake; you want to do it because there's a benefit to your business. That's what successful adoption is all about: making sure that there's a business need that you're actually meeting and implementing there. So when we look at this, what we also see is that a lot of the NoSQL platforms are touted as being radically different. But are they really different? Well, yes, the underlying implementation is different, the deployment is different, but the modeling benefits and principles that we've had for many years still apply. Again, remember I talked about the conceptual, logical, and physical layers, and that the physical modeling was the modeling layer that's technology dependent. The others, conceptual and logical, should and must remain technology independent. We need to focus on understanding the business rules and the relationships of the data concepts in our business, to understand them properly and make sure that we take proper advantage of the physical implementations that we're deploying. Again, conceptual modeling is business concepts, a high level of abstraction, no implementation detail; logical is the data and the business rules of the data, still technology agnostic; and physical modeling is where we talk about the platform-specific implementation constructs and details. So sometimes I get the question, how do I model if I'm going for a graph database? Well, my simple answer is that at the conceptual and logical layers, you're going to model exactly the same as you did before.
It's about the business and capturing the business rules, the conceptual relationships between the objects that you're trying to model in the business, and the characteristics of those objects. You rationalize it further as you get down to your logical and your physical levels. So here's a fairly simple example that I'm showing, and I'm starting with a conceptual model that has some attributes on it. This is an example of a very simple air taxi operator. They have a fleet of small general aviation aircraft, and they travel between rural airports and larger commercial airports, with point-to-point travel between mid-sized airports. There are a number of business rules that we need to encapsulate here. Basically, each airport has a unique identifier. Pilots are stationed at the different airports or aerodromes, which are their home bases. Crews are dependent on the type of aircraft used: some flights have single-pilot crews, others have two-pilot crews. Flights occur between an origin and a destination, and departure and landing times are recorded as per aviation regulations. All the typical types of things that you would see. Also things about the descriptions of the aircraft themselves. Every aircraft has a registration and serial numbers. They have different models, which means there's a different airframe, propeller, engine or engines, those types of things. So there's a lot of information that we would start to record here. And again, I'm not fully elaborating this in the example, but this just gives you a basic idea of where we're starting from. So if we were to take this one step further to a logical model, this one I've kept very similar. I actually didn't introduce any more attributes here, but typically we might have seen in our conceptual model some entities that didn't even have attributes, whereas in our logical layer we would then start to drive out more and more attributes about the subject that we're talking about. The relationships here become very important. Again, all these business rules that we talked about, in terms of aircraft being a certain model, or flights using departure and destination aerodromes, aircraft being assigned to a particular flight, the pilots and the crews being associated with a given flight and those types of things, that information is captured here. Now, if we were going to do this as a relational database, our physical deployment would look remarkably like that logical model did. So if we take this as an example, what we've done is we've actually translated it in; we still have our relationships, which of course are represented by constraints, or referential constraints, in our physical database. And now we see all the foreign keys starting to come in to represent those relationships to those different concepts that we saw in the logical model itself. But now let's talk about doing this for graph databases. Now, for those of you that aren't familiar with them, there are a few fundamental constructs that form a graph database, and they're called nodes and edges. They're fairly similar in some respects to what we see in relational databases. Nodes are essentially the same as entities or tables, and edges are really the equivalent of relationships. There's no such thing as a referential constraint, and edges can also contain properties or attributes that describe the edge. Now, interestingly enough, you may see different terminology out there depending on the graph deployment or graph platform that you're talking about as well.
So some use the term edge, others will actually use the term relationship, and sometimes you'll hear the term link used as well. The three are conceptually equivalent when you're talking about graph databases; it's just different terminology on some of the different platforms out there. But what is a graph? A graph got its name because it really is a network, a series of nodes and links that pull these concepts together. And the way that we would retrieve a result set, or a set of data coming back, is by traversing those different nodes along the links that pull all the different pieces of information together. The edge is really an implementation table that has pointers to the nodes that it connects together. And of course, you can actually traverse multiple nodes and join them together, just like you might do a join in SQL to pull different things together across different tables and across those referential constraints. Now, the direct linking in a graph database can often allow a very complex set of nodes and edges to be retrieved. For those of you that have been around the industry for quite some time like I have, you're probably scratching your head and going, well, wait a minute, didn't we already do this in the 70s with things called network databases? And the answer is yes, we did. Graph databases are extremely similar to the network databases that we had years and years ago. The underlying technological implementation is different, but the same types of principles apply. So now that we have that background, how do we actually create a physical model, which will turn into our physical deployment for a graph database, from this business model that we've put together? Well, I've done a couple of things here. I've actually utilized some constructs here, and I talked earlier about those metadata extensions that I used to define things like master data characteristics and that type of thing. What I've done here is I've created one for what the graph table type or graph construct type is. All these items in our logical model, I know, are going to end up being nodes in my graph data store. So what I've done is I've created an attachment type of graph table type, and I've given them all the value of node in this particular implementation. Now what I want to do is take all these relationship lines that I'm showing in my graphical model, resolve them, turn them into the edges in my physical model, and drive that physical implementation. Now, in this particular model I've also used a neat little trick that we can do. These relationships that I've got in my logical model, I've actually made non-specific relationships; in other words, they could be many-to-many or that type of thing, because when we typically go and generate a physical model, we need to resolve those relationships. So I use a neat little trick to be able to do this if I'm going to graph. When I'm generating physical models in my modeling tool, I have a setting where I can say what to do to resolve these non-specific relationships, and I choose the option of creating an associative entity. To overlay terminology there, what I'm really doing is creating an association through an edge when I turn this into the physical model. So I would click my physical generation, and voila, here's a whole bunch of tables, or a whole bunch of boxes, that show up on this particular diagram.
Now, there are no referential constraints, but I've left the relationships in this picture so you can actually see how they are linked together, so you can trace the links between the nodes and edges of how this is deployed out. We had the six original nodes in our logical model. We had 10 relationships there to resolve, so we've actually ended up with 10 edge tables that have been added between those nodes, and again, I'm showing the pictures here to show how they tie together. What I've also done here is I've color-coded the edges just so they stand out a little bit differently on the diagram itself, and you'll also notice I used that same type of attachment, and I actually set the default for it so I could get the tool to do the heavy lifting for me here; the default value is edge. So when I generated these out, they all picked up the value of edge, so they show up as a nice assembly of nodes and edges for my graph database. Now, interestingly enough, you don't necessarily have to go to a graph data store to do this. You could actually implement this and push it out to a relational database if you wanted to implement a graph type of structure, but you typically wouldn't have the foreign keys and that type of thing when you go to a true graph data store like Neo4j or something like that. The connector lines, of course, I've left purely for documentation, but they're also not generated; even if I was going to generate this out, the generation property is turned off, so there would be no constraint generated there. And something else that you see when you're creating a graph database is that we don't typically have business keys as primary keys. In graph databases, it's a very common practice to use a synthetic or generated key, and you'll see that I'm actually using IDs in the way I've constructed this, to show those primary keys in the graph database itself. So how do we put all these pieces together? We have all these entity instances and all these different platforms in our environment. We have NoSQL; let's say we've got Hive, let's say we've got Neo4j, we've got relational databases and that type of thing. And really, what we need to do now is tie together all these like concepts and how they correlate across our enterprise. So we need to find out: where are the duplicates? Where are the complementary data structures? Where do we have instances of one data structure extending another with more information? We want to talk about things that I mentioned earlier, like the data lineage, the chain of custody, the transactional staging, ETL, warehouses, marts, and the transformations that occur in those ETL processes. What's the flow of data through our organization? And we're going to see these data flows against all these different types of platforms, whether we're doing the traditional forward flow into a data warehouse or whether we're pulling things out of a data lake, synthesizing them, and putting them into decision support and that type of thing as well. We also want to tie together the business processes, right down to the level, in certain instances, where we want to actually define the CRUD, in other words the create, read, update and delete operations, associated with specific data stores for those business processes. And we want to tie together things like the business rules and the ownership of that data across all these pieces in our organization.
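Before moving on to business meaning, it may help to ground the node-and-edge discussion above with a small sketch of how a couple of the air-taxi concepts could be created directly in a graph store such as Neo4j, using its Python driver. The labels, property names, relationship names, and connection details are assumptions for illustration; they are not taken from the model shown in the webinar.

```python
from neo4j import GraphDatabase

# Connection details are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

cypher = """
// Nodes play the role of entities/tables; relationships play the role of the edge tables.
CREATE (a:Aircraft {aircraft_id: 1, registration: 'C-GABC', model: 'Example Model'})
CREATE (f:Flight   {flight_id: 100, departure_time: '2019-05-01T14:00', arrival_time: '2019-05-01T15:10'})
CREATE (p:Pilot    {pilot_id: 7, name: 'Example Pilot'})
// Edges carry the business rules that referential constraints expressed in the relational model.
CREATE (f)-[:FLOWN_WITH]->(a)
CREATE (p)-[:CREWS]->(f)
"""

with driver.session() as session:
    # Synthetic IDs stand in for business keys, as is common practice in graph stores.
    session.run(cypher)
driver.close()
```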
What we also want to do is make sure that we're assigning true business meaning to it. So in an enterprise data environment, we also want to be using business glossaries with terms that we can tie in and associate with all these different business constructs that we have in our models. So here's a typical kind of scenario, and this is a gross oversimplification. Again, you may have hundreds of data structures or data stores in your organization. You may have a number of different concepts across these different models that are named differently but correspond to the same underlying business concept. So the way I typically approach it is I'll set up an enterprise model, some people call it a canonical model, where I represent these concepts of interest to my organization. I then have all these other implementation models, and I've got two very simple examples here, with the gray boxes for these implementation models and their constructs. So maybe the business term that I use in my organization is supplier, but if you look at my red links here, I can see that in one implementation model it's called a vendor, but it really is the same kind of concept. In my other implementation model it's called suppliers, in the plural rather than the singular. I see the same thing with items, where maybe I call it an item at my enterprise level. In one database it's called product. In another database it's called parts. It may be called item or items in a third, and it goes on and on. What you need to be able to do is have all these models in a repository so that you can link them together and establish these cross-links between the data. So what we do is we pull these into our repository, and then we create something called a universal mapping, which is really links across these constructs. Now, why would I do that? Because when I'm working in it, I want to see and know where all the other instances of this concept are in my environment. Now, obviously it takes a fair bit of work to get there, because you need to link these together, but when you do so, you come out with something very powerful in your modeling. This is an example of an entity editor in my enterprise model, and because I've gone through and established these links to these other implementation models in my business, I can now do a "where used", and all I do is go to that tab, because it's preserved for me in my repository, and I can see all the other instances of this particular object, which happens to be a customer in this model. I can see it in all the different logical models. If I've done it in a conceptual model, I can see the linkage there, because it may have come from a conceptual model into a logical model and into this model as well. If I've utilized it in a data warehouse or those types of things, or in a NoSQL platform, I can tie them all together. It doesn't matter what the physical deployment is; I can see all these different instances of it in this particular view. What I could also do, if I want to get very granular, is take this down to an entity, attribute, or column level, because there may be certain business attributes that are very critical, where you want to tie together where they're being utilized and in which data stores in your organization. You can do that by setting up the models for them, correlating those models to a data store, and then having that tied together.
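Universal mappings are a repository construct specific to the tooling being described, but the underlying idea is just a cross-reference from each canonical enterprise concept to the differently named constructs that implement it. Here's a rough, tool-independent sketch of that idea in Python; the model and construct names follow the supplier/vendor and item/product/parts examples above and are otherwise assumptions.

```python
# A canonical enterprise concept cross-referenced to its implementations across models.
universal_mapping = {
    "Supplier": [
        {"model": "ERP physical model",      "construct": "VENDOR"},
        {"model": "Warehouse logical model", "construct": "Suppliers"},
    ],
    "Item": [
        {"model": "ERP physical model",      "construct": "PRODUCT"},
        {"model": "Parts catalog (MongoDB)", "construct": "parts"},
        {"model": "Order entry model",       "construct": "ITEM"},
    ],
}

def where_used(concept: str) -> list:
    """Answer the 'where used' question for a canonical enterprise concept."""
    return universal_mapping.get(concept, [])

print(where_used("Supplier"))
# [{'model': 'ERP physical model', 'construct': 'VENDOR'},
#  {'model': 'Warehouse logical model', 'construct': 'Suppliers'}]
```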
The next thing that I talked about is that now that we have all these model constructs from our different models in our business, we can start to put together and stitch together data lineage. This is a simple example of a typical staging area where we're pulling data into staging tables, doing transformations on it, and loading it into a data warehouse. We could do the same thing for a more complex process where there are flows and transformations along the way. I've actually used these types of models to take something like a very complex stored procedure with a lot of inputs and outputs, identify the input flows and the output flows, and show how they worked together in terms of that complex stored procedure and the logic that it represented. So you can be very creative and do a lot of things when you're modeling out data lineage. Again, you want to make sure that you're tying this together with business meaning, so you want to make sure that the business glossaries, where your business users are maintaining the terminology or nomenclature of your business, are also tied into your data models. This is even more important than ever with these NoSQL constructs, because we need to make sure that we're assigning business meaning to all these things that we're sourcing, so that our business users are using the correct data to make their decisions, because we don't necessarily know where it came from until we've gone through the analysis to do so, and we need to capture that in our models. Once I've done that, even in my modeling environments I can do certain things. When I'm working on defining things like entities or tables in my data models, if I have them tied into my business glossaries, then, and this is an example from some of our modeling, any business term that happens to be in an associated business glossary is automatically underlined and hyperlinked. If I mouse over it, I can actually see the business definition as I'm driving out and building this model and deploying it. So again, a very good tie-in. The flip side to this is that quite often in organizations, the data dictionaries that we have are the starting point for our business glossaries. So we can do multiple things: we could export them out of the tools and import them into business glossaries, or, as you're working with things and creating things in your data models, you can push them out to be business terms in your business glossary as well. So again, we're painting a very tight picture of our organizations. We've covered a lot here today, but I just want to summarize very quickly. Again, our organizations are facing huge increases in data volume, more than ever before, which means that our effective data utilization is decreasing. We're seeing this against the backdrop of regulatory penalties and fines, and users are making incorrect decisions with the data and/or using the wrong data for some of those decisions. And of course, big data adds significantly to the volume and complexity. So in our quest to use more, we're actually becoming less effective unless we properly manage and model this with data models, not only for the traditional relational platforms but also for our NoSQL platforms. Again, what are the differences? At the conceptual and logical levels, you shouldn't see a difference in your modeling: you're capturing the business rules, and you're also capturing the interrelationships and the attributes of those different business concepts relative to each other.
When you get down to the physical level, that's when you start introducing these different implementation constructs, not only for the different relational platforms but also for the different NoSQL platforms, based on the types of characteristics they support. And again, even just data models aren't enough. We need to start to tie this together with other model types in our organizations to stitch together a comprehensive picture of our enterprise environment, which means data models, process models that tie it together, data lineage, the universal mappings that I talked about as a way of linking your constructs across the models for the same business concepts, extending it with metadata characteristics for governance, whether it be master data management characteristics, data retention, who the stewards are, those types of things, and then of course pulling that together with business glossaries to assign that true business meaning to all of your data in your organization. That's it for the formal part of the presentation itself, so I'll now turn it over to Shannon to moderate some questions.

And away we go. Ron, thank you so much for this presentation. We have lots of great questions coming in already, so if you have questions, go ahead and put them in the Q&A section in the bottom right-hand corner of your screen. And just to answer the most commonly asked questions: we will be sending a follow-up email by end of day Thursday with links to the slides, links to the recording of this session, and anything else requested throughout. So diving right into it, Ron, here's one on the MongoDB side: when I reverse engineer a NoSQL database such as MongoDB or DynamoDB, etc., do I have to somehow read all of the data so I can account for all potential relationships in the data? Or perhaps I cannot reverse engineer a NoSQL database?

Yeah, the way that we actually do it is, and we realize that these data stores can be huge, so if you want to catch absolutely everything, I guess the academic answer is that you'd want to interpret and parse all of it, but you don't need to. The way we do it is we say, based on the size of your database, take a representative set that you think will capture most of it. So we actually have a parameter that we use to say how deep we really want to go and how many records we actually want to go through to pull that in. Again, trying to bring it all in can chew up memory and maybe outstrip the capacity of the memory on your computer and that type of thing, so we parameterize it so we can just do a representative data set that should capture the bulk of those relationships for you.

Does this universal mapping require repository use?

Yes. Interestingly, universal mapping is actually a construct that's unique to our tool, and the reason it requires a repository for us to do it is that all those models are in the repository. The way we link them is you can start with a concept, and then it'll show all the other databases, and then you can do things like searches on names and those types of things to filter out your list and then build up your universal mappings. Now, you could in theory maintain something externally, but that would be a huge documentation exercise where you would be building yourself some type of a cross-reference, and of course by having that tied in directly to your modeling tool, that gives you the ability, with the where used, of being able to
very easily see that. And the cleanest approach I've found as well is, again, you don't want to do it as a many-to-many, starting from a whole bunch of different models. I always like to consolidate it from an enterprise view, like that enterprise type of model, and tie it back to all the other models, because then you have that one focal point where you can go to find out the instances across your organization.

So, would ensemble data modeling, for example hub, satellite, link, be more suitable for NoSQL modeling?

You know, it really depends. It's interesting; we have all these different NoSQL data platforms. Generally speaking, one of the reasons that we adopt NoSQL platforms is that the constructs can be more flexible. It's a little easier to stand them up and get them going, and of course the structures can change on a lot of these on the way through, per instance of record, or the equivalent of a record, I should say. Whereas when we do something like a relational database, we have to define the structure and account for all instances of that data, so we end up with a lot of blank columns and those types of things. So really, the adoption of NoSQL is more about the flexibility and often the cost to quickly deploy. However, I'll say that you offset that: by putting it, for instance, in a programmer's hands to define this on the fly as they're utilizing the data, you've now flipped it to the other side, where in order to make sense of it you now need more reverse engineering and more synchronization to make sure that you understand the data that's out there, because you're not designing up front per se; you're actually trying to recover that design on the back end. So I'll go into this a little bit more: our traditional life cycle was almost more like a manufacturing analogy, where we would design, build out our assembly line of modeling layers, and then deploy out to a database. What we've really done here, like I said earlier, is we're building a water treatment plant to try to sift that stuff back in, turn it into something that's usable, and then apply business meaning to it. So it's a lot more like an archaeological dig, and deriving meaning from it can actually be a lot more complex.

You know, Ron, we get a lot of questions about metadata, and you talked about it quite a bit, but would the complexity involved in doing this at the enterprise scale require an additional tool, a metadata management tool, instead of doing it straight up in ER/Studio?

Well, with ER/Studio in particular, what we actually have is a repository that houses all of the models, and all the metadata constructs, like all those attachments and everything I talked about, do become metadata in our repository. We also have a metadata repository called Team Server in the ER/Studio Enterprise Team Edition suite too, and we can actually tie all of that metadata together, as well as your business glossaries, terms, and all the rest of it. Now, some organizations do already have certain external metadata repositories, but there are ways of integrating them, or interfacing the data back and forth between those metadata repositories and ours, as well.

Are you still there? I thought I heard you asking something and then you dropped out. Oh, sorry, carry on. Just a question for you, actually, on MongoDB, which you may or may not know: does MongoDB provide join features through an API for low-latency joins
programmatically?

Can you hear me? Yeah. With MongoDB, your joins are really through those object IDs, or those tags for your embedded structures, so it's really the JSON syntax that ties that together. Now, based on the tools that you're building or using over MongoDB, you're obviously going to harvest that from the Mongo structures, and some of the tools, I believe, will allow you to do that in more of a passive fashion. From a modeling perspective, you definitely want to capture it by interrogating that JSON structure and bringing it in so you can draw yourself a picture and understand what's really there. And I don't know if you're talking right now, Shannon, but I cannot hear you. Again... there we go.

Sorry, I lost the audio there for a second. So, Ron, I didn't hear the end of that question, but that's okay. Just going back to the questions, we have one more question here that I think we have time for: how do you manage the constantly changing, evolving aspect of the canonical model elements, for example the management of the life cycle status: the data standard, the draft standard, the retired, etc.?

Yeah, so I guess the answer to that is that data models, obviously, when you have an image of one or something like that, are at a state in time. But the important thing to do is make sure that the way you're approaching your data architecture isn't just a snapshot in time. What you want to be able to do is make your data models and your underlying metadata active, living and breathing. So you actually want to design and push out; and if you're not doing that, especially with some of these NoSQL constructs, if you have developers doing that and you're getting out of sync, that means you're going to have to have a disciplined approach of reverse engineering what they put out there and comparing back to see if there have been any changes that you need to accommodate in your models.

Perfect, and that brings us right to the top of the hour here. Ron, thank you so much for this presentation, and thanks to IDERA for sponsoring today's webinar. Again, just to remind everybody, I will be sending a follow-up email by end of day Thursday with links to the slides, links to the recording of the session, and additional information requested throughout. So thank you again, Ron, and thanks to everybody for being so engaged. We just love the active participation in the chat going on throughout the presentation and love everything that we're doing. So Ron, thank you again, and I hope everyone has a great day.

Thank you. Have a great day, everyone.