Hello and welcome. My name is Shannon Kemp and I'm the Executive Editor for Data Diversity. We'd like to thank you for joining this month's installment of the Data Diversity Webinar Series, The Heart of Data Modeling, moderated by Karen Lopez. Today, Karen will be discussing the key to keys. Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. We very much encourage you to chat with us and with each other throughout the webinar. To do so, click the chat icon in the top right corner of the screen to activate that feature. For questions, we'll be collecting them via the Q&A section, or if you'd like to tweet, we encourage you to share our highlights or questions via Twitter using the hashtag #heartdata. As always, we will send a follow-up email within two business days containing links to the recording of this session and additional information requested throughout the webinar. Now let me introduce our speaker for today, Karen Lopez. Karen is a Senior Project Manager and Architect at InfoAdvisors. She has 20-plus years of experience in project and data management on large multi-project programs. Karen specializes in the practical application of data management principles. She is a frequent speaker, blogger, and panelist. Karen is known for her fun and sometimes snarky observations on data and data management. Mostly, she just wants everyone to enjoy and love their data. You can follow her at @datachick on Twitter. And with that, I will turn it over to Karen to get us started. Hello and welcome. Hi, Shannon. Thank you so much. And I'm sorry that you're feeling under the weather, but I'm hoping that means that you had a lot of fun at a conference recently or something like that. Exactly. 
That was a lot of fun at NoSQL Now, which is kind of an interesting segue, because people think of NoSQL databases as being schema-less, or that there's no data model or there's no structure to them, so they often also don't think about keys or how to uniquely identify something. And I've been thinking about that as I was working on this presentation. Now, this one's going to be focused only on relational databases, so traditional relational database concepts and terms. But I think maybe in the future, I might think about other ways of how we might think about this. So, interesting segue there. So, yes, please do tweet and share. I'm @datachick and our hashtag for our event is #heartdata. As Shannon said, I've been doing this for a while. I love tech and gadgets. I want you to love your data. And I'm very encouraged that so many of you have registered and shown up in person to talk about these things. One of the reasons why I even decided I needed to do a presentation about this is because I do presentations both about data modeling and about databases and NoSQL. And I was used to, you know, throwing out terms that aren't specific to a database, and then someone would finally say, well, what do you mean by alternate key? Because a lot of people learn their data modeling coming from the developer, programming world, and therefore they might not have been exposed to sort of the underlying concepts as much, or some of the business side of thinking about keys and identifiers. So there's all that. But I'm going to hopefully today be able to help you distinguish between what's the word that the data profession uses to talk about a concept versus the way a particular DBMS talks about it. So some of my examples of how a relational database implements something come from SQL Server. But because we're talking for the most part about very fundamental concepts, a lot of times the names are the same across other products, or they're very similar. 
If you know of another name for something I'm showing, I'm also collecting those as well. So if you want to put those in the chat, we can collect those. For some of the concepts, there are just so many terms for them. So don't forget to put your formal questions for me in the Q&A. That's the best way for me to see them, especially because I only have one monitor today. And please chat with each other. Feel free to discuss any of the topics. I think that's one of the best things about the webinars we do. And as Shannon said, for those of you who are joining late, yes, you will get the slides. Yes, it will be next week. There will be a recording. If you're having any audio or video or other types of logistics issues, Shannon will take care of those for me. So, the other thing: there are so many terms and so much jargon, in IT in general and in database stuff, that as I talk to attendees at conferences or people at organizations that I help, I'm trying to translate all the time. I might be talking to a data modeler and might use the term alternate key, but, for instance, my DBA might have no idea what I mean by alternate key or what the difference is between an alternate key and a candidate key. And so that's what this presentation is about. But it's also the translation, not just polyglot persistence across the different types of databases, but, for instance, take the identity property. It's a property of a column in SQL Server, but, see, even I'm mixing it up, in another DBMS it might be an actual data type that you choose. And so that's one of the things that makes it hard to be a data professional on an enterprise-class system. It's dealing with vendor jargon, modeling jargon, the vocabulary that we have. The data modeling community, definitely we have vocabulary. But even that vocabulary isn't very standardized, because it kind of depends on what tools you use. 
One of the mantras I say is: your tools impact your design and your vocabulary much more than you ever think they would. When you learned modeling, and from whom you learned modeling, will help form which set of terms you're using. And then what target DBMSs, or file stores, or data stores you've used. And then just general slang and jargon and the different vendors that use them. And one of the issues, since so many people learn databases and data modeling through apprenticing, is that I also see plenty of myths and misunderstandings and misuses of the terms. In another presentation I'm doing at a conference coming up, about ticking time bombs in your database, I have to spend so much of that presentation talking about just one of the concepts we're going to talk about today, because most people haven't had the time or haven't been exposed to the sort of nuances of having made these decisions. And a lot of times that's because people learned how to do this by picking up one book or being forced to be something like an accidental data modeler or an accidental DBA. So here's what I have to say about keys. Keys are key, and not a pun, but you're going to hear this a lot throughout, so maybe we'll treat it like a pun. Keys are key to performance and data quality. So much of what I read about choosing primary keys and clustering keys and partitioning keys, the whole discussion, is about performance and how important it is to choose the right key for performance reasons. But if we step back to the whole reason why relational databases exist, it was because of data quality and to ensure that we didn't have a lot of file corruptions. So, for instance, in a network database that existed before relational, if you lost your pointers to everything, you lost your data. So those things happen. Some of the concepts we're going to talk about have many ways to implement them in a database design, so in your physical model. 
So I'll touch on those. Not all data modeling vendors use the same term in the same way, and not all database vendors are the same either. So then I get a little snarky here and say, if only there was a profession, like, I don't know, the data management profession, that standardized on giving things standard names, kind of like what we data modelers tell the business that they need to do in order to move forward and love their data. So today's focus: there are a whole lot of other uses of the word key in IT, in development, even in the database world. But today, because we're mostly talking about data modeling, we're going to focus on the use of keys and the terms in entities or tables, so logical and physical, on attributes and columns, and some of the database objects that implement or enforce key-related stuff, and that's my technical term, stuff. But before we move forward on that, I need to give you some polls. So I need to open a poll about who you guys are. And I realize you might be something other than the choices I'm able to offer, because our number of choices here is limited. But what do you consider yourself: a data modeler, a DBA, a developer, some other type of data professional, or other? So it looks like about half of you have answered so far. Oh, now you're doing really well. I'll just give you just a few more seconds. Okay, 15 seconds. And as that's finishing up, then I'll try to share the results. I hope you can see the results, because I can't see what you're seeing. But it looks like about 35% of you consider yourself data modelers, 2% DBAs, 6% developers, 19% other data professionals. I'm sure you are all data scientists then. And then 12% other, and a quarter of you still aren't sure why they pay you or why you go to work every day. That's great. This is about the distribution that I would expect on a webinar called the Heart of Data Modeling, with about a third of you, a little more than a third of you, being data modelers. 
If you have another data professional title and you want to throw it into the chat so that we can see, as some of you have already done or done in past ones, then we can see what your titles are. So the next poll is: who decides keys? And I'm being intentionally vague here. Like, who gets to pick what the keys are in your logical data model? Is it your modeler slash architect, so data architect, data modeler? Is it the DBA? Is it the developer? Remember, we're asking this about the logical data model. Or you're not sure, or it depends. So you're going to have about 30 more seconds to answer this. And about half of you still haven't answered, and you have about 10 more seconds before I share the results. So it looks like about 55%, so a little over half of you, say the modeler/architect decides what the keys are in a logical data model, about 2% say the DBA, which is interesting because we said from the previous poll about 2% of you are DBAs, so that's interesting. 4% say the developers, 11% say not sure or it depends, and 29% of you didn't answer. Some of you are saying in the chat that it's a team effort, where the modeler/architect gathers the information. Yeah, you know, with almost all design, it's rare on an IT project for there to be a designer who just is the decider and goes in and makes decisions without input. Like any good design decision, everything comes down to cost, benefit, and risk. So what the right column or the right data type for a key is, you know, really depends on how much data is going to be in that table. It depends on whether that table has child tables. It depends on how much data and growth and capacity you're going to have, what your shop standards are. So yeah, great answer, that it depends on that. Okay, perfect. And one more poll. Who decides the keys in a physical data model? And so this is really basically what goes into your database, but still part of the model. 
This is the one I'm really interested in, because I think in different shops I've really seen different standards for who's the decider on a lot of things in the physical model. And I think, you know, how one votes on this is really going to be dependent on who does the physical model in general. In some shops, logical stays with the data modeler and physical belongs to the DBAs, or people have a logical model and they don't do a physical model. So the poll is about to close itself in about 10 seconds. Yeah, so we're seeing a little bit more, I'll call it physicality, to this answer. So about 30%, a little less than a third, say the modeler or architect decides the keys in the physical data model. 15% say the DBAs, 16% the developers, 13% aren't sure, and 27% didn't answer. So now you know what the next poll is going to be: who decides what the keys are in the database itself. And again, this is going to really play into: are you model-driven development? Does everything that happens in the database get applied to your model first and then applied to the database? Do you have models that are just inspirations for the people who build the databases? Do you do a little bit of work on both and eventually migrate it back to the model? I mean, we've done presentations in this series about how we approach that, and I think this would also have a big impact. So you've got about 15 more seconds to answer this poll. And these are some numbers that I find fairly interesting, so I'll just wait for that to close so that I can share it. It seems that about a quarter of you say data modelers and architects still decide what the keys are in the physical database, but now 23%, so probably statistically about the same number of people, say the DBA is the decider in the database. And because these numbers are different, it says to me that there's a separation of duties between the data models and the databases. For 9% of you, it's the developer. 8% aren't sure. 
So it seems like we're more sure about these answers in the database. And 35% of you didn't answer. So that's interesting. Let's see what you said in the chat. You get what you're given, someone says. So yeah, a lot of folks said we would answer differently for package and vendor applications, or for people who are doing purely production support and don't have either design authority or design experience. Someone's asking, shouldn't the database follow the model? Well, I think it should, because I'm a model-driven designer, but we don't always get that. I'm going to hope you guys will let me know in the chat whether you're seeing the slides now. Okay, good, slides. So today's focus: I said we're going to talk about things really inside the database that I think of the data model as managing. So there's one more thing. The keys, and again, I'm being vague here, have a couple of roles. And there's a sort of conflict of opinion between developers and designers or architects. Keys form part of the semantics of our models, the meaning in the data. They are used not just in looking stuff up, but in enforcing constraints for data quality. And primary key selection is not just about performance, although most discussions I see on designing keys and modeling keys focus only on primary and foreign keys plus how to choose them so that your database goes fast. But the reason I say it's about semantics is, if we go back to Ted Codd's paper, the whole concept of normalization, and we did a webinar earlier in the year about normalization, is all about keys. Like first, second, third normal form: the questions about what normal form you're in are all based on the keys. So if we use an approach where the keys are all meaningless, the normalization kind of becomes meaningless as well, or can be. In order to do normalization, we have to understand the meaning of the keys and the columns. 
And everything goes to heck when you're trying to do normalization when you have a surrogate key, an arbitrarily assigned, meaningless, every-row-deserves-its-own key. It doesn't mean surrogate keys are bad. It just means that your questions about normalization are a lot harder to answer. So in fact, you may have heard the Ted Codd saying that every fact in a table is either part of the key or depends upon the key, the whole key, and nothing but the key, so help me Codd. The reason I bring this up again is you can see this sort of nice memory technique for third normal form is really based on everything about the keys. I alluded to this in the beginning. One of the interesting things about talking about keys is that we have this one set of terms that I'm going to say are used conceptually, to talk about concepts about keys. And then we have a whole other set of terms that are used physically, I mean that are either objects, features, settings, or configurations inside a database. So terms that are used conceptually: primary key, alternate key, composite key, super key, candidate key, surrogate key. Those generally don't have a setting or a feature in a database. Those things, except for primary key, aren't really set in DDL; you don't alter a table to change them, you do something else. So you have a different way of physically implementing them. And some terms used physically: primary key, clustered key or clustered index, encryption key, partitioning index, identity. When we try to talk as a team about these things, just like, you know, in dealing with rows up and down the Zachman framework, we often lose things in translation, because I might use the word surrogate key in a discussion and the developer I'm talking to automatically jumps to the conclusion that I'm talking about a sequence. I'm just talking about the concept of a meaningless key; I'm not talking about a sequence. 
I'm talking about a surrogate key that, as we'll see in a minute, I might have a lot of different ways of implementing. Some of the things that we're going to talk about in more detail, and I'm going to have these sort of bunched up, grouped together differently. But these are some of the words that I've heard that either have the word key after them when we talk about things, or they're used in the context of talking about how we might use or implement or enforce a key. So definitely primary and foreign key. If you work in the database world and in the modeling world, I would expect you to understand what a primary key is and what a foreign key is. Technically, logical models aren't supposed to have primary keys. They have identifiers, and an identifier at the logical level could be a column, a set of columns, or it could be a relationship, a set of relationships, or a combination of those. So if you're as experienced as I am, you'll remember that in traditional information engineering, there was that distinction. But most of our modeling tools refer to the identifier as a primary key. So I see we've kind of used a physical term on the logical side, which, on the previous slide, is why you would have seen it on both sides. Also, among the ways that we would talk about keys, we might talk about business, natural, or logical keys. Those, I don't think, have really distinct, agreed-upon definitions. But typically, a business or natural key would be how humans might think about identifying a particular instance in an entity, or, for those of you who live on the physical side of the world, a row in a table. And business keys are the perfect way to do the data analysis that you do as part of requirements analysis, so the data part of requirements analysis. Now, developers are running screaming into the hills when I even use the word business key, because they think I'm going to force them to implement it. 
So some of the hardest to decide upon are things like: what's the identifier for a person? What's the natural key for a person? Well, there aren't any. And so we go through all these struggles on the business side to do it. But basically, business keys or natural keys or logical keys are those that people, customers, clerks, managers would think of in identifying an instance of something. And they're often made up of multiple parts. So for instance, you know, city, state, country, region, area, neighborhood: all of those might come together to identify, at a natural level, the location of my house. So yes, we have something called addresses, and we might have postal addresses versus physical addresses. That's our number one example of a key made up of multiple pieces of data, pieces that can change over time, used to identify a location. That's a business key. But then we come to, you know, databases, and the original relational model has absolutely no problem with that. In fact, it didn't have the concept of surrogate keys. So a surrogate key is something that we, on the system side, the IT side, think about as a computer-generated, meaningless, assigned value, usually a number, but there's no reason a surrogate key has to be a number, assigned to identify an instance of an entity or a row in a table. Surrogate key is one of those uses of a word that I think we modelers use without being clear to the people we're talking to about what it might mean. And then we have candidate keys. So a candidate key is a column or a group of columns that meets all the requirements to be a primary key, but the reason we call it candidate is there could be more than one set of those. So you could have different sets of columns, which could be overlapping or could be completely mutually exclusive, that are candidates for being a primary key. So that's a candidate key. Super keys are back in the concepts of data modeling and relational database design, kind of like candidate keys. 
They are the collection of all the keys that could be keys, and sometimes people use this term super key to mean the smallest set out of all those. I've seen that now as I did some research. That's not my understanding of it. And then we have composite, cluttered, and compound. Those are just different ways of saying keys that are made up of multiple parts. So cluttered is interesting to me, because the first time I heard it, I thought the person meant clustered, which is a whole other bunch of fun stuff, but that was just a DBA showing his disdain for multi-part primary keys. Let's see, alternate keys. So I use this one all the time. An alternate key is a column or a set of columns that is a candidate key that we did not choose to be our primary key. And the most likely reason why we want one is we probably assigned a surrogate key to the table. Surrogate keys are usually unique by design, and if they're the primary key, the database, when you put a column or set of columns in the primary key, will enforce uniqueness on them. The alternate key is what we then need to assign: a column or collection of columns that uniquely identifies an instance, so that we can maintain our data quality. So that's an alternate key in the modeling world. Now, the way we physically implement those in the database is typically to assign a unique index or a unique constraint, because we're telling the table this column or collection of columns together must be unique across all the instances in the table. And by the way, as I'm talking, I'm using columns and tables. I should, in some cases, be using entities and attributes, but the reason I'm using columns and tables to talk about this is I'm also talking about how I implement it with a unique index or a unique constraint. And there are some more of those we'll talk about. So I talked about business key, logical key, natural key. 
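The surrogate-plus-alternate-key pattern just described can be sketched in a few lines. This is a hedged illustration using SQLite (in SQL Server you'd declare the constraint with ALTER TABLE ... ADD CONSTRAINT ... UNIQUE); the table and column names are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id   INTEGER PRIMARY KEY,   -- surrogate key, unique by design
        email_address TEXT NOT NULL UNIQUE,  -- alternate key: a unique constraint
        full_name     TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO customer (email_address, full_name) VALUES (?, ?)",
             ("kim@example.com", "Kim"))

# The unique constraint on the alternate key rejects a second row with the
# same business identifier, even though the surrogate key value would differ.
try:
    conn.execute("INSERT INTO customer (email_address, full_name) VALUES (?, ?)",
                 ("kim@example.com", "Duplicate Kim"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Without the UNIQUE on the alternate key, both inserts would succeed and the table would silently hold two rows for the same real-world customer.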
I've also had people refer to these as evil keys and cluttered keys and the wrong keys and broken keys. And one of the reasons for that is that the ideal primary key in a physical design is something that's mandatory, which means there's a value for whatever our key is in every row. And it's unique. So by that nature, the key should be unique. And there are some other things I'm going to show on another slide. Business keys, like I said, there really is not a natural key for a human being. And when I do my contentious issues talks, you know, people say DNA, and I'm like, great, you're going to collect the DNA of all your customers or all of your employees? Or you can't sell someone a jug of milk until they've had their DNA sequenced? That's not going to go over well. But the business comes up with these keys that kind of sit halfway in between. So a typical business key that isn't really so natural, and might not be so logical, might be something like an account number. An account number, you know, might have all kinds of meanings shoved into it, what I call metadata stuffing. So the first letter might say what country the customer was living in when they joined as a customer. And the next three numbers mean what region that customer's in, and then the rest of the numbers in the customer number, you know, represent just a sequence to make it unique. That's a business key. Business people are notorious for coming up with these overly complex, nice sets of rules so that you can look at a key and tell all this information about a customer. The problem, the trade-off on that, for those of you who think about it being used as a primary key, is that a lot of that information can change: customers can move, they can stop being a customer and come back and maybe we reuse their number, all of those things. 
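That kind of metadata-stuffed account number can be made concrete with a tiny sketch. The format here (one country letter, three region digits, then a sequence) is purely hypothetical, made up to match the description above:

```python
def parse_account_number(acct: str) -> dict:
    """Split a hypothetical 'smart' account number into its stuffed-in parts."""
    return {
        "country": acct[0],    # country the customer lived in when they joined
        "region": acct[1:4],   # region code
        "sequence": acct[4:],  # the only part that is actually just a sequence
    }

parts = parse_account_number("C042000187")
# parts == {"country": "C", "region": "042", "sequence": "000187"}
# The country and region parts can change when the customer moves,
# which is exactly what makes this a fragile primary key.
```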
So while business and logical and natural aren't really the same concept, people tend to use these terms interchangeably, and they kind of mean the same thing. They mean usually wide, alphanumeric, maybe even with symbols and spaces and dashes and special characters: the way the business wants to identify something. When we get closer to design, we come up with more terms, but these are still, for the most part, logical terms. I've talked about all of them from our word cloud. But super key, I don't see that term used on a lot of real-life projects; it's out there, though. Candidate key: most people outside the data modeler, and maybe the DBA, weren't part of the process where we identified the three or four collections of columns that could be the primary key. So they're not really thinking about the fact that there might have been four different ways of identifying this customer. But the people on the team will definitely know the primary key, and even the end users might be exposed to that, because we are saying this is how we know one customer is a different customer. And the alternate key. So Karen's rule of design is: if you put a surrogate key on a table, there should always be an alternate key assigned, unless the table is something like a log table and you just don't care to enforce any uniqueness or mandatoriness on that table. So those are the types of tables and entities that I call system tables. They're not usually exposed to the users. They don't get updated. In theory, they might have only a surrogate key, because these would be things where maybe we don't care about enforcing any of that, but that would be rare. So the first thing I want you to do after this webinar is go pop open your data models and try to figure out: do you have any tables that have a surrogate key but have no alternate keys, no unique set of other columns that uniquely constrains an instance of that table? 
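That homework check can even be automated. Here's a hedged sketch against SQLite's catalog (for SQL Server you'd query sys.indexes and sys.key_constraints instead); it flags tables that have no unique index or unique constraint beyond the primary key itself:

```python
import sqlite3

def tables_missing_alternate_keys(conn: sqlite3.Connection) -> list:
    """Return tables whose only uniqueness guarantee is the primary key."""
    flagged = []
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # PRAGMA index_list rows: (seq, name, unique, origin, partial);
        # origin 'u' is a UNIQUE constraint, 'c' a created index, 'pk' the PK.
        indexes = conn.execute(f"PRAGMA index_list('{table}')").fetchall()
        has_alternate = any(row[2] == 1 and row[3] != "pk" for row in indexes)
        if not has_alternate:
            flagged.append(table)
    return flagged

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE good (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")
conn.execute("CREATE TABLE suspect (id INTEGER PRIMARY KEY, name TEXT)")
# 'suspect' has a surrogate key and nothing else unique, so it gets flagged.
```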
So I said that primary key criteria are applicable to all instances, meaning mandatory and unique. And then a lot of people will throw in, in their own presentations and blog posts, that it has to be stable, it can never change, and it needs to be small. Those last two things are important, but those last two things are requirements for performance. So they're not part of the original requirements for just identifying something semantically. But a good modeler needs to understand that if you choose primary keys, especially in your physical model, that will change when the underlying business data changes, and that aren't very small, you probably will be disinvited to the implementation party, because those last two things are important to physically implementing those. So you could choose to have really wide, very compound, composite logical keys or business keys in your logical model. But once you get to the physical model, you want to choose keys that are also stable and small. And I could do a whole other presentation on why they need to be small. The reason they need to be stable is: if your primary key has to be updated, and it's also a clustered key, which we're going to talk about in a minute, and your table also has child tables, then that primary key migrates down the relationship to be a foreign key. All of that large cluttered, as my friend called it, key just starts spreading everywhere, and it impacts performance everywhere. So I said cluttered key, composite key, compound key. I know I've heard many other names for this. It just means a key, either a primary or a candidate or an alternate key, that is made up of more than one column. So I will have a lot of people tell me it's not allowed in their database to have a multi-part primary key. And so either they're really misinformed or they're fibbing to me, so it's good to know the actual rules of your database management system. 
Sometimes the reason they tell me this is because they're actually using development tools like an ORM, an object-relational mapping type tool, like Hibernate or another one, that assumes or needs every primary key to a table, and the way they access tables, to be a single value. That, I think, is a lousy reason to design your database a certain way, especially if you end up having to assign a very large, wide primary key when a two-part, very small one would actually perform better. And there are some tricks to getting around that: you can have them concatenate the multiple-part keys from either the application or from the database as they run it through the ORM. But most of the time, in most database designs, we're going to end up with a single-column surrogate key, so this isn't going to come up as much. So some other terms. Surrogate key, which I've described, is a computer-assigned, system-managed key, and it can be implemented either by an identity property or an identifier data type; by a sequence object, so an object in most relational database systems called a sequence, which is basically a box of numbers that you can go to to populate a table; by a GUID, which is a globally unique identifier. GUIDs are very wide. You will have seen them in your life, in error messages sometimes or in a database, and they have their pluses and minuses, so I'm not going to go into them a lot. Most performance people are not happy with GUIDs because, by being very wide, they defeat that one starred requirement I mentioned, that a primary key be small. But one of the benefits is they are globally unique, so they are used a lot in modern development methods, even when they hurt performance. And then I just have custom here. So this is what we used to call roll your own, and we used to do a lot more of this when DB2 didn't have the identity property or identifier data type, and it didn't have sequences. 
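On the GUID trade-off just mentioned, the width difference is easy to see. This sketch uses Python's uuid module as a stand-in for a database-generated GUID:

```python
import uuid

guid = uuid.uuid4()           # a random globally unique identifier
guid_bytes = len(guid.bytes)  # 16 bytes of storage...
guid_text = len(str(guid))    # ...and 36 characters when displayed with hyphens
bigint_bytes = 8              # versus 8 bytes for a bigint surrogate key
# A GUID is twice as wide as a bigint and four times as wide as an int,
# and that width migrates into every child table's foreign key and index.
```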
So if we wanted to have a surrogate key, we had to write our own little app, and we had a table that worked kind of like a sequence that took care of what was the last number used, or we had code before we inserted that found the max value in the table, added one to it, and tried to insert it, and it just kept trying until it finally succeeded. I mean, there are all these custom ways of implementing a surrogate key. The key to surrogate keys is: if they have any meaning in them at all, they're not surrogates. They're just some designed value, usually in a numeric data type. So the identity property. Again, I pulled this from SQL Server, but most of them work this way. I just set a property on a column, and I say it has a seed, that's the number it starts with, and here's what it's incremented by. So in SQL Server, this is a property of a column. So you choose a data type for a column, let's say customer ID, and you say I want it to start at one, and I want it to increment by one. So every time you insert something, you don't tell the database to put a one in that customer ID and then a two. You just insert all the other values, and the database engine takes care of keeping track of what number should be inserted into the primary key column. Now, those numbers down at the bottom of the screen represent either the max number of values or the highest positive value, depending on what data type you chose. So tinyint, smallint, integer, and bigint. And in my ticking time bombs presentation, I talk about why these blow up. The reason these blow up is that you get to the end. So you start at one and increment by one, and people do this with things like customer numbers, because no one wants to be negative 666, like customer negative 666. So they start at one, or worse, because no one wants to be customer five or to look like we're a new business, so, just like cheques in the olden days, we want to start at customer 10,000. 
We might start customer ID at 10,000 and increment it by one. But if you chose smallint for that, you'd better hope you're not going to have many customers: smallint tops out at 32,767, so starting at 10,000 leaves you only about 22,000 numbers before you run out. The opposite issue is that these numbers, even though the computer is managing them, can have gaps: failed transactions, rolled-back transactions, deleted rows — you can end up with big gaps. So suppose you're using this for transaction IDs, where in theory your auditors don't want you to have missing transactions. If you use the identity property and you end up with gaps, you're going to have to have a process that runs and backfills, assigning transaction numbers where those gaps are. The other biggest myth about identity is that people assume it can only be used on the primary key column. That's not true. You could have a primary key column that is based on something else, and you could have the identity on another column, like a document number associated with that. But in most databases, you can only have one identity column in a table. So this all requires some thought. It requires data profiling in order to decide what your seed number is and what your increment is. And if you really don't want to run out of these and have a time bomb go off in your database, you actually want to start with the lowest possible value, because in SQL Server all of these, except for tinyint, can be signed numbers, and that gives you twice as many values before you run out. Or you could start at the max number and have an increment of negative one if you wanted. There could be a whole other presentation on how to use identity. Most people who learn database design were taught to do this pattern: create your table, give it a name, create a column with the name ID — so customer table, customer ID — then create your other columns.
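That "start at the lowest possible value" advice can be sketched like this — an illustrative example of mine, not from the slides:

```sql
-- Sketch: seeding at the minimum signed int to use the full range,
-- instead of burning half of it by starting at 1.
CREATE TABLE dbo.EventLog (
    EventID   int IDENTITY(-2147483648, 1) NOT NULL
        CONSTRAINT PK_EventLog PRIMARY KEY,
    EventText varchar(200) NOT NULL
);
-- Roughly 4.29 billion values before overflow, versus about
-- 2.15 billion if you had seeded at 1.
```

The trade-off, as noted above, is that businesses often don't want users to see negative numbers, which is why this works best for keys that are never exposed.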
Make it identity, seed one, increment one. Boom, you're done. Perfect design. Every tutorial talks about doing it that way. Almost every class I've ever seen does it that way. That's not design — that's cutting and pasting a pattern into a database design. And a lot of the pain points that we foist upon our business users are because we just use identities this way. And then people fail to make the column unique if they use it outside the primary key, because they think that identity will enforce uniqueness. It won't. If it doesn't have a unique constraint or a unique index, and it isn't part of your primary key, nothing forces it to be unique. And nothing in the database engine forces these not to have gaps. So, sequences. They have been around in a lot of databases; they only recently came to SQL Server. And they work kind of like identities. You can see the properties you can set for them: they have a name — unlike identity, which doesn't have a name. You can tell it what data type it has, what you want it to start with — like the seed — and what you want it to increment by. But you can also set a minimum value, you can set a max value, you can cycle them, and you can cache them so they're available. The biggest difference between a sequence and identity: identity is a property on a column; a sequence is just an object that lives all by itself. It's, like I said, a box of numbers. And instead of just inserting the other columns and letting the engine fill it in, what you do, if you had customer ID and you wanted to populate it from a sequence, is say, as you insert into the customer ID column, give me the next value — you put the sequence name and NEXT VALUE. And it goes and asks the sequence object: bring me the next value. In most DBMSs, it's not going to go to the box; it's going to go to a cache of values. And because there's a cache, if something happens to your server, you might not have gapless sequences either.
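A minimal sketch of those sequence properties and the insert pattern, in SQL Server syntax — the sequence and table names here are illustrative:

```sql
-- Sketch: a named sequence object with the properties described above.
CREATE SEQUENCE dbo.CustomerSeq
    AS int
    START WITH 1        -- like the identity seed
    INCREMENT BY 1
    MINVALUE 1
    MAXVALUE 2147483647
    NO CYCLE            -- don't recycle numbers when we hit the max
    CACHE 50;           -- cached values can be lost on restart: gaps

-- The insert asks the sequence for the next value explicitly.
INSERT INTO dbo.Customer (CustomerID, CustomerName)
VALUES (NEXT VALUE FOR dbo.CustomerSeq, 'Karen Lopez');
```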
The interesting thing about a sequence is that it isn't tied to a table at all. It's not part of table design; it's part of how your insert query — your insert transaction — works on your table. So you can use a sequence for several tables, because it's not tied to a table. And you might think you're the only one using a sequence, but you have no way of knowing, unless you review all the code in your application. Now, as long as you know all this, you can use sequences the way they were intended. So, some other terms. People will tell me that keys must be numeric — that SQL Server or DB2 requires numeric keys. That's a fib. Their design tool or their development tool might require that, or expect it, or assume it, but it's not true. Some of them will tell me keys always have to be integers. No, they don't have to be. Or that you should always use bigint. Or that you should always use the smallest data type you can get away with. Now, those are all sort of rules of thumb. "Use the smallest data type you can get away with" is a good performance design guideline, but it may not meet the business need if that key is exposed to users, like a customer ID. These are all part of doing the logical design and the business precursors to the physical design. Two things that I'm not going to get a lot into. You may not really be familiar with the term clustered key. The clustered key normally is the column or set of columns that the table is going to be clustered upon. I'm used to calling it the clustered index, but it all just depends, again, on what your vendor calls these things, what training you got, how it's implemented in your tool. The clustered key for a table is typically the column or set of columns that determines what order the rows are organized in on storage, and therefore it's a really important design decision.
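A point worth making concrete: the clustered key and the primary key can be different columns. This sketch is my own example of that separation, with illustrative names:

```sql
-- Sketch: a surrogate primary key that is NOT the clustered key.
-- Rows are stored in the order the business actually queries them.
CREATE TABLE dbo.SaleLine (
    SaleLineID int IDENTITY(1, 1) NOT NULL
        CONSTRAINT PK_SaleLine PRIMARY KEY NONCLUSTERED,
    SaleDate   date NOT NULL,
    ProductID  int  NOT NULL
);

-- Cluster on the columns that range queries will scan together.
CREATE CLUSTERED INDEX CIX_SaleLine_SaleDate
    ON dbo.SaleLine (SaleDate, ProductID);
```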
You probably want all the transaction line items to be together in the same spot on a disk or a drive, and you want them in the same order — whatever your business order is. In a lot of point-of-sale systems, that's the order in which they were captured, but it could be ordered by that plus the product that was on those lines. There are all kinds of business decisions that impact how the queries are going to act. And there are all these reasons why you want data clustered together for performance: it has to do with cutting down the amount of I/O, returning data faster, all of those things. A partitioning key: if you're going to partition a table into multiple partitions, you have to say how you want the database engine to spread that data across those partitions. One of the common partitioning keys is by date, so that you can, you know, keep all the recent data together, or one month's worth of data, or a range of values — there are lots of different ways of doing partitioning. Most people who haven't done a lot of physical design assume that your primary key is always your clustered key. That's not true; it all depends on all the things I said. Or that your partitioning key should be an alternate key or something — it doesn't have to be. It could just be a start date. It doesn't have to be unique. It can be any of those things. And of course there are nuances based on different vendor products and relational databases. But I threw these in here because they're called keys a lot. And now the other terms: index and constraint. This is where things get wonky across multiple database systems like Oracle or SQL Server or DB2, because in some systems the choice between a uniqueness constraint and a unique index is a really important design decision, while in other RDBMSs there are just subtle differences, and most of the time you just pick a standard way of doing it and then deal with the exceptions as you need to. These are how some of the keys are enforced.
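Date-based partitioning, the common case mentioned above, looks roughly like this in SQL Server — a sketch with hypothetical names and boundary dates, placing everything on the default filegroup for simplicity:

```sql
-- Sketch: partitioning a table by date ranges (SQL Server).
CREATE PARTITION FUNCTION pfByMonth (date)
    AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');

CREATE PARTITION SCHEME psByMonth
    AS PARTITION pfByMonth ALL TO ([PRIMARY]);

CREATE TABLE dbo.SaleFact (
    SaleDate date  NOT NULL,   -- the partitioning key: not unique,
    Amount   money NOT NULL    -- not a primary or alternate key
) ON psByMonth (SaleDate);
```

Note that `SaleDate` here is neither unique nor a key in the logical sense — which is exactly the point made above.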
So primary keys are typically a constraint — that's how they get implemented. If you tell your RDBMS or your data modeling tool it's a constraint, that's what it will do. But things like uniqueness or an alternate key might be implemented as a constraint or as a unique index. And in your data modeling tools, when you go to generate your physical model, or generate your physical model into your database, or transform a logical model into a physical model, you're often given the option to choose: do you want to implement these as an index or as a constraint? So it's important to understand the nuances for your target database in order to know that those things are being done. One of the other contentious things: we build all these primary keys and foreign keys into our data models. Some tools automatically, helpfully, create an index on each foreign key — the tool puts in a foreign key constraint and automatically puts an index on it, or maybe it's a setting in your modeling tool. Other modeling tools don't automatically create an index. So you go and deploy your physical model in your database, and your DBAs and devs are saying, all these indexes are missing. But that doesn't mean automatically creating them is the right thing to do, because sometimes the auto-generated indexes end up being duplicates, or just unneeded, because you manually created another index, because you knew you were going to be querying those things. Most data modelers I work with aren't really clear on what's going on in their data modeling tool with indexes and constraints. And then of course the last one: I bet all of you on this call have had a developer tell you, don't bother with foreign keys, don't bother with all those constraints — we'll just do it all in the application code, it'll be faster and easier. Well, it's not faster and easier.
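That foreign-key-plus-index point can be sketched as follows — my own example, using hypothetical table names; in SQL Server, declaring the constraint does not create a supporting index for you:

```sql
-- Sketch: the foreign key constraint enforces referential integrity...
ALTER TABLE dbo.OrderLine
    ADD CONSTRAINT FK_OrderLine_Product
    FOREIGN KEY (ProductID) REFERENCES dbo.Product (ProductID);

-- ...but any index to support joins and cascades on that column is a
-- separate, deliberate step (yours or your modeling tool's).
CREATE NONCLUSTERED INDEX IX_OrderLine_ProductID
    ON dbo.OrderLine (ProductID);
```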
It's not, because now someone needs to communicate to them all the business rules about uniqueness and referential integrity and all of those things. Doing it in application code also implies that this is the only way we're ever going to access the data, and I just don't believe that. The other bad design principle is when someone says, we're just going to slap a key on every table — job's done, no analysis needs to be done. Well, I fix a lot of databases that do that, because they say, okay: customer, customer ID. And now we end up with 50 versions of Karen, because no alternate key was added and no one decided whether a customer was Karen Lopez the person every time she showed up, or whether a customer was just anyone who showed up, bought a basket full of stuff, had some shipped and some put together, and rented a truck to take it home. Is that just a customer? So data modelers still need to be involved with the keys. What I think you need to do with keys is read up on all of these terms, especially in your target DBMSs. Learn the generic ones from a good data modeling or database design fundamentals book or blog post; for the DBMS-specific ones, go to your Books Online or your other online or product help and see what the nuances are. Make sure you're using the correct terms when you talk to people. Understand what your data modeling tools do with these — how they create and generate keys and all the parts around them. Learn about all the outliers in your tables and data. I've been tweeting a lot about all the instances where I find people getting this wrong — like just today, someone in Forbes wrote an article about how Social Security number is a great way to identify a person, and it doesn't meet our requirements for a key: they're not mandatory, they're not unique, they change for people, and they can apply to more than one person by mistake at the same time — especially in your system, because your system isn't reality, it's an abstraction of reality.
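The "50 versions of Karen" problem is what an alternate key prevents. A minimal sketch of the fix — the choice of alternate-key column here (email) is purely illustrative; the real choice takes the kind of analysis described above:

```sql
-- Sketch: surrogate primary key PLUS an alternate key, so the same
-- business identity can't be inserted twice.
CREATE TABLE dbo.Customer (
    CustomerID int IDENTITY(1, 1) NOT NULL
        CONSTRAINT PK_Customer PRIMARY KEY,
    Email      varchar(254) NOT NULL,
    FullName   varchar(100) NOT NULL,
    -- the alternate key: a second row for the same email is rejected
    CONSTRAINT AK_Customer_Email UNIQUE (Email)
);
```

Without the `AK_Customer_Email` constraint, nothing stops the application from inserting Karen Lopez fifty times, each with a fresh identity value.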
You might have multiple ones over different phases of your life, or because of identity theft — all of that happens. You need to learn about all these outliers for uniqueness and keys and everything going on, plus the outliers for how your DBMSs and your modeling tools deal with them. And then, personally, set standards for correct term use. Don't be sloppy. No data modeler should be sloppy about that. So that's my slide deck. I think I have to share my screen, so hold on. I hope you can see it. What I'm showing here is just part of the AdventureWorks data model in erwin. We've got a table, product model, and you see that it has a product model ID. In the IDEF1X notation that erwin shows, it's an integer, and anything above the line is the primary key in a physical model. It has used the identity property. The AdventureWorks model also features uniqueidentifiers, which are basically the way to implement GUIDs in SQL Server. You can see that the identity column is the primary key here. They believe that these varchar names are also alternate keys — so they believe that they're unique — and I never trust end users to keep those unique. And this GUID is also a second alternate key. So you can see everything we talked about happening in one model. You can also see that product model product description culture has a multiple-part primary key — a composite or compound key — that's made up of two integers, which are also foreign keys from other tables, plus culture ID, which is also a foreign key. So that's going on there. What else do we have? We've got a two-part one with views — this is a view over here, so I'm not going to worry about it. Now, I'm showing you one modeling tool, and I'm assuming that you have the ability to abstract what I'm showing you across other modeling tools.
You can see that this particular model comes from a SQL Server reference model, so there's the identity property on a column. I'm saying I want to generate this: I have a starting value of one, I'm going to increment by one, and it's an integer. What this tells me is that a designer made an intentional decision — I hope you believe that — that for product model ID we would only use half the numbers available to us, because we're going to start at one, not the negative minimum value for integer, and we're going to increment by one. So we need to go count how many products we have, and how many products we're going to have over time, and make sure that we're not designing ourselves into a time bomb here. Let's see what indexes it has. In erwin, when you create primary keys and alternate keys, you can give them a name, and you can set how the indexes get generated — so if I want an alternate key index to be generated, this is where I set that up. So it has a primary key and it has two alternate keys that can be used. For constraints, there's not really a lot going on there. So that's one example, in one modeling tool, of a fairly complex implementation of keys in one particular table. In this particular table, if I needed to partition it, I could choose which column — none of these are really great candidates for partitioning — but if I wanted modified date to do it, I could make that my partitioning column, which some people call the partitioning key. I can also make much more complex partitioning keys as well. And oh, there's a sequence — I have a kitty sequence. And I said for the sequence, I want my starting value — it's a bigint, so it can be a really big number. I'm not going to start at one; I could start at negative 100, but I'll increment by one.
And then I could also set those min and max values, and whether they cycle. There's just a little bit more complexity. But one of the things you'll notice is that the sequence doesn't show up in my diagram, and there's no way to say that this table uses this sequence, or that this sequence is used by these tables — because actually your database doesn't know. The only things that know what uses the sequence are the pieces of code that create or update data, because that code is what uses the sequence to populate a column. The other nifty thing about sequences is I could use a single sequence, or a whole bunch of sequence objects, across multiple columns. So for anything I need numbers for — ticketing numbers and document numbers and all kinds of stuff — I could have multiple per table; unlike identity, which is one per table. So that's all I really had. Let me go back here. I know we've hit the end of our time, but I just wanted to do a quick check for questions. One comment: there's also the interplay of identity values across various identifications and contexts. Yep — identifiers, identity; we just have to do a better job of talking with each other about all of it. Another: while you're showing the layout of the model — I really like one method I found a while back that talks about the arrangement of entities, putting identifying, strong entities on top and dependent ones on the bottom, to make the model more readable. I found it useful; I need to find that link again. So, that's what some people refer to as "no dead crows," especially if you're using IE notation with crow's feet. If you follow that method — and it is a normal, good method — it kind of goes crazy when you have these larger, more complex things, like this product model product description culture table, which is a giant intersection entity.
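To make the "one sequence, several tables" point concrete, here is a sketch of my own — a shared document-number range feeding two tables via defaults, with hypothetical names:

```sql
-- Sketch: a single sequence populating columns in more than one table,
-- e.g. one gapless-looking document-number range across document types.
CREATE SEQUENCE dbo.DocumentNumberSeq
    AS bigint START WITH 100000 INCREMENT BY 1;

CREATE TABLE dbo.Invoice (
    InvoiceID int IDENTITY(1, 1) PRIMARY KEY,
    DocNumber bigint NOT NULL
        DEFAULT (NEXT VALUE FOR dbo.DocumentNumberSeq)
);

CREATE TABLE dbo.CreditNote (
    CreditNoteID int IDENTITY(1, 1) PRIMARY KEY,
    DocNumber    bigint NOT NULL
        DEFAULT (NEXT VALUE FOR dbo.DocumentNumberSeq)
);
```

Nothing in the catalog ties `DocumentNumberSeq` to these tables except the defaults themselves, which is exactly why you can't tell from a diagram who else is drawing from the same box of numbers.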
And it also gets complex when you start hitting 50 entities in a diagram, because it's harder to enforce that rule. But in general, intuitively, I tend to follow it — maybe with a little bit of difference, in that I end up with a subject area or a submodel that has the independent entities around the outside, with their relationship lines coming in. So at the center of your diagram you have the table that is, like, retail transaction line item, or retail transaction line item modifier event, because it's made up of a whole bunch of concepts. But yes, that's a good layout approach. The reason this layout is here is that I recently reverse engineered a database, and I wanted to make things as readable as possible. Another question: I was interested, curious, about your view on the relationship between data quality and metadata within a data warehousing environment. That's a whole other topic. My general rule of thumb on data warehousing is that your data quality rules usually get implemented in your ETL processes, and you don't enforce quality in your data warehouse tables, because you're optimizing those for read. Most quality constraints are designed for transactional systems, where you're going in and updating cells or rows in a table — that's where you need a lot of constraints and a lot of RI — whereas in a data warehouse you're optimizing for read: you're making very wide tables, flattening things out, pre-aggregating, and so on. That's my general rule of thumb. So I think what I'd like to do is go ahead and end the recording and the formal part of this, and I'll stick around to see if we have any other questions or insights in the chat. Karen, thank you so much. Yes, Karen, thank you so much as always. Another great presentation from you, and I especially appreciate you speaking while in India — you're 12 hours, or 11 and a half hours, ahead of us right now, at least from the Pacific Coast.
Thanks to our attendees for being so engaged in everything we do and all the questions that you've asked. I will go ahead and turn off the recording to start the unofficial conversation. Thank you.