Hello and welcome. My name is Shannon Kempe and I'm the Executive Editor for DATAVERSITY. We'd like to thank you for joining this month's installment of the DATAVERSITY webinar series The Heart of Data Modeling, moderated by Karen Lopez. Today Karen will be discussing "Normalization: It's Not Your Friend or Your Enemy," joined by guest speaker Kerry Tyler. Just a couple of points to get us started. Due to the large number of people who join these sessions, you will be muted during the webinar. We'll be collecting questions via the Q&A section, or if you'd like to tweet, we encourage you to share highlights or questions on Twitter using the hashtag #heartdata. As always, we will send a follow-up email within two business days containing links to the recording of this session and any additional information requested throughout the webinar. For this particular webinar, that will go out by end of day Monday. Now let me introduce our speakers for today: guest speaker Kerry Tyler and primary webinar series speaker Karen Lopez. Kerry is a senior BI engineer with a consulting firm in Nashville, working with all layers of the Microsoft BI stack. He has over 10 years of varied IT experience that helps him get through every day and every problem one way or another. Within the last year, he's begun presenting at SQL Server related events, focusing on the analytical parts of the tool set. Karen Lopez is a senior project manager and architect at InfoAdvisors. She has 20-plus years of experience in project and data management on large multi-project programs. Karen specializes in the practical application of data management principles. She is a frequent speaker, blogger, and panelist. Karen is known for her fun and sometimes snarky observations on data and data management. Mostly, she just wants everyone to love their data, and you can follow her at @datachick on Twitter. And with that, I will turn it over to Karen to get us started. Hello and welcome.
Hi, Shannon. How's your day going today? It's going, now that we've got everything running. Yeah, thank you. Thank you again, Shannon. You rock. So I want to thank all the attendees here today for joining us. Normalization is just one of those topics where just the sound of the word doesn't make me really happy. I don't like talking about it. I don't like teaching it. But there are so many myths out there, and it causes so much pain and suffering for both modelers and the people who have to consume our data models. So it's kind of risky that we're talking about this topic. I'm in Southern California today, on a beautiful day in Southern California. But I think that between what Kerry and I are going to talk about... one of the reasons why I invited Kerry to talk with me today, besides the fact that he was in the room when I asked for volunteers and was probably just the only person who coughed at the wrong time or something, is that Kerry and I have kind of a unique background. We both speak in the SQL Server world, at SQL Saturdays and conferences and things. But another interesting thing is that Kerry and I graduated from the same program at Purdue, but, let's just say, a few years apart. Kerry, you know what, how many years apart? Oh, decades, right? A few. I see. I hear three. Three? No. So, I mean, not that that in itself is really important, but one of the takeaways that I'd like people to have about this (and I'm going to talk a little bit in a few slides about normalization, a brief overview, and where it started from) is that every time I see an introductory class or a workshop or a single slide in a speaker's deck, it's something Ted Codd started to develop in 1970. So that's, you know, just barely even the 70s. And yet here we are in 2015, all those decades later, and it's still something that we're dealing with in relational databases.
And a lot of the myths and misunderstandings, and a lot of the important points, are the same over the years, but I think that with the advent of NoSQL and some other non-relational technologies like XML, a lot of that understanding has been lost. So we're going to talk about some of those things. The other thing is that, as Kerry says here on his slide, he used to fly little airplanes, but he also works in BI, in the data warehousing and ETL world. And there's often this sort of misunderstanding about normalization in those worlds. So I'm going to make sure, because my work is primarily in transactional processing, that I have someone who can try to keep me honest about how normalized things need to be. Also, Kerry's going to be watching the chat and the Q&A, and so he might be interrupting me at times. And he's gotten permission to do that. Not very many people have that. So, I've been doing this for a while, as Shannon said. And one of the reasons why I'm a fan of normalization is that I want everyone to love their data. And I can't wait to start talking about the things that people have told me about normalization. But let's get to know our audience, okay? Who are you? And Shannon, that's your cue. Let's start the poll. Oh, and I won't be... Shannon, you'll have to talk us through it, because I won't see it. Not a problem here. Let me push it out. There we go. The poll is open. Oh, yeah. I get to ask: can I take the actual poll at this time? I do get to see it, actually. Who are you, Karen? I know. Everyone says, I'm going to choose all of them. So, this is who you are: what your mama raised you to be, what you think in your heart, not necessarily what your number one job is right now. And people are voting, so, Shannon, whenever shutting it down is good for you. All right. Yes. The poll is closing in one second. There you go. Now it's compiling the data. Let me push it out here in just a second. I never know. All right.
Now, everybody, it looks like... Yes. Yeah, I can see it now. It looks like we've got some data modelers and some DW/BI people, about seven developers, and 31 who said they're other architects or modelers. So it looks like there are more people who marked other than data modelers. That's really interesting. 23 people said they don't fit in any of those categories, and 35 chose not to answer, maybe because they didn't like those options. So in the chat, especially if you said other, recording what you are would be really good too. Okay. We can go ahead and go to the next poll, Shannon. All right. I am opening it now. There you go. So the question is, how do you personally feel about normalization? Do you think it's the best thing that ever happened to databases? Is it a friend of yours, meh, or kill it with fire? And, Karen, we had a comment come in: John is a data governance analyst. And it looks like a lot of people who marked other are in data governance or are data analysts. Oh, okay. Yeah. I sometimes put data analysts and architects under data modelers. Oh, I am seeing some of the chats. That's cool. It looks like the poll closed. Yep. And it's doing its thing. Let me push it out here in about six seconds. Here's the results. Really, only two people voted kill it with fire. So 18 of you, of the 140, think it's the best thing that's happened to databases. 51 think it's a good friend of yours. 22 are like, yeah, either way. And 47 of you chose not to vote; how could you choose not to vote on how you feel about normalization? So, Shannon, go ahead and fire up the next one, because it's going to look kind of familiar. And the reason I wanted to ask about this is that it's important that we understand, if we're going to talk about friend or enemy, how people are feeling.
So, the next poll that's open is: how do your teammates feel about normalization, either in your models or your database designs? And it's the same answers, from best thing that ever happened all the way through kill it with fire. This was kind of interesting for me, and maybe for some other people here too, because it kind of depends on what your definition of teammate is. Yeah, I'll second that. For me, whether we're talking about my fellow BI people directly on my team, or a larger project where I'm working with people from another vertical, our SharePoint vertical or our .NET development side, will dictate how they feel about that. Yeah, that's a good point. So when I think of teams, I usually tend to think of development teams. So, the poll's closed, so we can go ahead and push the results. Should be out there, I think. And Karen, we've got a couple of comments: one person says they've got a DBA who fights all the time even with a surrogate primary key, and one comment on teammates, that their teammates are developers and end users, and that's kind of an interesting angle, taking that to end users. Cool. So, we're going to talk about surrogate keys and normalization in a minute. So, on what people think their teammates think: only 15 people said they think their teammates believe normalization is the best thing that ever happened. I wonder if they're talking about their own groups, like other modelers, architects, and database design people, or the more extended team. So, 29 said it's my friend, 42 are like meh, and 12, so an extra 10 people, said their teammates would kill it with fire. Okay, so that's the end of our polls. I think we've got those things. Yeah, we talked about those things. So, what I want to talk about today... so now we've talked about who you guys are.
So we have a good mix of people who consider themselves data modelers, and we have only a few who don't, and we had some people who feel great about normalization and some people not so great about it. But why this topic? We'll talk about that in a second. And then: why do people normalize, why do databases need normalization or denormalization, some myths and truths, and some tips. But I have to make this great confession. I've been data modeling and doing database design for almost 30 years. That's a really long time, almost as long as relational databases have been around, really about the same time. I never normalize. I'm going to explain that to you in a minute, but that's my confession. I don't have the normal forms memorized, nor do I plan to. Yeah, the first three, sure, those are easy. I don't care much about the nuances between fourth normal form and Boyce-Codd, and whether or not there's really a sixth or seventh normal form or a tenth normal form. My pro tip for you, though: there's a trend, and if you create your own normal form, you get to name it after yourself, just like planets or stars. You no longer have to name normal forms by their numbers. See, this is where Codd did it wrong. He went by numbers instead of calling them, like, Ted One, Ted Two. He called it first normal form and second normal form. I do believe, though, that understanding the principles of normalization is very important if you're doing design. I don't believe normalization is a process. I'm going to talk a little bit about that in a minute. Not everyone in the data world is going to agree with me on these things. I would so love to hear your comments, or see your comments in the chat, about whether you agree or disagree with these. Or on Twitter. So why this topic, and why not this topic? I would also point out I can never type the word normalization correctly the first time, ever. That's part of my love-hate relationship with it.
If any of those ended up on the slides, I apologize. One of the things about normalization is that it's a love-hate thing. I do a lot of data modeling training, which of course involves teaching people about normalization. I love that I'm teaching people how to love their data, and normalization is part of that. I hate the fact that I have to teach them how to do it. And we'll talk a little bit about how normalization is almost always taught wrong. I work with Graeme Simsion on his training materials, which are derived from his book Data Modeling Essentials, Third Edition. And Simsion and Witt's approach in their book and in their materials is to teach normalization first and then data modeling second. A lot of courses teach data modeling and normalization all kind of in the same thing. And I was really resistant to that initially. I really thought, again with that love-hate relationship I have with normalization, that focusing on normalization first was just drudgery. And now that I've taught this course quite a few times, I've come to appreciate that getting people who are new to modeling and databases to think in terms of good normal forms and why we do it, first, before we start teaching them about crow's feet and boxes and lines and dotted lines and all of those things, is probably a good way to teach the fundamentals first. But the love-hate thing that I have with the word, and teaching it, and even having to think about the normal forms, is the rest of the world: people who don't wake up every day like people like us, who wake up every day thinking, I get to solve another data problem. I get to tell another data story. I get to create a new report or data visualization. Kerry, your job's like that, right? You get up every morning just ready to go fix another data problem. I at least get up that way. Some days I get beat down, but I at least start out that way. Excellent.
So I created this meme quite a few years ago: the part about normalization that everyone hates is that it means, in order to get your data back out of the database, developers have to write queries that do joins. And there's this whole mythological universe someplace out there, that I've never been to, where people can make non-redundant database designs that are fully normalized but you never have to do a join, and I have yet to figure that out. So I swore when I submitted the abstract for our webinar today that I would not actually talk about normalization, and then I realized I really have to. But this is not a how-to on normalization. I'm going to be very brief about these things, but I want to talk about why normalization is an important thing for modelers and database designers to understand. So Ted Codd, Dr. Codd, invented this as part of his papers on the relational model to address update anomalies. I'm going to say a little bit about that in a second. But basically, and this is me paraphrasing, in a normalized relation, aka table (even though I know some people don't like to use that as an equivalent term, I live on projects where we design tables, not relations, so from this point forward I'm going to use database terminology to talk about normalization, and I realize how blasphemous and wrong that is, but I'd rather this be a practical discussion than an academic one), normalization gets you to a place where you have one fact in one place, and then you reference that instead of duplicating it everywhere. That means that you have one fact: one time to create it, one place to update it, and one place, or one item, to delete.
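Karen's "one fact in one place" point can be sketched in a few lines. This is a hypothetical example (the table and column names are invented for illustration, not from the webinar), using Python's built-in sqlite3:

```python
import sqlite3

# Hypothetical tables illustrating "one fact in one place": the
# department name is stored once and referenced by key, so a rename
# is a single-row UPDATE instead of many scattered updates.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (
        department_id   INTEGER PRIMARY KEY,
        department_name TEXT NOT NULL
    );
    CREATE TABLE employee (
        employee_id   INTEGER PRIMARY KEY,
        employee_name TEXT NOT NULL,
        department_id INTEGER NOT NULL REFERENCES department(department_id)
    );
    INSERT INTO department VALUES (1, 'Marketing');
    INSERT INTO employee VALUES (10, 'Jane Doe', 1), (11, 'John Roe', 1);
""")

# Renaming the department: one fact, one place to update.
rows_touched = conn.execute(
    "UPDATE department SET department_name = 'Growth' WHERE department_id = 1"
).rowcount

# Both employees see the new name through the join, with no second update.
names = {row[0] for row in conn.execute("""
    SELECT d.department_name
    FROM employee e JOIN department d ON e.department_id = d.department_id
""")}
```

The cost, as Karen notes next, is that reading the data back requires the join.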
So, a quote. This is from the Wikipedia article on normalization, and if you ever have spare time, you can go find all the data modeling and normalization entries on Wikipedia and see all the editing wars that happened back and forth between them. But what Ted Codd said is that normalization was created to avoid undesirable insertion, update, and deletion dependencies. So the update anomaly is how people refer to it. And, oddly enough, to reduce the need for restructuring the collection of relations as new types of data were introduced, thus increasing the lifespan of application programs. Now, this particular goal of normalization will be quite shocking to the people who tell us that relational databases are fixed schemas, unable to change, inflexible. Normalization was created to make schemas more flexible. Think about it: if you designed a data store that had everything about an invoice all in one row in a data file, think of a comma-delimited text file with the invoice header information and then all the invoice lines repeated within that same row, and all the product information and the pricing information and the tax information, all in one really wide record, then your application code would have to know how to parse that long row of everything. And so one of the things when Dr. Codd designed the relational model was that if we separate things out, so that facts that are highly dependent upon the same identifier are kept together, we can add other sets of facts and link them up to everything without breaking everything we built before. That sounds pretty darn flexible to me. The third point: make the relational model more informative to users. That's an interesting way to think about it. I think it goes to the fact that if you put data facts together that belong together, they're much easier to understand.
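The invoice example above can be sketched as two tables instead of one wide record. This is a minimal, hypothetical schema (the names are mine); the shipment table at the end shows the "add new facts later without restructuring" goal Codd described:

```python
import sqlite3

# Header facts depend on the invoice number; line facts depend on
# invoice number + line number. So they live in separate tables
# instead of one wide, repeating row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoice (
        invoice_id   INTEGER PRIMARY KEY,
        customer     TEXT NOT NULL,
        invoice_date TEXT NOT NULL
    );
    CREATE TABLE invoice_line (
        invoice_id INTEGER NOT NULL REFERENCES invoice(invoice_id),
        line_no    INTEGER NOT NULL,
        product    TEXT NOT NULL,
        quantity   INTEGER NOT NULL,
        PRIMARY KEY (invoice_id, line_no)
    );
    INSERT INTO invoice VALUES (100, 'Acme', '2015-06-01');
    INSERT INTO invoice_line VALUES (100, 1, 'Widget', 3), (100, 2, 'Gadget', 1);
""")

# A new set of facts added later (say, shipments) links up by key,
# with no restructuring or re-parsing of what was built before.
conn.executescript("""
    CREATE TABLE shipment (
        shipment_id INTEGER PRIMARY KEY,
        invoice_id  INTEGER NOT NULL REFERENCES invoice(invoice_id),
        shipped_on  TEXT NOT NULL
    );
    INSERT INTO shipment VALUES (1, 100, '2015-06-03');
""")

line_count = conn.execute(
    "SELECT COUNT(*) FROM invoice_line WHERE invoice_id = 100").fetchone()[0]
```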
So when you mix product and invoice dates and prices and office locations all in the same concept, that's harder to understand. The last part, which is a really important part, and this gets to the heart of a lot of the SQL, meaning relational, versus non-relational discussion, is that we design structures in the relational database that are neutral to the queries that are going to happen to them. Now, one of the reasons why we want to do that in a large, complex enterprise system is that we rarely get the opportunity to design a database that is only going to be used by one function or one application. We normally design enterprise databases that are going to be used by all kinds of queries, all kinds of applications. Take your customer entity, or your customer table. We don't build a separate database... well, we're not supposed to build a separate database or a separate table for every application. That is both a plus and a minus for relational systems, or enterprise relational architectures. The reason it's a plus is that we design these things in a way that makes them highly reusable, not overly optimized for one query, because when we optimize a structure for one query, we automatically de-optimize it for every other query that would use it. That also means, though, that the query, and the database engine, then has to do a lot more work: the joins, aggregating things, creating calculated data. It's a trade-off. But if you go by Karen's mantra, that every design decision comes down to cost, benefit, and risk, you can see that we're trading off some cost and some benefit. Then we learned in the data warehouse world that we could design things that were optimized for certain sets of queries. We've also learned over the years in relational databases that sometimes we do end up designing structures that are optimized for certain queries. And the perfect example of that is why we create views and materialized views.
We create those now either to simplify someone's understanding of the data or, in the case of a materialized view, to hold data that has been pre-joined and pre-aggregated and pre-calculated, so that we can pay the price in one place and then derive the benefits in many places. I think that's why we came to realize that even in the relational database world we needed a data warehouse or business intelligence model. So, Kerry, this is where you work, more on that side. Are all of your structures highly normalized? Due to the specific things I work on a lot, the answer is actually no, because my data models in lots of cases tend to feed into downstream OLAP systems, the two flavors of Analysis Services in Microsoft SQL Server, for example, and those tools are made to consume star schemas much better than anything else. So the final data models that I wind up with a lot these days are actually pretty denormalized, because that's what the reporting tools work best with. Right. And even that, I mean, you said star schema, or snowflake schema, whatever people are doing in their data warehouse world. I mean, it's a contentious issue. I know there are a lot of people who say they still want highly normalized structures there. But this is a case where, usually for performance reasons, we've chosen not to have normalized data. And one of the reasons we can do that is because we rarely update the data in those structures. The updating of the data typically happens in a transactional system. And then we move that data out, and now we've restructured it and denormalized it so that we can get the benefits. Is that how you see it? Yes, and that's the great point: you designed that to be read-only, and, like the comments you made earlier about materialized views, you move the pain of inserting into that model to an off-hours, one-time batch process. Right.
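As a rough illustration of "pay the price in one place, derive the benefits in many": SQLite has no materialized views, so this hypothetical sketch simulates one with a pre-aggregated summary table built in a batch step, which the read-only reporting queries then hit without joining or aggregating:

```python
import sqlite3

# Transactional detail rows (invented data for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sale (sale_id INTEGER PRIMARY KEY,
                       product TEXT NOT NULL,
                       amount  REAL NOT NULL);
    INSERT INTO sale VALUES (1, 'Widget', 10.0), (2, 'Widget', 5.0),
                            (3, 'Gadget', 7.5);
""")

# The "refresh" step: pay the aggregation cost once, off-hours,
# materializing the result into a table.
conn.executescript("""
    CREATE TABLE sales_by_product AS
        SELECT product, SUM(amount) AS total
        FROM sale GROUP BY product;
""")

# Reporting queries read the pre-computed structure directly.
widget_total = conn.execute(
    "SELECT total FROM sales_by_product WHERE product = 'Widget'"
).fetchone()[0]
```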
Or even, in the case of the NoSQL world, a completely different system, one that's trickle-updated and not necessarily in batch, or, again, something that's optimized for reads. So when the NoSQL people quite correctly talk about how normalized models are so painful to query and don't perform, that's because they're thinking of a data story, a problem they're trying to solve, which is consuming data. And yet the relational model was created to solve a problem that existed long before that, in network and hierarchical databases: data being inconsistent because it appeared in many places and therefore got updated out of context of other updates. So one of the things about normalization is it's all about... Oh, go ahead. Sorry, there was an interesting comment made in the Q&A a few minutes ago, before we move on too far. Stacey says: I like my logical data model to be normalized and then denormalized, or genericized and abstracted, in my physical data model, and asks for thoughts. Yeah. So that's really common. I mean, that is one of the reasons why we have these two separate levels of models, logical and physical. I think logical models should be very normalized, because I think the role of the logical model is to store the specifications about data, our decisions about data, and the physical model is about building something that is leveraged off that model, but we're building it and trying to optimize it for either a specific application or a specific tool set. That is the reason why we have the two sets of models. I will throw in now that most data modeling tools are very cumbersome about having logical and physical structures of the same thing that are highly different in shape. So the greater the difference between your logical and physical, the harder it is to maintain that.
Sometimes people just don't even keep track of the linkages; they just have a completely separate logical model and a completely separate physical model. The modeling tools have gotten better at that, but it's still a difficult thing to do. So, great comment, great question. So, normalization. When you go read the rules about normalization, it's all about the keys, or the identifiers, to use the correct term, and yet people don't think of them that way. Every question you have about the normal forms talks about the keys, and it depends on the meaning of the keys and columns. So I have people tell me all the time: oh, you can just use a script to normalize all your data. I don't know how a script knows the meaning of department name versus department description versus facility description. I don't know how a script would know that. And one of the problems we have when we use surrogate keys is that your normalization stuff goes all to heck; when I talk about the normal forms, I'll come back to that. And normalization questions, or rules, depend on the makeup of the parts of your keys, in other words, the columns. So here, briefly (and I so didn't want to have this slide), is a description of first, second, and third normal forms. In first normal form, all the instances, here rows, have the same facts, or columns, in a table, and there are no repeating duplicate columns. So all the instances of an entity have the same facts, and there are no repeating groups, as it's typically called. That just means you would never have, say, a table called person and then a bunch of columns, child1, child2, child3, child4, to keep track of all the children that that person is related to.
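The person/children repeating-group example might look like this in miniature (plain Python stand-ins for rows; all names are invented for illustration):

```python
# The pre-first-normal-form shape: fixed child1/child2/child3 columns
# on person, which break as soon as someone has one more child than
# the schema anticipated.
person_children_wide = {
    "person_id": 1, "name": "Pat",
    "child1": "Al", "child2": "Bo", "child3": None,
}

# First normal form instead: one row per (person, child) fact, keyed
# by the parent. Another child is just an insert, not a schema change.
children = [
    {"person_id": 1, "child_name": "Al"},
    {"person_id": 1, "child_name": "Bo"},
]
children.append({"person_id": 1, "child_name": "Cy"})  # one more child: fine

names_for_person_1 = [c["child_name"] for c in children if c["person_id"] == 1]
```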
In one article I read, they said this is kind of just implied, because the relational model has no concept of repeating groups, and therefore all the normal forms would have this rule. Second normal form only applies to multi-part keys: no non-key fact in your relation is about just part of the primary key. So you're looking at dependencies. When I say is about, I mean something like the difference between department ID and department name. If department ID is your primary key, department name would be the non-key fact, and in theory it's dependent upon department ID. Third normal form is: no fact depends on another non-key column or fact. So that tells you about these dependencies, and the word dependency comes up a lot in normalization descriptions; it's about being able to identify one thing from another. And most of the normalization problems that I see in designs I review are because someone has thrown a surrogate key on a table, a GUID or an identity or a sequence, something like that. And now these questions just seem to go away, and it all looks so easy, because if you think your primary key is one, two, three, four, five, then you're neither breaking these rules nor meeting them, because one, two, three, four, five means nothing. So an important takeaway, especially as you're dealing with your teammates, is to understand that applying surrogate keys really early in your process means that you're going to make more normalization mistakes, and the normalization issues that you have are not going to be apparent. This is why, while I use surrogate keys a lot, I apply them as late as I can in the process of doing a design, because they cloud all these questions. Is department name dependent on one, two, three, four, five? I don't know, because one, two, three, four, five is just a row identifier that someone slapped on this table.
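The surrogate-key point can be made concrete with a tiny, hypothetical check: the row id answers no dependency questions, but you can still detect that department_name rides along with department_id, a non-key column, which is the third-normal-form smell the surrogate key was hiding:

```python
# An employee table with a surrogate key slapped on (invented rows).
# The id is just a row identifier; the interesting dependency is
# between two non-key columns.
employees = [
    {"id": 1, "emp_name": "Jane", "department_id": 7, "department_name": "Sales"},
    {"id": 2, "emp_name": "John", "department_id": 7, "department_name": "Sales"},
]

# Does department_name vary independently of department_id? If every
# department_id maps to exactly one name, the name depends on the id
# and belongs in a department table, not here.
dept_names = {}
transitive_dependency = True
for row in employees:
    seen = dept_names.setdefault(row["department_id"], row["department_name"])
    if seen != row["department_name"]:
        transitive_dependency = False
```

This is only a heuristic on sample data, of course; the real answer, as Karen says, comes from the meaning of the columns, which no script knows.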
I have to go find the alternate key, or one of the candidate keys, in order to answer these questions. So, Ted Codd: the point we make all the way through third normal form is that every fact is either part of the key or depends upon the key, the whole key, and nothing but the key, and then people throw on "so help me Codd" just to round it out. And our friend Michael J. Swart, who's also in the SQL Server world, does lots of these comics about data stuff, and this is one of my favorites, where he says Ted Codd hates that thing you just did, and I've been known to put those in a slide deck just when I'm doing a model review. This is a good way of remembering that first, second, and third normal form depend upon the key, the whole key, and nothing but the key. Graeme Simsion, who was the previous host of a webinar series at DATAVERSITY, says in his Data Modeling Essentials that normalization is like marriage: you always end up with more relations, so more tables. And this is one of the reasons why people really hate to think about normalization: there are too many tables. So my answer back to them is always, well, how many tables should there be? If you can make a scientific conclusion that there are too many tables in this design, then you obviously have an answer for what's enough. Another related claim, and I see this in a lot of blogs, is that there's some magic number of joins; the number I've seen is anywhere from five to seven. If your query has to join more than five tables, then your database design is wrong. And I'm here to tell you that that kind of generalization has no business in the data modeling or database design world, and anyone who says things like that needs instead to find a way of asking: hey, why do we have so many joins? Do we need all this data in our query? What is the reason the data's all spread out across all these joins? Those are the questions to be asking. And yes, a 17-join query is going to be ugly.
But I would say it's rare in the transactional world that we actually need a 17-table query in order to perform a transaction. Sounds to me like someone's trying to do some reporting or some ad hoc queries on our transactional system, and as we've said before, we think there's probably a better place to do that. So, Kerry, when you guys are getting data out of a source system and getting it ready for ETL, how many is too many joins for that? How many is too many? Yeah. Not as a rule, but where do you start getting scared? I don't know. I don't know if there's a hard number. It depends on how big the tables are and what the lookups look like, honestly. It's the old SQL Server DBA answer: it depends. If the keys are all, or a lot of them are, GUIDs or something (I say "gwid" instead of "goo-id" because I'm weird), then it's not going to take very many for me to get nervous about how that query is going to perform. But to some extent, it goes back to a point I made earlier, which is that since this is probably happening in an off-hours batch type situation, even if performance is kind of rough, I'm probably not going to worry about it too much. Or what I'll do is do it in chunks and use multilayered staging tables to build up what I really need, if it's a really wide fact table, for example, that's pulling data from a whole armload of source tables. Yes, but a five-table join doesn't make the big boys cry, does it? No, not five. Five is not that bad. Exactly. Yeah, exactly. And yet I hear that five is too much for a query. In blogs, too. So one of the things I mention, when people start spouting these rules from normalization myths at me, is that enterprise applications are really complex. I mean, that model in the lower left there, that's 800-and-some tables. Now, no, not every query is going to hit every table, but there are just all kinds of queries.
And I try to tell people: get over the fact that you have to do joins, get over those things, and there are things we can do to help tune a poorly performing query. Sometimes that's denormalization, and sometimes it's not. So let's talk about what normalization isn't. My favorite quote from a conference once was that normalization was created to slow down the read and write heads on spinning disks so that they didn't crash. And I thought that was the silliest explanation of normalization. Someone said that in a presentation, and I thought they were joking, and I laughed out loud, and then no one else did. The thing is, there are so many myths about these things: that if you have to do more than five joins your database design is wrong, or that you have to take everything through first normal form, then second normal form; I'll talk about that in a minute. What normalization is, as we had earlier in the deck, is a design philosophy to minimize update, delete, and create failures so that we have consistent data. It is a data quality thing, and, if I had time to demo this, it's also a performance tuning thing. So think about if you had a transactional table that had, let's say, a million rows, and you had a bunch of repeated information: you've now artificially made that table very much bigger. The indexes are bigger. Its storage space is bigger. Yes, it's just like how data quality was invented. That's how people think of normalization. If you had this big table, meaning lots of rows, and it wasn't normalized, and you had to do an update, instead of updating the data in one place you could end up having to update it in several places, and that's a trade-off on update performance. Now, most of the time we focus on query performance. We don't focus as much on insert and update performance, but there are some workloads, some data stories, where we do. Why do people hate normalization?
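The update trade-off described above can be sketched with a hypothetical denormalized table, where one changed fact becomes a multi-row UPDATE (and every missed row is an inconsistency waiting to happen):

```python
import sqlite3

# Denormalized: the supplier's city repeats on every order row
# (invented schema and data for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_denorm (
        order_id      INTEGER PRIMARY KEY,
        supplier_name TEXT NOT NULL,
        supplier_city TEXT NOT NULL
    );
    INSERT INTO orders_denorm VALUES
        (1, 'Acme', 'Columbus'),
        (2, 'Acme', 'Columbus'),
        (3, 'Acme', 'Columbus');
""")

# One fact changed (the supplier moved), but three rows must be touched.
# In a normalized design this would be a single-row UPDATE on supplier.
rows_touched = conn.execute(
    "UPDATE orders_denorm SET supplier_city = 'Dayton' "
    "WHERE supplier_name = 'Acme'"
).rowcount
```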
Well, because they think normalization is the opposite of performance. There are cases where normalization improves performance, because you're storing one fact in one place, and that makes your database smaller and your tables smaller. Mostly, though, it's taught wrong, which I'm going to talk about. For the most part, it's misunderstood. And then the last thing — heaven forbid — it's not sexy. Someone should tweet that. I agree. Normalization is not sexy. I wish we didn't have to talk about it. I wish we could teach people data modeling — and I have done this before — without mentioning the word once. Why is it taught wrong? Or how is it taught wrong? Every course I've ever seen and every presentation I've attended on it teaches it as a process, I think because that's how the presenters learned it themselves. What I mean is, they say: first our data is in zero normal form. Then we put it in first normal form by removing all the repeating groups. Then we put it in second normal form by removing partial key dependencies — making sure that all the non-key columns are dependent upon the whole key. And then for third normal form, we take out any columns that are dependent on another non-key column and move them to another table. Well, guess what? In the real world — I don't know about you guys — I have never done a data model or a database design that way. I start with a collection of facts. I start with entities. I start with the facts I need to know about that entity. Oh, I need to know the customer's name, and I need to know some information about them. And the first thing I'm thinking is: well, a dependent isn't the customer, so I'll go create another table. I never created that first normal form data model. I think we do ourselves a huge disservice by teaching data modeling and database design this way. I just want people to understand the concepts and then never speak of it again.
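Rather than walking a design through the normal forms step by step, the "model the entities and their dependencies directly" approach Karen describes lands straight on the dependency-clean design. A sketch in SQLite — table and column names are invented for illustration:

```python
import sqlite3

# One fact in one place: every non-key column depends on the key, the whole
# key, and nothing but the key — no stepwise 1NF/2NF/3NF pass required.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        unit_price REAL NOT NULL          -- depends on the product, not on any order
    );
    CREATE TABLE order_line (
        order_id   INTEGER NOT NULL,
        product_id INTEGER NOT NULL REFERENCES product,
        quantity   INTEGER NOT NULL,      -- depends on the whole (order, product) key
        PRIMARY KEY (order_id, product_id)
    );
""")
tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # → ['customer', 'order_line', 'product']
```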
So when I said at the beginning that I never normalize anything, this is what I mean: I do not take things through first, second, third normal form. I just model things based on their meanings and the dependency questions. If I have an entity called facility and I need to keep track of who the manager of that facility is, I don't put "Jane Doe" on the facility table. I automatically know that Jane Doe isn't a facility, and therefore she's probably got her own primary key, and I make sure Jane Doe's information is accounted for in some other table — employee, or worker, or person. People use — yeah, go ahead. As someone who didn't learn modeling in a class — well, let's say I did have one, whatever — how much of that behavior — and I do the same thing, basically — how much of that behavior is because you know what you're doing? By that I mean being intimately familiar with the whole concept, versus just getting taught or just getting started. Yeah, so there is a little bit of that. I tell people I data model by instinct, intuition, and just having modeled the same thing 100 times before. There's that. But I've actually seen projects where people force you to produce a first normal form model or table, and then a second normal form one. And for me, that's the part I don't agree with. Usually what happens is you model something, then you look at it and say: oh, you know what, that phone number we put on customer doesn't go there. It goes over on this other table called contact mechanism. And so we're normalizing out the misses — having put things in the wrong place because we misunderstood them, or didn't realize there were many of them, like phone numbers or email addresses. That I get. But I also think people think normalization is bad because they think you have to first do first normal form, then second, then third. It's not a process.
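The facility/manager example above can be sketched directly: Jane Doe isn't a facility, so she lives in her own table, and the facility carries only a foreign key to her. A hedged illustration in SQLite, with all names invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")   # SQLite doesn't enforce FKs by default
con.executescript("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    CREATE TABLE facility (
        facility_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        manager_id  INTEGER REFERENCES person(person_id)  -- not 'Jane Doe' inline
    );
    INSERT INTO person   VALUES (1, 'Jane Doe');
    INSERT INTO facility VALUES (10, 'North Plant', 1);
""")
row = con.execute("""
    SELECT f.name, p.name
    FROM facility f JOIN person p ON p.person_id = f.manager_id
""").fetchone()
print(row)  # → ('North Plant', 'Jane Doe')
```

Jane's details are stored once in person, so renaming or re-assigning her never touches the facility rows beyond one key value.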
And people use normalization as a grade. Oh, fourth normal form — that's a better model. Third normal form is the magic one. I keep hearing this all the time: "our models are all third normal form." Well, I hate to tell you this, but your model doesn't have a normal form at all. Your database doesn't have a normal form. Only tables — relations — have a normal form, and it's about the primary key. You could create a simple code table, and because it has a single-column primary key and a name and a description, it's automatically in all the normal forms — or it could be. So I see these standards all the time: "we will not go beyond third normal form." Well, I hate to tell you this, but with the higher normal forms — which aren't grades — if you don't have a multi-part primary key, you're probably already in a higher normal form than third. People treat the normal forms like a grade. It's not a grade. It's a measurement. It's like saying we're going to set the temperature in our refrigerator to minus 49 because that's better than 40 degrees. It's not. It's just a measurement. So normalization — yeah, go ahead. There's a comment here that kind of goes along those lines. Max asks: isn't Boyce-Codd normal form the magical — and I'm guessing he means the end goal — place to go? But there's more beyond that. Did you know that? There's, like, more. And so when I say there's a magical one — I just keep hearing, you know, third normal form, we're not going to go beyond that. I used to have a project manager who, every time I'd open a model, would say: that's not in third normal form, our standard is third normal form. And I'm like, I don't know how to break this table to bring it back down to third normal form. So because it's taught this way, people think it's a process you're going to drag them through — and I'm going to have some tips about that, I'm sure. I also think that, yeah, it's like a grade. It's like any other thing with a number.
People see the number as a grade. Some of the other reasons people hate data modelers, or people who want normalized designs, is that there are so many misunderstandings, and sometimes we modelers talk in terms of "it's important because it's not in third normal form." We do ourselves more harm by talking about normalization that way — first of all, in front of users. No user cares what normal form their database objects are in. And I think we shouldn't really care either, because in theory — so, if we're doing a logical model, there's a whole other set of rules for logical models. If we're doing a logical model, I do strive to engineer out all the dependencies, even the higher normal form dependencies. I want one fact in one place in my logical model, so that I don't have inconsistent definitions or data types, and so that I know for sure whether there can be many of those things and not just one — I'll keep using email address as an example. I'd say the vast majority of database designs and data models out there have just a single email address for a customer or a person or an employee, and we know in the real world there are multiple ones. Do we put those in another table? Well, we should. If a customer has multiple email addresses, there's probably a reason, and we might want to keep track of that. But I also think that sometimes it's okay to entertain denormalization. So, coming back to email address: ask yourself whether your business is interested in keeping track of multiple email addresses for a customer. Unless you're in collections or forensics or something in security, my guess is your business might come back and say: yeah, let's keep track of three email addresses, just because we want to make sure we can reach them if we need to. You know what? That might be the perfect time to sully your data model and your database design and just put three columns in. And why would you do that?
Well, first, performance. Second, chances are that the business that told you to go from one email address to three isn't going to come back and say: now go to 14. I mean, they could, because every design decision comes down to cost, benefit, and risk. So when you denormalize, you're taking a risk that you understand the nature of the data and what the trade-offs actually are. But that might be one great example. And in fact, we kind of do that already with things like address line one, address line two, address line three, address line four. That sounds like a repeating group, doesn't it? And yet we see that design all the time in our models. So it bothers me to see data modelers just refusing to denormalize some things that could pay a big benefit and carry a relatively low risk. So, some of the tips I wanted to go through. I think we should just stop talking about first, second, third normal form unless we're trying to deal with a really tricky problem — unless the accuracy of the data is so important that we need to understand the mathematics underlying our data to ask those questions: What's it dependent upon? Is it dependent upon anything else? Do we have the right primary key for what we're talking about? So I'd like to see us — even though I've now just done a whole webinar on normalization — not talk about it much: learn it, use it, and understand it. You should know the normal forms and know the lingo; just don't use the lingo as much. I think it's much more important, if you're having a data modeling or database design debate, to be able to explain the anomaly you're trying to avoid. So instead of saying "well, that's not in first normal form" — don't say that — say: you have these three columns, but they're really the same thing just repeated. And what happens when, instead of having three of them, we need seven of them? How are we going to do that?
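The two email-address designs being weighed here — the normalized one-to-many child table, and the deliberately denormalized three columns (like the familiar address_line_1..4) — can be sketched side by side. A hedged illustration in SQLite; all names are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Normalized: emails live in a child table, so a customer can have any number.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE customer_email (
        customer_id INTEGER NOT NULL REFERENCES customer,
        email       TEXT NOT NULL,
        PRIMARY KEY (customer_id, email)
    );
    -- Deliberately denormalized: the business asked for exactly three,
    -- a cost/benefit/risk call rather than an accident.
    CREATE TABLE customer_flat (
        customer_id INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        email1 TEXT, email2 TEXT, email3 TEXT
    );
    INSERT INTO customer VALUES (1, 'Pat Smith');
    INSERT INTO customer_email VALUES (1, 'pat@work.example'), (1, 'pat@home.example');
""")
n = con.execute(
    "SELECT COUNT(*) FROM customer_email WHERE customer_id = 1").fetchone()[0]
print(n)  # → 2
```

The child table never hits a "now we need a fourth column" wall; the flat table avoids a join. That is exactly the trade-off being argued.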
Or: you'd put all this nice information into the next seven columns, and I think the reason you've done that is just to avoid having to create a new table and do a join. Maybe we should do that — and consider performance. I said that normalization is all about the keys. But most primary keys in databases now are meaningless surrogate keys, and that makes it really hard. It means you need to understand the alternate keys, which are usually business keys, and what that entity is, so that you can decide which attributes belong to that entity. If you have a primary key that's a surrogate key and you have not identified an alternate key through a constraint on that table — and 99.9% of the time there will be an alternate key that you should be enforcing — then you don't know whether you're normalized or not. So that means primary keys, alternate keys, and foreign keys should be enforced. I was going to say: if you don't enforce your alternate key, you don't have an alternate key. That is true. And I should say — people have to remind me — alternate key is not a database term. It's a relational term, and it usually means what could have been the key, usually a business key; it's enforced via uniqueness constraints and may be made up of many columns. I think it's also important for data modelers to understand denormalization patterns. What sorts of things, if we do have to denormalize for performance, make the most sense? I gave an example of a repeating group. Even though that kills first normal form — I think I just broke everything — it's the most common denormalization I do. It is a risky move, but sometimes it has huge payoffs — sufficient payoffs that the end users and the teammates are willing to take the risk, knowing that we've traded risk for a performance gain. I get these requests all the time in our change management system: "this needs to be denormalized."
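Carrie's point — if you don't enforce your alternate key, you don't have an alternate key — can be demonstrated directly. A sketch in SQLite with invented names: a surrogate primary key plus a uniqueness constraint on the business key:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,   -- meaningless surrogate key
        sku        TEXT NOT NULL UNIQUE,  -- alternate (business) key, enforced
        name       TEXT NOT NULL
    );
    INSERT INTO product (sku, name) VALUES ('SKU-001', 'Widget');
""")
caught = False
try:
    # Without the UNIQUE constraint, this duplicate business key would slip in
    # silently, because the surrogate PK is always "unique" by construction.
    con.execute("INSERT INTO product (sku, name) VALUES ('SKU-001', 'Widget again')")
except sqlite3.IntegrityError:
    caught = True
print(caught)  # → True: the duplicate business key was rejected
```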
And I flip those back to the people, saying: tell me the denormalization you're proposing. You know, someone wanted me to collapse about 12 tables down — header, detail, detail, detail, so one-to-many, one-to-many, one-to-many. You want me to collapse all those into one table? Which, of course, would have meant many, many repeating groups of repeating groups. But in the change request system, it was just: "there's too many joins, this needs to be denormalized." And everyone was like, yeah, great, do that. And I said, no, I need to see the design to understand what he wants. And he kept pushing back on that. And finally, when he did produce it, everyone said: no, that's crazy, we're not doing that. So "needs to be denormalized" is not a requirement. It is a suggestion of how to attack a problem at a very high level, which does not belong in a change management system or in a change request. That means that in order to provide a good response to "this needs to be denormalized," we modelers and designers need to understand what other performance tuning techniques could be applied to the problem. Denormalization is not the first, second, third, fourth, fifth, sixth, or even tenth thing I would look at if I were tuning a query. People on my teams understand that now. I try to squash all "joins are evil" discussions unless we're making a nice meme about them. Because joins are a feature of a relational model in a transactional design — not so much in a data warehouse or BI design, but a feature. They are what keep our CIOs from going to jail, what gain us new customers, what ensure the business's confidence in IT — there's a whole stump speech I can do about why joins are wonderful. And I think we all need to do those, too. Now, having said all of this, we modelers are not known for being flexible about sullying our data models, our physical data models.
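The claim that denormalization sits far down the tuning list can be made concrete. A hedged sketch in SQLite (the systems under discussion are SQL Server; names are invented) showing that the first fix for a slow lookup is often an index, not a collapsed table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE order_detail "
            "(detail_id INTEGER PRIMARY KEY, order_id INTEGER, amount REAL)")

def plan(sql, params=()):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    return [row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql, params)]

query = "SELECT amount FROM order_detail WHERE order_id = ?"
before = plan(query, (5,))             # full scan: nothing to search with yet
con.execute("CREATE INDEX ix_detail_order ON order_detail(order_id)")
after = plan(query, (5,))              # the optimizer can now use the index
print(before)
print(after)
```

The "before" plan scans the table; the "after" plan names ix_detail_order — the join or lookup got cheaper without touching the schema's shape.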
That's why we have people as architects and modelers: we can respond to normalization trade-offs because we can look at cost, benefit, and risk. So, are there any other questions or chat items, Carrie, as we get near the end? Karen, one thing was said in chat that — as soon as I find it again — I thought was a good question, and I know it's right up your alley as well. Iana asked: is there a suggested normalization approach that supports easier data governance? Wow. I'm not even sure I totally understand that question. Are we talking about methods? So if I told you to strike normalization from our conversations — Yeah, Karen, I meant from a technical metadata modeling perspective. Okay. I think the thing is that normalization is about data integrity, data quality. That's why we're engineering out create, update, and delete anomalies. So wherever data quality fits into a data governance process, that's where normalization sits. But I think if we do this sort of natural form of data modeling that I talk about, where the end result is a certain level of normalization, those are the things that fit in. From a data governance perspective, I as a modeler or data architect review and approve the designs — the models and the designs. Sometimes I create them myself, which makes it really easy to review and approve them — and sometimes difficult. We know that when people design tables and their job responsibility isn't to ensure data quality, they tend to design highly denormalized, optimized-for-the-query tables — the exact opposite of what Codd said. That makes their query go blazing fast, and it also makes the data very unprotected and poor quality. But I know that performance and data quality are a trade-off. I'm always going to lean towards data quality and protecting the data, and sometimes that works, and sometimes it doesn't.
So I know that's a big "it depends" answer, but that really is the only answer I can give. So that's all I had to say about normalization. I would so love to hear what other myths you have heard. Oh — one of the statements I also quash, and this gets bandied about all the time in the SQL world, is that the right way to do normalization is "normalize till it hurts, denormalize till it works." And that's so factually incorrect, because it focuses on the process, and it implies that normalization is painful and denormalization feels wonderful. All of that's wrong. So I'd love to see, either in tweets or feedback — or you can email me or email Shannon — any of these sorts of sayings people have heard, because I'd like to blog about that. But I wanted to thank you all for a great conversation. Thank you so much, Carrie. This was so much easier for me, and you brought the right level of snark and assistance. I wasn't even trying that hard on the snark, honestly. Well, that's why I invited you. Right. Okay, Shannon, it's back to you. Thank you, Karen, and thank you, Carrie, for joining us this month, and thanks, as always, to our attendees, who are so engaged in everything we do. I love just watching the chats through the whole process and the questions coming in. Just a reminder: I will be sending out a follow-up email by end of day Monday with links to the slides, links to the recording of this session, and probably a copy of the chat, since it was so on fire this time. I hope everyone has a great day. We'll stop the recording here, Karen, for you.