Hello and welcome. My name is Shannon Kemp and I'm the Executive Editor at DATAVERSITY. We'd like to thank you for joining this month's installment of the DATAVERSITY webinar series, The Heart of Data Modeling, hosted by Karen Lopez. Today Karen will be discussing the ticking time bombs in your data model.

Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. We very much encourage you to chat with us and with each other throughout; to do so, just click the chat icon in the top right corner of the screen to activate that feature. For questions, we will be collecting them via the Q&A section in the bottom right-hand corner of your screen, or, if you'd like, you can share your questions on Twitter using the hashtag #heartdata. As always, we will send a follow-up email within two business days containing links to the recording of this session and any additional information requested throughout the webinar.

Now let me introduce our speaker for today, Karen Lopez. Karen is a Senior Project Manager and Architect at InfoAdvisors, with 20-plus years of experience in project and data management on large, multi-project programs. She specializes in the practical application of data management principles. She is a frequent speaker, blogger, and panelist, and is known for her fun and sometimes snarky observations on data and data management. Mostly, she just wants everyone to love their data. Follow her at @datachick on Twitter. And with that, I will turn it over to Karen to get us started.

Hi, Shannon, and happy December; I can't believe it's December already. Thank you for that introduction. I'm always impressed with my bio, because I forget that I do all those things. I want to welcome everyone joining me here today. This isn't our normal time slot, because our normal time slot falls on major U.S. holidays in the November and December timeframe. What we usually do is reschedule for early December to cover both, and also to free all of you up for your mandatory company cookie and gift exchanges, or whatever it is that happens.

I highly encourage you to chat with each other in the chat feature. Please do that; just know that you're writing something on the Internet. I find the chat one of the most fun things to go back and review, and I try to keep my eye on it while I'm presenting. If you have a formal question or comment for me, please put it in the Q&A so that I can see it.

Oh, you'll notice that Tom's name has a strikethrough on the slide. Tom isn't able to join us today because he has a PASS Board meeting, starting today, that got scheduled at the last minute. That means I get free-form ability to aim all the snark and slings that I normally aim at DBAs, and at Tom specifically. It also means that on Twitter you're more than welcome to tell him that I'm right and he's wrong, at @sqlrockstar.

So that's me; I was already introduced. I'd love to hear your comments now, or later on Twitter at @datachick. There's a little bit about Tom on the slide too, and that he's at @sqlrockstar. The reason I left him listed here is that we normally co-present this presentation.
So I'm giving him credit for having gone through the arguments with me about some of these, and the right way to fix them, over the last few months as we've given these talks. Please do share observations or comments about today's event on Twitter using the hashtag #heartdata; I'll try to keep an eye on those as well. I talked to you about the Q&A; use the chat to discuss these things with each other. And for those of you joining late, as Shannon said: yes, you will get the slides, and yes, there will be a recording next week, usually early in the week. We're on a weird schedule, so don't make me promise things for Shannon.

As usual, I'm going to start with a poll so I can get to know you. I'm asking what your primary job is, not what your title is; it's more about the role you play and the things you do. Are you primarily a DBA, a developer, a data modeler, some other type of analyst or architect, or some other type of technical person? Shannon, I'm not seeing any voting; did I open the poll correctly? There it goes. Okay, it was stuck a little; now we have people voting for sure. The reason this matters to the discussion is that today I'm going to put a data modeling spin, definitely a physical data modeling spin, on things, and when we give this talk to different audiences, the reactions and the level of understanding vary with the mix. And yes, I would love to have a lot more categories, but we're limited to about five. You've got about ten more seconds to vote... two seconds... and any moment now you should be able to see the results.

So: only two of you identified yourselves as DBAs, two as developers, about 50 of you as data modelers, 13 of you as other kinds of analysts, four said other technical, and 14 didn't answer. If you picked one of the other categories, put what you do in the chat so I can see it. Dev, data transfers and integrations... yes, all those kinds of things. That gives me a better feel: we're mostly architects and analysts here, along with some other technical people.

The second poll: since we're talking about the gotchas and alarm clocks that go off in databases, or because of a data model, I want to know whether you've ever had a role where you had to be on call, meaning answering problems outside of normal hours, either in your current role, in a past role, or not at all. And you're voting... a few more seconds... twenty more seconds. It looks like a small percentage of you, about 13%, are on call in your current role in production, a little over half of you have been on call in a past role, 15% said not at all, and 20% didn't vote.

The reason this matters, and the reason I talk about these as time bombs or alarm clocks going off in our databases and data models, is that a lot of times we architects, and especially data modelers, go through modeling thinking about the data quality and the integrity, and not so much about all those little options and features off to the side of something. So a lot of what I'm going to talk about today is about designing the columns inside tables, because that seems to be where so much of the time goes.
So just slightly choosing the wrong feature might cause us to put something in our design that works at the beginning, but maybe not so much later on. And then the people who are on call have to deal with it, even if they weren't involved in the design or in making the decisions (let me make sure I show the poll results). This is one of the common dynamics between data modelers and architects on one side and DBAs and developers on the other: a database is having a problem, sometimes an outage problem, and the people who get the call at 3 a.m. aren't actually the people who caused the problem, or who let it happen because they didn't understand the implications of a feature they chose, or who failed to let anyone know that something needed to be monitored.

So this is another one of my modular presentations. For each item we're going to talk a little bit about a specific issue or challenge, a little background, maybe some stories, and what you can do about it. Because of time constraints, and also my personal preferences, I'm mostly talking about transactional designs; that's something that came up last month at a presentation I did. By transactional I just mean not data warehouse, not reporting: a traditional transactional database. It doesn't necessarily mean high-volume transactions; it just means it's usually the system of record, something recording live inserts, updates, and reads through normal transactional processing. And of course, any of the answers and any of the descriptions of problems or solutions I recommend today are based on a particular data story, a particular usage of data. You can't take them as always or never. As always: it depends.

So what do I mean by a time bomb? I don't mean mistakes we make in database design. I have presentations on those, Tom has presentations on those, other people have great ones. I mean something that's a perfectly fine design choice, but that can be misused or misunderstood. The first topic I'm going to talk about could really, based on the presentations I've done, be turned into a full hour on its own. It could be a feature that's great for prototyping, or great because you take the defaults and don't have to think about it. But design features that we put into our databases without thinking about when we're going to run into a problem lead to these 3 a.m. calls.

How do we avoid this? Most of the answers to "how do we keep this from happening" come down to this: besides designing databases and building physical models, we need to make sure there's monitoring, care, and maintenance happening on our systems and our data, to see if the profile of the data is changing, if the transaction loads are changing, if the volume of data is growing beyond what our original design assumed. I know that as I talk to a lot of data modelers, they feel that's not really our job. Well, I think it's kind of our job to make sure those processes are in place, because we know what the business requirements are. As I tell these stories, keep in mind: you think about things at the time you're designing, with what you know then, but the business is changing and the data stories are changing.
And we need to make sure that we continue to understand that changing environment, because it may lead to design changes we need to consider. Oh, you're ready.

So, the first one: identities. A lot of this is going to have a SQL Server spin to it, but these concepts hold for Oracle, DB2, Sybase, anything else. The identity property in SQL Server is a property of a column. In DB2, I believe, it's a data type you put on a column, but they have basically similar features: an identity is something where the database is going to automatically insert a number, based on a starting number (the seed) and an increment. So in SQL Server you create a column, say a Person primary key, and you say it has the identity property; the default, if you don't specify otherwise, is to start with 1 and increment by 1. What happens in the database is that you don't insert that value. You insert all the other columns in the row, you define all the other attributes, and when you insert a row, the database engine keeps track of what the next number is (in SQL Server, it keeps track in a system table) and fills it in. So if you said "one comma one" here, the first row would probably get 1, the next row 2, and the next row 3. I say probably because there are all these conditions under which you might end up with a gap.

The numbers at the bottom of the slide are the highest counts of values that can go in that column based on its data type. For tinyint, it's 255: a tinyint in SQL Server can start at zero and go as high as 255, and it doesn't support signed (negative) numbers. Then we have smallint at 32,767, then integer, then bigint. If you look at the size of those numbers, bigint seems so large that you'd just never run out. You're not going to have that many customers; you're not going to have that many products. Where this all gets crazy is when you start using them in tables that are multipliers of each other. Think of an associative or intersection entity, where you have two already-big tables and you're associating a lot of the values in one with a lot of the values in the other.

Now, in SQL Server, everything other than tinyint can hold negative numbers. But notice I said that in a lot of databases, whether by practice or from reading a blog or a book, people start with 1 and increment by 1. With a signed type, meaning it can hold negatives as well as positives, you've already limited the number of values you can store to about half. Instead of having twice as many numbers available, you only get half. Because what happens is, if you declared the column smallint and start at 1, then when you hit 32,768 you're not going to be able to insert that row, because you've told the database 32,767 is the maximum. That's a huge problem, because if you were inserting your customer record and you could no longer insert customers, I'm going to guess your application or your website isn't going to react very well. It's either going to completely fail or it's going to stop all these transactions; it all depends on how the application is built to respond.
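As a minimal sketch (the table and column names are just for illustration), here's the difference the seed makes on a signed type:

```sql
-- Hypothetical table. IDENTITY(1,1) on a smallint leaves room for
-- only 32,767 inserts; the negative half of the range never gets used.
CREATE TABLE dbo.Person (
    PersonID   smallint IDENTITY(1, 1) PRIMARY KEY,
    PersonName nvarchar(100) NOT NULL
);

-- Seeding at the type's minimum roughly doubles the runway:
-- PersonID smallint IDENTITY(-32768, 1)
```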
The one thing I do know is your database isn't going to be able to fix it. And you can't just come in and say, okay, now start with a negative number. In most databases you can't just alter the table and say, hey, let's start using the negative numbers. You actually have to do what's called an extended alter: create a new table that defines the identity column to allow negative numbers, and move all the data over to it. You might also want to change the data type to something bigger, say going from smallint to integer or bigint. All of that requires some downtime, and it requires a 3 a.m. phone call. Okay, it always just seems to be a 3 a.m. phone call. On the slide is a typical error message your system is going to get.

The real issue: identity is the most common way of creating primary keys in SQL Server, so it hits so many tables. Every book you read, every blog post, is going to use it, and almost every example you see is going to start with 1 and increment by 1, either because that's the default or because that's what you were taught. The third reason, which is the one that really gets to me, is that people tend to use these surrogate keys, these machine-managed keys, as human-facing numbers: your customer number, your product number, your vehicle number, something like that. And humans in general don't like negative numbers. I mean, who wants to be customer negative 666, or customer negative 3,224? So we take what should be a purely technical design decision and constrain it by also using it as a business key or business identifier, and that's where things go wrong.

So what happens is that we as data modelers (I know I was taught this) say: if someone wants it to be an identity, we'll make it an identity and start with 1 and increment by 1. In a lot of tools that's even the default when you go to create one. If you do that, you're already setting up your 3 a.m. production calls to come earlier.

So how do you fix it? We talked about reseeding and trying to reuse old numbers; that's going to be a problem if you have a uniqueness property set on the column. You might try filling in gaps (remember I said you can end up with gaps in an identity column), but you're probably not going to have enough gaps to keep going for long. You might also find that most people misunderstand identities: because they're used so often for primary keys, people think they're guaranteed to be unique. Nothing in the identity property in SQL Server makes it unique, and in most databases you would also need to declare uniqueness, either as part of a primary key or with a unique constraint or index. And the other way to fix it is to change the data type. All of these cause the problem of bringing the database down.

So Tom and I have been working on a script that you can run against your databases, with the normal disclaimers about running anybody's script off the Internet against a database: read it, understand it, and maybe don't run it in production until you really understand it. What it does is work through the whole database. This one is written for SQL Server, but there's no reason it couldn't be adapted to run on other databases.
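The general shape of that kind of check looks something like this; a rough sketch built on SQL Server's catalog views, not the exact script from the talk:

```sql
-- For every identity column: the configured seed and increment, the
-- last value handed out, and the ceiling its data type allows.
SELECT
    t.name                             AS table_name,
    ic.name                            AS column_name,
    TYPE_NAME(ic.system_type_id)       AS data_type,
    CAST(ic.seed_value      AS bigint) AS seed_value,
    CAST(ic.increment_value AS bigint) AS increment_value,
    CAST(ic.last_value      AS bigint) AS last_value,
    CASE TYPE_NAME(ic.system_type_id)
        WHEN 'tinyint'  THEN 255
        WHEN 'smallint' THEN 32767
        WHEN 'int'      THEN 2147483647
        WHEN 'bigint'   THEN 9223372036854775807
    END                                AS type_max_value
FROM sys.identity_columns AS ic
JOIN sys.tables           AS t ON t.object_id = ic.object_id
ORDER BY t.name, ic.name;
```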
It goes through all the tables and tells you, for each column that uses the identity property: its name, its current row count, and then, based on its data type and how the identity is configured, the total possible values, how many inserts you have left, and what percentage of the identity values in that table are used. Then, using a total SWAG, just a guess, of a thousand inserts per second, it gives you the date your 3 a.m. phone call is going to happen. You can see (there's more data returned than this; I just copied out the results I could fit) that the upper rows are probably code tables, departments, locations, with just a handful of rows, defined as integers or smallints. But then look at that BusinessEntityID column: it has a lot of inserts remaining, but I can tell you that on Boxing Day of this year, at a thousand inserts per second, we're probably going to run out of integers. The reason, and of course I've set this up for demo purposes, is that I've defined the identity as starting with one but incrementing by a thousand. And you'd say: why would you do that? Well, some people consider it good practice to have an increment not of one but of a thousand when they're using the identity as the clustering key, the order in which data is physically stored on disk, so the values are spread out and you can insert in between. That may or may not fit your data story, but I have come across database designs where the increment is set to 10 or 100 or 1,000 because people put meaning into these identities. Oh, yes, Boxing Day: Boxing Day is the day after Christmas, if you didn't catch that on the previous slide.

There's another feature used in DBMSs: sequences. Sequences are an object in the database. They're fairly new to SQL Server but have been around forever in other DBMSs, and they work much the same way. The difference is that a sequence isn't a property of a column; it's a separate object, like a stack of numbers. You can go get the next value off that stack and use it in a table, or in several tables. Sequences are used quite a bit for primary keys as well, because they have similar mechanics: they start with something and increment by something else. In the SQL Server example here, you can configure some other features around it: whether it cycles, whether it's cached, what its type is. But a sequence has very similar problems: if you start with one, increment by one, and use integer or smallint, you're going to run out of numbers the same way. So the same logic as before, the same sort of problem, applies to sequences.

Why sequences could carry even more risk (and this isn't trying to talk you out of using them at all; I'm a big fan of them) is that because they can be shared across tables, some other table, some other application or query could say "go get the next value" and do it 10,000 times for its own purposes. And then you find out that this sequence is being used by ten tables, and you're going to run out a lot faster than what you originally designed it for. So a lot of people who build physical models with sequences create one sequence per table, so the sequences aren't shared.
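As a sketch (the object names are hypothetical), here's the shared pattern: one sequence feeding two tables, which is perfectly legal:

```sql
CREATE SEQUENCE dbo.CommonID
    AS int
    START WITH -2147483648
    INCREMENT BY 1;

-- Two tables drawing from the same stack of numbers, so each one
-- burns through the other's headroom.
CREATE TABLE dbo.OrderHeader (
    OrderID int NOT NULL
        CONSTRAINT DF_OrderHeader_ID DEFAULT (NEXT VALUE FOR dbo.CommonID)
        PRIMARY KEY
);

CREATE TABLE dbo.Invoice (
    InvoiceID int NOT NULL
        CONSTRAINT DF_Invoice_ID DEFAULT (NEXT VALUE FOR dbo.CommonID)
        PRIMARY KEY
);
```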
But there's nothing in the database that stops you from sharing one. It's totally driven by the queries that are written, and we data modelers and DBAs tend not to write the queries, so we have no control over how fast the numbers get used. The other interesting wrinkle is that someone could go in and make adjustments to the sequence, and you'd never know about it. Maybe they have a really good reason to do it, but they can make those kinds of changes without you knowing. So it's a similar issue to identity, just in the sequence world: you can restart a sequence, you can try to deal with gaps through all the reseeding, and you can change its data type.

So how do you prevent these gotchas with identities and sequences? First of all, someone needs to do the math. Do data profiling on your existing data. Understand how many transactions you're going to have, how many inserts, how things are going to be used, what the potential for growth is, and do the math. Some data modeling tools help you do the math: if you have 1 million customers and 10 million products, and on average a customer orders five products per order, and on average you have 10,000 orders an hour, but on Black Friday and Cyber Monday you have a million, you can do the math to figure out how big that data type needs to be (there's a back-of-the-envelope sketch of this kind of math below).

You should also question the seed. As a data modeler, I know I do: every time someone says the seed value should start with 1, I ask them, why 1? I think it should start with the lowest possible number, so that we have twice as long before we run out, or twice as long to deal with it. If the answer is "because users are going to see it," then maybe identity isn't the right way to implement that particular data point. Look at the increment value too, and maybe use another feature.

And we need to be monitoring these, whether with scripts like ours or with tools, so we know the current maximum value and where there are gaps, and so we can predict with a countdown timer. Like I said, our script just assumes 1,000 inserts per second, because we had to pick something to calculate a date. If I were you, I'd start collecting how many inserts per hour or per second each table actually gets, and incorporate those estimates into the script, so you get finer predictions, know where your hottest tables are, and can prioritize.
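To make that math concrete, a back-of-the-envelope sketch; all the figures here are assumed round numbers, not numbers from the webinar:

```sql
-- An OrderLine identity at 10,000 orders/hour, five lines per order:
--   10,000 orders/hr * 24 * 365  =    87,600,000 orders/year
--   * 5 lines per order          =   438,000,000 order lines/year
--   int seeded at 1 allows         2,147,483,647 values
SELECT 2147483647 / (10000 * 24 * 365 * 5.0) AS years_until_3am_call;
-- Roughly 4.9 years. Seeding at -2,147,483,648 about doubles that,
-- and bigint makes the question moot.
```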
I'm going to take a minute to look at the Q&A. Okay, that's mostly people telling me what they did; lots of good stuff. I'll get back to the chat later.

So the next sort of time bomb that could go off is data types. This is data modeling 101, entry-level data modeling, and I find that people pick data types that model the world as they individually see it. When I was a young modeler, walking uphill in the snow both ways to work, the concepts I knew about were way out of line with all the data I was actually going to have in my database. So I see a typical design like this, where I have a Customer ID that's NOT NULL, and an email address that's varchar(25), because I've never met anyone whose email address was longer than that. But what could go wrong with that?

Well, I made up the email address at the bottom of the screen; it's based on a possible name plus all the metadata stuffing that some vendors do in their email addresses. So yes, it's a totally made-up email address, but it's much longer than 25 characters. In fact, the Internet standard, what the Internet calls a Request for Comments (RFC), which is how the Internet is run, says an email address can be 254 or 255 characters; let's just say 254. That's a really long email address, and maybe you don't want to support it. But I bet if I went through a lot of your data models, I'd see the numbers I hear when I ask people at presentations: 25, 50, sometimes 75. If you're setting something up as variable character and the standard for that data says it can be 254, I think you ought to allow 254. That way your system doesn't have to fall over just because it found the one person with a long address. And as we enter the Internet of Things, where everything's going to have an IP address and an email address and a Twitter ID and who knows what next, by supporting the standard and having the right lengths, we're doing the right thing by it.

The other mistake with data types is having the wrong one altogether. I see it all the time: Canadians and other non-U.S. people love to lament that while U.S. ZIP codes are numeric (and I'll put a big asterisk there), some of the rest of us, like Canadians, have letters in our postal codes. Someone decides to make the data type for the ZIP code/postal code column numeric, and my postal code won't go in there. So I often have to invent a ZIP code that will fit within your form and your database, and it's all wrong. The reason it's also wrong to use integers for ZIP codes in the U.S. is that we have ZIP codes with leading zeros, and when you put a value with leading zeros into a numeric column, it truncates those leading zeros because they're not significant.

The other thing: there's my name there, López, with one of those funny lines over the o. That's actually how my last name is spelled. I never spell it that way, because those of us who have accents or spaces or anything special in our names know that, for some reason, computer systems hate us and need us to spell things differently. In most DBMSs, if you just choose a regular character data type, you limit the set of characters that can be stored. In SQL Server we use nvarchar or nchar for these extended characters; those are called international features. A lot of times I'll get pushback from a DBA: all those extended character sets, those data types take up twice the storage, and we don't want that, because long data types hurt performance and make indexes bigger. Well, my answer is: seriously, if you want to tell customers that their names aren't valid names and their cities aren't valid cities, why don't we just do that for everybody? Why don't we limit all customer names to three characters and get even better performance? So that's a personal design decision.
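Putting those choices together in one sketch (the names and sizes are illustrative; yours should come from your own data stories):

```sql
CREATE TABLE dbo.Customer (
    CustomerID   int IDENTITY(-2147483648, 1) PRIMARY KEY,
    LastName     nvarchar(100) NOT NULL, -- N'López' survives intact
    PostalCode   nvarchar(10)  NOT NULL, -- '01234' and 'M5V 2T6' both fit
    EmailAddress nvarchar(254) NOT NULL  -- the RFC length, not varchar(25)
);
```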
The way it comes back to bite you is that we're already starting to see Internet domain names and email addresses that use these extended character sets. KarenLopez@InfoAdvisors.com is not the same email address as the version with an accent over the first o of Lopez. That's not the same email address. This isn't just descriptive information; it's prescriptive. It's a different letter, and it could belong to a different email account. So if we use the wrong data types, we're going to get the wrong data, and that would be bad.

The other thing is Karen's rule of thumb, which comes up in my contentious-issues talks: any time the business calls something a "number," I can guarantee you it actually has letters in it. Think about it: account numbers have letters, or dashes, or spaces that are significant, or symbols, or emoji for all we know. Vehicle identification number? Not a number. Right now in the U.S., the Social Security number is a number; in Canada, the Social Insurance number is not a number. We just need to understand that the business loves to call something a number when it isn't. Otherwise, it would just call it what it is, like "total sales." Any time someone calls something a number, that's your clue to go sniffing around, profile some data, and find the exceptions.

Then there's the wrong length and precision. We've already seen the problems with that in identities and sequences, but wrong length works the other way too: not only can we make columns too short, we can make them too wide. For instance: "just make everything varchar(2000) and you won't have to spend all your time digging up the real length." I call these bucket columns. Yes, that's a nice thing to do, until people discover they can start hiding other data inside those columns. I had a client where data was being stuffed into an extra column that was supposed to hold something else, but I've also had this happen with very wide columns: some people get tempted to do an end run around the data modeler and the data governance process by shoving data, maybe pipe- or comma-delimited, into an existing column and then doing the processing against it. Where that goes wrong is when someone else using the data has no idea that data has been hidden in there and starts retrieving it and printing it on reports, on checks (in the case of this client), and on envelopes. The key here is finding a length and precision that has a business justification, that's slightly future-proof, and that doesn't leave the column wide open for abuse.

Then there's the problem of fixing any of those wrong data types, whether it's the wrong type or the wrong length or precision. In some cases we can just alter the column in the table with a quick script: we go in, we take it from varchar(50) to varchar(254), and everybody's happy. But many of these changes require that extended alter: creating a new table with the right design, turning off all the constraints and indexes and referential integrity on the existing table, moving all the data over, dropping that table, renaming the new table we created, turning the constraints back on, rebuilding all the indexes, and rerunning all the statistics to bring it back online.
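Stripped down, the two cases look something like this; a sketch with assumed names, leaving out the constraint, index, and permission handling a real extended alter needs:

```sql
-- The easy case: widening a varchar is usually a quick in-place change.
ALTER TABLE dbo.Customer
    ALTER COLUMN EmailAddress nvarchar(254) NOT NULL;

-- The hard case, the extended alter: build a replacement, copy, swap.
CREATE TABLE dbo.Customer_New (
    CustomerID   bigint IDENTITY(1, 1) PRIMARY KEY, -- was int
    EmailAddress nvarchar(254) NOT NULL
);

SET IDENTITY_INSERT dbo.Customer_New ON;
INSERT INTO dbo.Customer_New (CustomerID, EmailAddress)
SELECT CustomerID, EmailAddress FROM dbo.Customer;
SET IDENTITY_INSERT dbo.Customer_New OFF;

DROP TABLE dbo.Customer;
EXEC sp_rename 'dbo.Customer_New', 'Customer';
```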
That's downtime. That's your DBAs getting called at three o'clock in the morning to make a change because someone wanted a column to be varchar(25) instead of varchar(254). That's the kind of pain you cause for your business that really has very little payoff. So how do we prevent this? Again, do more math: look at your existing data, look outside your tiny view of the world, work with users, and don't be afraid to go longer than what you think is possible on a column, while still finding the right length for it. Find out about edge cases. Get out in the world a little.

So, denormalization. We data modelers get told all the time that we need to denormalize our designs for performance, because joins are evil and joins take time. The typical denormalization I've seen: let's take email address. We grew up when people had only one email address, and now we're extending the customer table to hold multiple email addresses. We're happy to do that, and the way a good data modeler does it is to create a new table: take email address off the Customer table, put it in a CustomerEmail table, and create a one-to-zero-or-many relationship between Customer and that table. Boom. Perfect. Except someone says: having to do that join every time we pull up the customer data, that's so hard. So instead, let's do CustomerEmailAddress1, CustomerEmailAddress2, CustomerEmailAddress3, and everybody's happy, right? Because all we need are three; marketing says we need three, customer service says we need three.

That's actually a legitimate look at the trade-off of supporting a one-to-many. And email address is a good example, because I bet if we did a true accounting of all the accounts we've set up, all the devices we have, Facebook, everything, we'd find that each of us has maybe 25 or 30 email addresses, of which 20 you didn't even realize had been set up for you, like your Facebook email address. If you have Facebook, you probably have an email address associated with it; you might not know that. But in general, most businesses aren't going to care that you actually have 25 email addresses. Maybe five is a good number, so we set up five columns in the table. It's denormalized, which means Ted Codd hates that thing you just did, but it might be a reasonable physical design workaround.

The reason this one might go off: what happens when email address 1 is no longer valid? Do you delete it and shift the remaining ones left? Do you delete it and leave a null there? What happens when 1 and 2 go away, so only 3 is filled? Now every query has to look in 1, 2, and 3 to find an email address; every query has to deal with it. When you go to insert a new email address, do you find the first empty column, or do you always put it in 3? These are the kinds of complexities that come with this workaround. Other denormalizations include manually calculated data, meaning derived data that isn't maintained by the database engine, and rolling subtypes up or down. So here's my slide with email address 2 and 3. I have seen this implemented.
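Side by side, as a sketch (the names and lengths mirror the slide's scenario as I've described it):

```sql
-- The denormalized workaround:
CREATE TABLE dbo.Customer (
    CustomerID    int IDENTITY(1, 1) PRIMARY KEY,
    EmailAddress  varchar(50)   NULL, -- the legacy column
    EmailAddress2 nvarchar(255) NULL,
    EmailAddress3 nvarchar(255) NULL
);

-- The normalized design: one row per address, no shuffling logic.
CREATE TABLE dbo.CustomerEmail (
    CustomerID   int NOT NULL
        REFERENCES dbo.Customer (CustomerID),
    EmailAddress nvarchar(254) NOT NULL,
    PRIMARY KEY (CustomerID, EmailAddress)
);
```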
It kind of works with three of them. But also notice that we have a legacy email column: in this case someone made the new ones 255, while our first one was varchar(50). So we've collected email address 2 and 3, address 1 goes away, and if we move 2 or 3 into the original column, is it going to get truncated at 50? How should we deal with that? All of a sudden this adds complexity.

The other thing that goes wrong: if we have manually calculated derived data in order to speed up a query, then everyone has to know about it, or we have to have a process, or heaven forbid a trigger, that updates those values. And everyone has to know that the data can get stale, or that a manual update to fix something might not know the derived data needed fixing too, or that the whole arrangement can hurt performance in another way.

Here's another favorite, one that has less to do with what a data modeler does and more with the queries people write: SELECT *. SELECT * means "bring back all the columns from this table in my query." People love it as a shorthand, and they do it because it's faster: faster to write the star than to name all the columns, or because they're using an ORM or other code-generating tool that just does it automatically. In some tools, even when you specify the columns, if you specify a majority of them, the generated code that comes out of the tool will turn it into a SELECT *. The problem with those is that if you're using SELECT * in a query that pulls data into another table, like an INSERT INTO or SELECT INTO, all of a sudden you're bringing back columns your target table didn't know about, because someone did maintenance on the source table, and something's going to fail. The other reason: what if someone adds a very large column, a blob, an image, XML, or JSON? All of a sudden you're pulling back a whole pile of extra data and your application is slowing to a crawl. Developers and development tools love SELECT * because everything just goes faster, until it stops. That's the hard part about SELECT *: you're trading off developer efficiency, and as I'm always reminding my developers, developer efficiency is not in my project charter. It's not one of my goals. I want you to be proficient and efficient, but it's not the goal of a design.
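A small sketch of that failure mode (the table names are hypothetical):

```sql
-- Works today: source and target both have the same three columns.
INSERT INTO dbo.CustomerArchive
SELECT * FROM dbo.Customer;

-- Then someone does maintenance on the source table:
ALTER TABLE dbo.Customer ADD LoyaltyScore int NULL;

-- Now the same statement fails hard:
-- Msg 213: Column name or number of supplied values does not
-- match table definition.
```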
I'm going to take a minute and look at the Q&A. "What do you think of year being number type or character type?" If we're talking transactional systems, I treat date parts stored separately like this: first, if we're talking about a date or a time, I want to use the date or time data types. If we're talking about just a year, I want it to be an integer, because it's quite likely I'm still going to do math on it: how many years from something or to something. My rule of thumb for whether to use character or numeric for a value is whether we're realistically going to do math on it. If we are, I want to store it as something numeric. If it's something like a ZIP code, which in theory is numeric in form but we don't do math on it, I want to store it as character, not just because of the leading zeros, but because it's something we're not doing math on. Another example in the U.S. is the Social Security number. I don't want to do math on Social Security numbers; I'm not going to subtract my Social Security number from yours to come up with some magic number. Now, having said all that, I do know it's common in some warehouse designs to store dates as characters and get some performance gains, and I'm not really clear on that data story, but I know it's common.

So how do we monitor for SELECT *? SELECT * is a governance thing, so that means code reviews and script reviews, and there may even be features in your database tools or development environments to sniff out all the SELECT *s in your queries, so you can point out how much risk you're taking by using them. In SQL Server, I believe you can use things like Policy-Based Management to find them. Those work for queries that live inside the database, like in a stored procedure; they won't catch scripts that run outside the database, like in your application. But SELECT * is the big gotcha: your system is going to fail hard when it tries to insert data into a column that doesn't exist, and there's no way to gracefully fail out of that. Your application might say, hey, we can't do this process, but you're not going to get the data. So if people are using SELECT *, you're constantly playing catch-up: fixing the source table, fixing the query, fixing the target table, or getting rid of the ORM tool that allows it, and possibly re-educating your developers that they need to stop all the uses of SELECT *. Someone's saying they use SELECT * on DimDate queries. Well, that's good; it might still go off sometime.

As we get close to the end, you can let me know in the Q&A if you've come across any MacGyver moments, times you've had to fix something with a pencil, a paper clip, and your DBA at 3 a.m. I'd love to hear what went wrong and how you fixed it or prevented it. Anybody got anything? I'll go back through the Q&A.

All right, some good points. Social Security numbers: I didn't think Social Security numbers had leading zeros yet, but I know the Social Security Administration went through some major changes in how they assign them and how many values there are, and there are definitely EINs, entity or tax identification numbers, for people or companies who aren't eligible for Social Security. So dropping leading zeros can be a problem. Someone said they think varchar(max) is the default data type for all strings when developers use Entity Framework to drop or create tables. Could be; in SQL Server, the default data type for some tools was nvarchar(50). That reminds me of other default data types. In Erwin, the default data type if you don't specify one for a column is, I believe, char(18). So that was a good way of reviewing data models in Erwin: go find all the columns with char(18). That was probably a good code smell, a clue that someone forgot to specify a data type.

I actually have a blog post over on the Embarcadero website about this. In ER/Studio, you used to be able to change the "default default," the data type the tool assigns when you don't choose one, to a negative length, and I believe you could do that in Erwin as well. ER/Studio has since fixed it so it won't let you choose a negative length for a data type. And you'd say: why would you do that, Karen? That's not valid DDL.
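Right; as a sketch, the generated DDL would come out something like this:

```sql
-- A column nobody finished modeling picks up the broken default:
CREATE TABLE dbo.Person (
    PersonName char(-99) NULL -- invalid length: this statement won't run
);
-- SQL Server rejects the DDL, so the unfinished design can't sneak
-- into production. That failure is the feature.
```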
That's exactly why I would do that. I set the default data type, the one the tool assigns if you don't specify one on your own, to char(-99). The reason is that I wanted the DDL to fail when I tested it. I could write macros, or search for all the columns with a data type set to something weird, but I wanted the DDL itself to fail, so that it couldn't be put into production with an incorrect data type. What I recommend now is setting the default to a data type you realistically would never hit; and see, I can't even tell you what that is, because it varies based on your data stories. I'd love to hear what default data types you use.

I'm not really seeing anyone telling me about designs of theirs that ended up not doing so well. Let's see. "What about identification fields that you expect to join on? Would you model those as character fields or numerics?" By identification fields, I don't know if you mean the identity property or just primary keys. I want to give data the type that's right for it: something should be integer or character because the data itself is integer or character. Now, having said that, most modern relational database designs assign surrogate keys as the primary keys, and in virtually all cases, except for GUIDs, that primary key will be an integer or a bigint or something like that, because integers take up less space to store. A smaller primary key, especially if it's the clustered key, makes indexes smaller and makes finding values faster, because the smaller the piece of data, the more rows fit on a page in the database. There's a whole rationale for why everyone wants all data to be numeric and will tolerate characters only as a last resort, which is how it was explained to me once. The problem with that thinking is that those choices end up being time bombs as well. So I'm an advocate of choosing the right data type for the data, and it shouldn't matter whether we're joining on it or not; it should just be the right data type. In practice, we typically join primary keys to foreign keys, and we typically make those surrogate keys anyway.

So, some other time bombs that can happen, not just inside a table but inside a database. Collations. A collation is something that tells you, basically, how data gets sorted. Different collations cause different things to happen. For instance, if you ORDER BY customer last name and your collation is set to, say, a traditional Spanish collation: Spanish has letters beyond what we have in English, combinations that to me, in English, look like two letters, like CH, plus letters with special characters, all with their own sort order. Setting the wrong collation, or requiring a particular collation on a table, can end up giving you unexpected results.
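A quick sketch of what I mean; Traditional_Spanish_CI_AS is a real SQL Server collation, and the names are made up:

```sql
-- Under a traditional Spanish collation, CH is treated as its own
-- letter that sorts after all the other C words.
SELECT v.LastName
FROM (VALUES (N'Chavez'), (N'Cruz'), (N'Cano')) AS v(LastName)
ORDER BY v.LastName COLLATE Traditional_Spanish_CI_AS;
-- Returns Cano, Cruz, Chavez; an English collation returns
-- Cano, Chavez, Cruz.
```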
Definitely the biggest time bomb in a database is running out of storage space; that's why there are products out there to help you monitor for when you're going to run out, both inside the database and in the transaction logs that get applied. And definitely a career-limiting move, a resume-generating event, is having either no backups or bad backups that you haven't tested, and therefore no restores.

Let's see which others of these I like to talk about. Bucket columns I covered. Some of these I don't have time for. Faux nulls: setting all your columns to NOT NULL, but then having to store fake values like January 1, 1900, because if you make a date column NOT NULL, you have to give every row a value there. Using GUIDs, those really long globally unique identifiers, when a sequence would do. There are all kinds of things that will get you through your day job but either leave you a lot of night work or cause it for someone else, which could also be a career-limiting move. And I'd still love to hear about all the other things you've seen happen.

So remember, today we talked about things that are perfectly fine design decisions but that weren't fully implemented, aren't monitored and managed, aren't fully understood, and require emergency response; and sometimes your paper clip and a stick of gum are not enough to take care of that. I want to thank you so much. I didn't have time to go through and read all the chat, but I would definitely love to hear about any other things you've found in working with your data models and databases that we should all know about, and maybe I'll blog about it. Shannon, that's you.

All righty, Karen, thank you so much for this fantastic presentation. I loved it, and I loved all the examples you gave throughout. Just a reminder to all the attendees: we will be sending out a follow-up email within two business days with a link to the recording of this session and anything else requested throughout. In addition, you can meet Karen live and in person at Enterprise Data World 2016. Currently you can use discount code HOLIDAY to get additional savings, and DAMA members and DGPO members can likewise get savings on Enterprise Data World. We hope to see you there. Thank you to all the attendees for your participation today. We'll turn off the recording now, and then, Karen, you can speak in an unrecorded fashion. Excellent. That's my...