 What's going on everybody? Welcome back to the Excel tutorial series. Today we'll be looking at how to clean data in Excel. Now knowing how to clean data in Excel is actually extremely useful and there are a ton of techniques to do this. I'm going to be showing you the ones that I probably use the most. I feel like are the most helpful to kind of do the bulk or the majority of the data clean that you're going to do in Excel. Like I said, there's so many different ways and some very specific things that you can do, but I'm going to highlight some of the bigger ones that I find the most useful. And some of you may be thinking, well, I'll just do my data cleaning in SQL or Python or when I get it ready to put it in Tableau. But honestly, a lot of the data cleaning, at least a lot of the big stuff, I tend to do in Excel if the data set is small enough to fit in Excel. And so I think it's actually really, really useful to know how to do this because you'll most likely be doing it more than you think. Now before we jump into the tutorial, I want to give a shout out to the sponsor of this video and it's a brand new sponsor. It is Unlocked by Z by HP. Unlocked is a movie that's actually broken up into four parts and each of them have a unique data science challenge associated with it. Now I'm going to read this next part because it's extremely interesting. Each challenge represents a different topic. So there's data visualization, text analysis, audio signal processing and computer vision. And you can submit your answers and your work on their website for a chance to win one of 10 Z book studio laptops or a free trip to the Kaggle world championships. So I'll leave a link in the description where you can go watch the movie and then do the challenges and then submit your answers for a chance to win. You should also go check out their hackathon where you can do these projects with other people just like you who are trying to figure out these answers and submit them to win as well. So go check that out. Thank you again to the sponsor of this video Unlocked by Z by HP. Now without further ado, let's jump onto my screen and get started with the tutorial. All right. So let's jump right into it. I have this US president's data set. I got the base data set from Kaggle, but I added some of my own data and then I messed some stuff up as well just to kind of demonstrate some of these things I want to be looking at today. This is not a full project. So, you know, we're actually going to be using this to create any visualizations or anything like that. So, you know, all this is just for demonstration purposes, but we will be doing a full project in about two or three videos in this Excel series where we're going to be doing from start to finish with a real data set. So, you know, if that's something that you're wanting, then we will absolutely be doing that. Now, something that you may be wondering is how do you actually identify what you need to clean in the data? What do you know to look for? Well, some of the obvious things are things like formatting and standardization. So things like, you know, this James Monroe is in all caps that happens all the time with real data. And so, you know, you want to standardize that or this all lowercase you want to standardize that you want that all to be the same. There's also things like right here, or we have this wig and this wig with a bunch of random stuff after it. This happens all the time where it's not completely standardized. And you may even notice, you know, there are some spelling errors in here and I'll we'll kind of look through that in a little bit. And then, you know, there are things like additional spaces where there shouldn't be spaces. There are things like currencies that you need to be aware of, if you were importing this into or going to be importing this into a SQL database. Things like currencies can be just a problem or be really unnecessary. It may actually cause more issues in the long run. So you may just want to, you know, take that to the base value. And then dates are always an issue. Always, always, always. So always look at your dates, make sure they're they're formatted correctly, make sure they're all the same. These are the types of things that right when I glance at this data set, these are things that I'm looking for. One other thing that is actually the first thing that we're going to start out with is you want to make sure that your data is not duplicated because if your data has duplicate data in it and you don't want that, it's not supposed to be there. There are some specific use cases where duplicated data is okay. You know, you want to get rid of that. And it's very easy to do in Excel. The first thing we're going to do, we're going to go to this data tab. We're going to go right over here and we're going to see if there's any duplicates in our data. So we're just going to go up to remove duplicates. It's going to automatically choose all of your columns to check against. So it's going to, from A all the way through I, it's going to see is the exact same data in all these rows. And if it is, it's going to get rid of it. And so we're going to click okay. And it did find one duplicate. And I'll show you that one real quick because, you know, it was right here. So Barack Obama was here twice. And then I'm going to hit Ctrl, I hit Ctrl Z to go back. I'm going to hit Ctrl Y to go forward. And it removed that row completely. Now, in this example, you may be able to spot that with your eye, but in a real data set where you have 10,000, 100,000 rows, there's absolutely no way you're going to see that. Or very, very unlikely that you are going to see that there's duplicated data in there. So just running a quick de-dupe or removing of duplicates, that is really important to make sure that you have gotten rid of those things. So that's one of the first things that I do. We're going to go into a lot of these different columns. And I'm going to kind of show you different techniques or things that I do when I look at actual data. So I'm going to come right over here. I'm going to Insert. And this is what I actually do. I usually create a separate column, especially when I'm working with this, because I don't want to change this one. I don't want to go in here and, you know, say equals upper equals proper, etc. There's a lot of different ways that you can change names or not a lot, but the main ones that you can change names and all of them are completely okay. So for example, I'm going to hit equal upper, and I'm going to go like this and close my parentheses. So I selected the cell, I close my parentheses, hit Enter. It is complete. And I'm going to hit, in the bottom right, I'm going to hit double click this. It's going to apply it to all of them. It is completely okay to have your data like this if you want it to be like that. If you want it to be all lower, you can do that. If you want it to be in proper case, you can do that. There are different uses for all of them. And honestly, as long as it's all the same, typically it's okay. But if, you know, for example, if you're selling this to like a third party company or something like that, they may have what they want for their ingestion process when they take your file in. If you send, you know, a weekly file or a monthly file, they may want it exactly how they want it. And you can change that to what they want. But as long as it's standardized for you, it's all the same for you, that is a good thing. So now we have all of these in the proper case. That's typically what I do, or I use upper. Those are the ones that I use the most. I don't usually use lower. And if you go in here and you type in lower, you know, it changes it to all lower. I don't typically do that. And I'm going to add, I'm going to say president dash fixed. And so now all of these names, all of these different uppercase and lowercase, these are all fixed. And it just makes it so much easier to read and you don't have different uppercase and lowercase issues. It's all the same. So I'm going to keep that right there. If we move a little bit to the right, if you look at this prior, now this prior is a mess. It has stuff all over. And to be honest, this is not really something that I would probably be using. Like in a real data set, I would look at this column and I'd say this is pretty useless. If I had a very specific use case for this data in this column, I might try to parse it out and do something, but I don't. This is a completely useless column to me. So I'm actually going to skip this one. I'm going to go to this party one. And this party one to me, it looks pretty important because this is something that I know I can group by and I can create visualizations with and kind of break that out. And if you look right here, we're going to add a filter. So now let's open up party and take a look. So if we look right here, we have Democratic, Democratic dash, Republican, Federalist, Nonpartisan, Republican, Republicans, Wig and Wig with a date and some information, the back of it and then some blanks. And it's really important when we're looking at these ones that we think we might group by that we have these properly grouped. So Republican and Republicans to me right off the bat looks like a spelling error. And so I'm just going to deselect all. I'm going to go to Republican, Republicans, and it's literally Republican all the way down except for this last one. And to me, that's just something that I would update. So I would just go right here. I do that. If I didn't do that, and then I try to create, let's say a pivot table on here, I'll have its own group of Republicans and it wouldn't be added to Republican. And maybe that's on purpose. But let's just presume that we know this data extremely well. And that's not supposed to be like that. Right? Again, that just comes back to knowing your data really well, understanding what it, you know, what it should look like. And we know that it should not be like that. So we're going to fix that. The next thing that we're going to fix, and as you can see, it got rid of it. Next thing we're going to fix is this wig. That's just like an error. That's some issue on the data side. And we're just going to fix that by updating it. And that's it. I would always be keeping a copy of this with the raw data somewhere else, because this is presumably like a working document. This is not a, you know, you aren't saving over your original file. Let's just say that. And then let's take a look at these blanks real quick. Okay. So there are these rows right here that have nothing. I think we're okay. But if we see anything different, 47, 48. Okay. So yeah, it's just these ones right here that have no data in it. Anyways, it's just seeing it in the filter. So not an issue at all. So okay, we're looking good. We've gone all the way over. We fixed this president. We skipped this one. We cleaned up this party. And I kept this one in here because I'm not exactly sure if that's a Democratic or Republican. So I'm going to keep it its own thing. I'm not a huge history buff in that aspect. The next one right here is, the next one right here is really easy. This is something that happens all the time, especially on actually, most often it all happens on numerical data. So like, you know, there'll be a number of 1001, then there'll be a space after it for absolutely no reason. And it happens all the time. It does happen like this as well, where you'll see this. And all you got to do is do trim and select the cell and close that front seat. And we're going to apply that all the way down. What is so fantastic about the trim is that it's really intuitive and it knows basically everything it needs to do. For example, it gets rid of the spaces before. It gets rid of extra spaces in the middle. And it'll get rid of extra spaces at the end, which you wouldn't be able to see, but they are there and they absolutely can cause issues. If you have spaces at the end that you cannot see, let's take this one. For example, like if I had spaces at the end, that can cause issues when you insert or put that into a database. That happens a lot with numbers. You know, when you're putting that into SQL, that can cause issues. And so you really, it is important to actually do that trim. And you can do that on all of your columns or just ones that you know you're having issues with. But once you import that data into SQL, you will know if there's an issue or not when you actually try to start using it. So we're going to say vise and we're going to say fixed. There we go. This next one is one that you'll run into a lot. When you're working with numerical data, you will encounter so many different issues. One that I run into a lot is I've worked with a lot of cost data or pricing data. And when it's in an Excel, it sometimes comes in with these currencies like a dollar sign, a pound sign, things like that. And when you put that into SQL, it just is a nuisance, right? You're not going to be able to run, it's going to go in as a text or it's going to be like a string, right? Because it has that special character. You don't want that. You don't want to have to then go in and then change things around. You just want to be able to start doing calculations on those numbers. So what you can do is sometimes it'll come in as a text, sometimes it'll come in as a currency, which I think this one's a currency. We are just going to change that to be a number. And then we're going to get rid of these, oops, and get rid of those. It doesn't look as pretty, but that is much more useful than actually having the currency on there with the decimals. This actually is so much easier when you want to use it for almost anything because you're able to add and do things properly in other systems. In Excel, I think it does understand it, but that can cause issues. So there is how you do that. The next thing that we're going to look at is these dates. And just notoriously, whenever I see a date field, I know there's going to be an issue with it. It's very rare that I get a date field that is perfect. It just, it is genuinely is a novelty when that happens. And most of the time it has to do with, let's say a date comes into Excel and it's in a text format or date comes into Excel and they're not the same. In this example, they are not the same. And we just want them to all be similar. They say date. If you look right here, it says date. It says date. It looks like it should be the same. But if we go like this, it all looks the same, right? There's no issues at all. If we were to try to use that, it may or may not be an issue, but we don't want to leave that to chance later on if you're using this with Python or something like that. It can cause issues. Maybe not in SQL because it may see the underlying, what's in the underlying cell, not just what we see, but some systems won't. And so you want to make sure that they're all the same. And so, you know, what we were doing back here with a party and we were looking at this filter and identifying the issues, I usually do that on date fields as well. And oftentimes, just for demonstration purposes, oftentimes, I will get something like that. And then I'll come up here and I'll notice that there's this one random number. That happens all the time. All the time. And so, you know, you want to make sure that you, that you look at these things and just do at least a quick glance, if not kind of doing a kind of a deep dive into it. All we're going to do is we're going to do both of these. And we're going to do a short date. And let's take a look and see if that fixed it. And so now they are all the same format. And that is fantastic. That is exactly what we want. We're going to go back through here. We're going to get rid of these. Again, this is a working, this is a working document. Oops. We need to, I'm going to do control shift down. Oops. Let me go back up. Do control shift down and copy. And what I'm going to do right now is I'm actually going to copy. All right. Let me do it right here. I'll show you. Sometimes I do this. It just depends. I'm going to go right here. I'm going to hit right click and I'm going to paste as a value, which means it's not going to take the calculation or the formula that I just did. It's going to actually paste it as that value. So we just replaced it. Right here you can see up here it says equals trim of G2. This now, now that I copied and pasted it over as a value, it got rid of that calculation and now it is actually a string. So we don't need this anymore. And I'll do the same thing over here as well. I'm going to control shift down, copy and I just hit the right key or the left key. Sorry. Now I'm going to right click and I'm going to do paste as a value. And again, it has this proper and now it doesn't have the proper. It's actually the value that was here. So that's really important to note. And we're going to get rid of that one. And so now what we have is already looking much better. Now one of the last things that we're going to look at is deleting columns that we are not going to use. And this is why it's so important to keep a backup or the raw data not in this file. Because if you start saving over this file and this is your raw file, that can mess up a lot of things. And that happens to me before and it's terrible. And then you have to request another file or you have to go back and find it or something like that. It's terrible. So this is our working document. So we can mess with this and do whatever we want for our purposes. Now for us, I can already tell you that this prior is a bunch of nonsense and we do not need it. We're not going to use it for anything. And if we have, this is a very small dataset. It only has like, let's say, one, two, three, four, five, six, seven, eight. We have like eight columns that were kind of using that data, eight or nine. Now that's a small dataset. I've had ones with literally like hundreds. And it has so many columns, so much data. And sometimes it's good to just trim it back to the things you know you're going to use. This to me is absolutely useless. We're going to delete that. And then right over here, it's pretty redundant. It's just one number off. But if we scroll down just a little bit, it goes, it's basically just counts. It's a, you could even call it a unique identifier if you want. Sure, why not? But we don't need both. So we're going to get rid of this first one. And now we have more of the useful and relevant data rather than the stuff that we absolutely know that we are not going to use. These date updated and date created, we may never use them, but we might. So it doesn't hurt to keep it on hand. Those other ones are ones that we are almost certain we will never use. Again, keep it back up just in case you need it. You can always go back and get it. So, you know, if you go back to what we started with and you look at what we have now, it is much cleaner. It's much more usable. And these are small subtle changes, especially with this very small data set of only like 50 rows or 46 rows. But you're going to be working with data sets that are thousands, tens of thousands, hundreds of thousands of rows. And you need to know how to kind of look at this data, standardize it, format it properly for what you're going to be using it for. If you're keeping it in Excel, there are different things that you may do than if you're putting it into a database or going to be using it with, you know, using Python to access it. So you need to kind of know your use case. But these are some things that I do all the time to kind of clean up the data before I use it for something, whether I'm creating pivot tables or I'm inserting it into or I'm putting it into SQL. These are things I do all the time. And so hopefully that helps give you kind of an idea of some of the things that you should be looking for when you're actually cleaning data. And it's really important to understand why you're actually making these changes and the reason you're making these changes because some of the things that I did today may not be things you want to do on a different data set that has different uses and different purposes for. So, you know, take everything that I've said and apply it with a little grain of salt to your data set because your specific needs may be different than what I wanted when I was cleaning my data set. So I hope this was helpful. I hope this gave you a small glimpse of some of the things that I'm looking for when I clean a data set or I get a new data set in and I'm kind of, you know, analyzing it, figuring out what I need to fix in it. I hope this has been helpful. With that being said, thank you so much for watching. I really appreciate it. If you liked this video, be sure to like and subscribe below and I'll see you in the next video.