 Hello everybody and welcome to the very first project in the MySQL series. Today we're going to be focusing on data cleaning. Now if you don't know what data cleaning is, it's basically where you get it in a more usable format. So you fix a lot of the issues in the raw data that when you start creating visualizations or start using it in your products that the data is actually useful and there aren't a lot of issues with it. So that's really what data cleaning is. Now what we're about to do is create a database. We're going to import a data set. This is a real data set. And what we're going to do is we're going to clean the data. So I'm going to show you and walk you through all the steps in order to clean the data. The data set that we're going to be working with will be in the GitHub. So you can just go and download that. I'll have a link somewhere in the description. But let's get started. First thing we're going to do is create a new database. We'll go right over here to create a new schema. And we're just going to call this one. We'll do this is world underscore layoffs. So if you can't tell already we're going to do world layoffs. That's the data set that we're going to be doing. We'll just click apply and that creates our world layoffs right here. Now we're going to go into here. There are no tables. We're going to right click on tables and go to table data import wizard. Now we haven't done this yet in this series. We haven't imported any data but that's what we're doing here. We're going to show you how to import data. So we'll go ahead and click browse. And as you can see right here we have this layoffs data set. Let's open this up and we're going to click next. And we're going to create a new table. There's no existing table in this database. You can drop it if it exists. If you'd like to it doesn't matter if this is new. We're going to go ahead and select next. Now right here is where you configure import settings. Now MySQL is going to automatically assign a data type based off of the data in these columns. So we'll take a look at the data later. Now there is one thing that you can take a look at real quick. We have this date column. Now in here it assigned it as a text. That's because of the format. We are going to import this as the raw data. We're not going to try to change anything in the import settings. We're just going to assume this is how the data was in the table. So we're not going to change anything. Although this may be something that you would want to change to something like a date time and go and fix that. But we're going to import this as the raw data. Let's go ahead and select next. We're going to import it. We just select next. Now this could take a little bit. So while this is importing, I'm just going to skip ahead. This should take just a few minutes to import. All right, this just finished. Let's select next. And we imported 2361 records. Let's go ahead and select finish. We can get rid of this. And let's refresh this. Perfect. We have our layoffs table. So we'll select everything. And I'm going to go and double click on the world layoffs because I don't want to write out the whole thing every time. So we're going to say from layoffs. And let's see what we get. So let's take a look at the data that we're going to be working with in this data cleaning project. So this data set is layoffs from around the world starting I think 2021. And we'll take a look at that in this date column later. But it has the company. So it has the company that did the layoffs, it has the location of where they are, what industry they are part of, how many they laid off, the percentage that they laid off. So the percentage of their company, the date, the stage, which refers to the stage that the company is in, whether it's a series B post IPO, they don't know. Then there's the country. And then we have funds raised million. So we have a lot of information here. And in the next project, we're going to be doing exploratory data analysis. So we're cleaning all of this data. And then in the next lesson, we're going to actually dive into it and try to find trends and patterns and all these other things. So what we are going to do is we're going to go through multiple steps. Step number one is we are going to try to remove duplicates if there are any. That is the first thing I typically do, especially if I know this data shouldn't have any duplicates or it'd be repetitive or unnecessary to have duplicates. The second thing is going to be to standardize the data. That just means that if there are issues with the data with spellings or things like that, we just want to standardize it to where it's all the same as it should be. Number three is we'll look at the null values or blank values. And there's a lot of no values in here. There's even a blank value right here. And we're going to see if we can populate that if we can. And there are times where you should there are times where you shouldn't. I'll kind of walk through that as well. And lastly, we want to remove any columns and rows that aren't necessary. And there's a few different ways to do that. This one is a little bit, you know, let me write this actually real quick, remove any column. So I'm just going to say there are instances where you can do this, there are instances where you shouldn't do this when you're working with massive data sets and you have a column that's, you know, completely irrelevant, completely blank, you don't have any ETL process that is required for it. You can get rid of it and it can save you time when you're querying your data. Now, with that being said, and we'll talk about this later, in the real workplace, oftentimes you have processes that automatically import data from different data sources. If you remove a column from the raw data set, that's a big, big problem. So what we're going to do is something I would actually do in my real work, which is I would create some type of staging or raw data set, let's say this one's our raw one, and we could have even called this layoffs underscore raw, we're going to create another one, we're going to create a table. So we'll say create table. And let's call this one layoffs underscore staging. And we literally just want to copy all of the data from the raw table into the staging table. So we can do that really quickly by just saying like layoffs. And if we run this, and we refresh, you'll see we have this staging database. And let's copy this. Here we go. We'll do layoffs underscore staging. And so now we have all of the columns and all we have to do is insert the data. So we're just going to say insert, then we're going to say layoffs staging right here. And we'll select everything from layoffs. And let's run this. And if we select the table, we now have all the data over. So super, super easy. And now we have these two different tables. Now, again, why do we do this is because we're about to change the staging database a lot. If we make some type of mistake, we want to have the raw data available. This does happen. This is something that you do in the real workplace, because you're not going to work on the raw data. It just you shouldn't do it. It's not best practice. So I'm going to show you what I would actually do in my, you know, like a real job. So that's what we're going to do. Now we're only going to be working off the staging database. And we can copy this and make different databases for different things. As long as we have our raw data, we can really do anything we want going forward. And that's what we're going to do. So the number one thing we're going to look at is to make sure that we are removing duplicates, we want to make sure we don't have any duplicate data in here. And if so, we're going to get rid of it. Now really quickly, if you did my Microsoft SQL server project, we did something very similar, but we had an extra column over here that gave the unique row ID, which made it really easy to remove the duplicates. Here, there is no identifying factor that's going to be easy for that. So I'm just going to tell you upfront to remove these duplicates is not going to be easy. We'll walk through it every step of the way. So what we can do is try and do something like a row number, and we'll match it against all of these columns. And then we'll see if there are any duplicates. Now I'm just we're starting off strong. Okay, we're jumping into kind of some of the more advanced things. It does get actually easier as we go. But this is the actual order that I follow. So I'm going to keep it. So let's try to identify duplicates. So let's copy this. Let's pull this down, do underscore staging. There we go. Now what we can do is we can do row number. And we'll do that partition by basically, we could do every single one of these columns. That's kind of what we're doing. So what we can do is we can say everything, then we can do a comma. And we'll say row underscore number. And it would be just like this. And we're going to do this over. And we want to partition by all of these columns, essentially, we could just do a few for now to see if we get any hits, and then we can look at that. But they're going to be multiple companies that have layoffs in the same location and industry. Although they're total laid off would probably be different, the date would probably be different. So if we do something like company, let's do industry, we will do total underscore laid underscore off, comma percentage laid off. And then let's do date. Now I'm doing date with the back ticks, because date is a keyword in my SQL. So if we do it like this, it just really makes it easy. So we're going to partition by all of these things. So let's do partition by, and let's bring this down real quick. So I'm just going to say over partition by, and we're going to call this as row underscore numb. Now let's try running this. Let's see if it works really quickly. It's important. And over here, you can see that we have our row number. Now these mostly are unique. And these all look unique. I'm not going to scroll through all of them. But we want to be able to filter on this so we can filter where the row number is greater than two. If it has two or above that means there's duplicates. That means there's an issue. So let's go ahead and we're going to take this, we'll put it into either a sub query or a CTE. I'll create a CTE for this, because it's really easy. So we'll say four, or not four, we'll say with, then we'll do duplicate underscore CTE as, then we'll just do our parentheses. We'll paste this in here and get rid of that right there. And now we're going to say select everything from this duplicate CTE. Then we'll say where row underscore numb is greater than one. Let's run this and set a semi colon. Let's run this. And you can see that these ones have duplicates. So these are our duplicates actually, and we want to get rid of these exact rows. Now just to confirm that these are the duplicates, let's look at this one. I never heard of this company, but we'll take it really quick. And let's select, we'll say where company is equal to, we'll call this ODA. So let's run this. And it looks like these, no, no, no, no, these aren't duplicates. That's a good thing we checked, okay? Because it looks like these aren't the exact same. Although they're very, very close, these technically are not duplicates. So I'm glad we checked this. We need to do this partition by over every single column. That's what I'm realizing. So we'll do company comma location. I'm glad I'm genuinely glad we're, you know, it's good to make mistakes, and figure things out as you go. It really is important. So company location industry, total laid off, percentage laid off, date, then we'll do stage, and then we'll do country, and then funds underscore raised underscore millions. So we're changing the CTE to partition over everything. So now let's run this. ODA is not in there. That's the only one we checked. But let's look at Casper. I know this is these are the aren't these the mattress people. I know they had layoffs for guys. All right, let's take a look. It looks like this row. And this row are duplicates. These are our duplicates. So we are going to want to remove only one of those. We don't want to remove all of those. So just looking at this one example, it looks like this query is working well. So here's our duplicates. Now, we need to identify these exact rows. We don't want to delete both of them. When we look at Casper, there's the real one that we want to keep. Then there's a duplicate that we want to remove. We don't want to remove both. That would be bad. Now in my SQL, it's a little bit trickier to remove things than it is in something like Microsoft SQL server post grace SQL. They have different ways that they can delete rows. For example, Microsoft SQL server, we could literally identify these row numbers in the CTE and delete them from it. And it would delete it from the actual table. We can't do that in my SQL. And I'll show you. Let's actually copy this. We'll go like this. And we'll say, let's say we want to delete these. We'll say delete from, we're deleting this from where the row number is greater than one. What am I writing right here? Delete. There we go. So delete from this duplicate CTE where the row number is greater than one. That's all these duplicates. We want to remove them. Let's try to do this. Let's run it. Let's go down. If we look at the bottom, it says the target table duplicate CTE of the delete is not updateable. So you cannot update a CTE. A delete statement is like an update statement, essentially. So what we are going to do is we're going to do something a little bit different because this is how I would love to do it. That makes it super, super easy to remove duplicates. But that is not always the way that things happen in the real world. I think what we should do is take this right here, and let's run this. We should take this right here and put this into, let's say, a staging two database. And then we can delete it because we can filter on these row nums, and we can delete those which are equal to two. So it's essentially like, you know, creating some type of table and then just deleting the actual column. So we're, that's exactly what we're going to do. So it's essentially just creating another table that has this extra row, and then deleting it where that row is equal to two. So, you know, somewhat fairly straightforward, but let's try it. And let's see what happens. So we're going to come down here and do is create our table. Let's try doing that with here. Let's, let's copy the clipboard a create statement. Let's see if this works. Perfect. That's exactly what I wanted. Now, all we're going to do is say we're creating the table layoff staging two. Now this is a create table statement, and we're naming the columns, and then we're also assigning the data type. So we have all these things, but we want one more. I do a comma, and we want to add row underscore num, and I need to underscore num. And that should be an integer data type. So we'll just keep it just like this. Let's go ahead and copy this. And let's run it. See if it worked. Bring this up. Looks like it worked properly. And let's say, go back up. I want to rewrite things that I don't have to. Let's run this. So we now we have this empty table. So we want to insert this information right here. We're going to insert into insert into and then we'll do this right here. So insert into staging two. Now let's try to run this see if it works. And let's run it. And let's select that table. And now we have it. So let's pull this back up and I'll walk through what we just did because I know I'm going quick, but we have so much to cover in this lesson. So we just inserted basically a copy of all these columns. But in this new table, we added one more the row num. So now we can filter, we can say where when you spell that right, where row underscore num is equal to two or we should should say greater than one because some might have multiple duplicates. And there you go. Here are our duplicates. Now we're going to delete these. So all we have to do is come right back down. Where'd I go? Copy this, come right back down here. I'm just going to say delete from. We just did a select statement. I always recommend doing that to identify what you're deleting. Then you change it to delete. And now if we run this, go and I'm actually going to keep this. Let me see. There we go. And let's run it again. And now they're gone. And if we say just the whole table, this looks wonderful. Now this row num is going to be a column at the end that we probably don't need anymore. Right. So redundant column. It adds up extra space and memory and storage and all these other things and processing times. We're just going to get rid of it. That'll be at the very end, I'm sure. So it looks like we are good to go. That's how we remove duplicates. Now there are different ways to do it when you have different columns. Like if you have a unique column over here makes it so much easier. So, so, so much easier. But we didn't have that. So we had to kind of do a workaround. Welcome to the real world. Now let's look at standardizing data. So standardizing data is finding issues in your data and then fixing it. So I'm already noticing right here, it looks like we have a space at the beginning. We could easily just do a trim on this column. Unless I'm, I don't even think I was, I did this when I wrote out all the scripts for this. Let's just do from this table, why am I writing it all out again? We actually want to select the company. And then the, or actually we'll just do distinct company. Distinct company. Let's run this. And if we do a trim around this, let's run this again. And that looks better. So if we do company, company, comma, and then we'll just do the trim. I don't want to, we don't need to do distinct right now. We'll do the company. This just looks better. So we're going to update that. It's super easy. Now if you ran to niche you just a second ago, I may need to help you change that. So if you couldn't update or delete those things earlier, I should have told you this earlier. I apologize. All you need to go is to edit. You just need to go to edit, go to preferences at the very bottom, go to SQL editor, go all the way down to the bottom. And right here, we have safe updates on, if you have this selected, that means you can't update anything. That's a problem. So what you need to do is select this or unselect it like I have it and save it. You may have to even restart your MySQL potentially in order for the changes to take effect. But then you should be able to update that. Now all we're going to do is update this table. And we're going to set, and now we need to come back here and we'll say we're going to set the company equal to trim. Now if you don't know what trim is, or you haven't taken that lesson, trim just takes off the white space off the end. So it took the white space out of here, or off the right hand side as well. So we're going to update this and let's do a semicolon, a semicolon, let's run this, let's select this again. And it was updated properly. So we're already off to a great start. Now the next thing that I want to take a look at is the actual industry. Let's go back, copy this, and let's take a look at the industry. So we'll do industry, and we'll run it. Now if you look in here, there's a ton of different industries. And there's marketing and marketing. Oh, because I haven't done distinct. Please ignore me. Let's do distinct. And there's a ton of different industries in here, transportation, healthcare, consumer. There's a blank one, which we'll take a look at. Aerospace. There's a lot of really unique ones. Let's actually order this. We'll do order by, and let's just do one, which is the first column, we're just ordering on itself. So we have null, we have blank. That's a problem. We'll take a look at that later. But this is an issue, crypto, cryptocurrency and cryptocurrency. These are all the same thing. These should all be on or labeled the exact same thing. The reason we need to change this is because when we start doing the exploratory data analysis, visualizing it, these would all be their own rows or own unique thing, which we don't want. We want them all to be grouped together so we can accurately look at the data. Take a look at any other ones. FinTech and finance that could be the same thing. I'm not 100% sure. I'm not a FinTech person. I think for now, the only one that I'm confident in changing is this one right here, which is cryptocurrency. So let's go ahead and update that. So all we have to do, and we need to actually let's select really quickly where it's like crypto. So we'll say where industry and we want to select everything where the industry is like. And we'll just do crypto. See, I'll start with crypto, right? Yeah, we'll do crypto just like this. And let's run this. And let's just take a look. A lot of layoffs in the crypto industry. Good night. All right, let's find where it's cryptocurrency. Okay, so even this one, it's crypto and I know Gemini crypto, crypto, and then says cryptocurrency. So these should be all crypto. You see how 95% of them are crypto. So we're going to update these other ones. Oh, this one is CRYPT. Is that a crypto? I don't know anything. All right, so we want to update all of them to be crypto. So what we're going to do is we're going to say update layoffs industry two, we want to set the industry equal to crypto, just like this, where, and we can do it a few different ways, we can say industry, I think we can do like, let's try this real quick. I have some of this stuff I don't have planned out, I'm just kind of going with it as we go, which I like better. You know, we kind of we work together on this, we figure these things out together, that's what I like. Then we'll do like crypto just like this, exactly like we had it up here. So if it's like crypto, it should be crypto. Let's try this. Let's see if it ranks, it may not have, I can't remember. Yeah, it worked. Okay, so it updated three rows and that looks correct. Now let's go back up and let's run this. And as we scroll down, they are all the exact same, beautiful, beautiful, beautiful. So if we do distinct industry again, let's get rid of this. If we run this query, and we scroll down crypto is its own thing, beautiful. And it looks great. We can look at those later on how we can update those. But let's keep going. Let's look at our whole table again. And these blanks and these nulls are actually an issue, we do need to deal with them. But my instinct is telling me go fix it. But my, you know, tutorial side is saying, okay, stick with the tutorial, the order that we agreed on. So let's go take a look. We've looked at company, we've looked at industry. Let's just real quick look at a distinct location. Now it's good to look at most of these things, right, there could be small, tiny issues that you just never saw. And we're just going to order by order by one, just do a real quick, just a scan to see if we find any issues. That could be an issue. But that could just be another language, if I'm being honest, I don't know. As I'm just scrolling through here because I want to make sure, because this is not something I had in my prewritten script. This looks pretty good to me. Let's do everything. We'll run this. And now let's look at country. So we'll do distinct country. And let's run this. And let's scroll down. Again, this is sometimes just what I actually do. All right, we got an issue right here. Super common. Somebody put a period at the end, some dingus. And we're not going to judge that person. I don't know who was or who ruined this data set. But that's a problem. So we're going to need to just update that it looks pretty simple. But I'll just say where country is equal to or let's say like, and then I'll say like, United States. There we go. And oops, I want to say select everything. I just want to see where it's at. Oh, geez, there's too many. Let's see if I can spot it. I can't spot it. It looks like they're supposed to be United States, not United States dot. That's the issue. We can easily, easily fix this. And we can probably let's do really quickly. Let's do select, oops, like distinct. And then we'll do country comma. And then we'll do a trim, because we want to get rid of that. That one, we'll do country. Now just doing the trim won't fix it. Let's go to the bottom. So doing the trim doesn't fix it. But here's what you can do. It's a little trick of the trade here. We're going to do something called trailing, which means coming at the end. So what's trailing the period from country? Let's try running this. Scroll to the bottom and it fixed it. So this is a little, a little advanced little tidbit for the trim here. We can do trailing from the country. We're looking for something that's not a white space. We're specifying we're looking for a period. So now we can do is we can say update. We can set the country, update this table. And we'll, oops, and we'll set, what am I doing? What's going on here? We'll set the country equal to, and we'll do it just like this, but we're only going to do it for a country, right? So we'll say is equal to trim. And we'll say where country is equal to, or actually, let's say like, and let me see if I have this. Oh, I don't. Let's, let's just say like United States, like we had it before. Just like this. So let's go ahead and update this after I put my semi colon in, let's run this. And let's run this again. It shouldn't need to fix it anymore. It's just one row. That's perfect. That's exactly what we wanted. Now, one thing that's really important, and this is, you know, this is a longitudinal, it's not the right word at all. Give me a second. I can't, I can't speak and write at the same time. So sometimes I just say dumb things. If we want to do not longitudinal, but time series, that's the word I'm looking for. If we're trying to do time series, exploratory data analysis, time series visualizations later on, this needs to be changed. Right now it's text. And we can look at that by going right, actually, let's refresh this. We're not looking at staging, we're looking at staging two. If we look at the columns and we come down here to date, it is a text column. That's not good if we're trying to do time series stuff. We want to change this to a date column. Now, how can we do that? Let's take a look. So let's do date and let's not actually do it like that. Let's do date backslash. So we're just going to look at the date. Now let's change this because we want to format it how we want to format it with it, which is month, day, year. So how can we do this? Well, there's something that's very, very helpful, works perfectly in this situation and is exactly what we're going to do. It's called string to date. So we're going to string underscore there it is right there, underscore to underscore date. It literally helps us go from a string, which is a text, that's the data type to a date. So it's perfect. Now, all we need to do is pass through two parameters. We have to pass through the column, which is the date column, and then what format we want it in. Now, if you haven't done date formats before, I'm going to just kind of walk you through it while we're looking at it. In order to format this properly, you use a percent sign. So it's going to be a formatting for a month, a lowercase m. A capital M is something completely different. I believe it's spelled out. I need to, we can look at that in a second if we want to actually, and then we can do this right here, and then we'll do another one. So we're formatting it in the way that we want it, but also converting it to an actual date column. So now we want month, and then we want day, lowercase day, we'll do a forward slash and then another percent sign, and then a capital Y, which stands for, I believe, the four number long year. Let's just, let's look at this real quick. So it worked perfect. So we're, it's taking in this format that it's in right over here and converting it into the date format. So this is the standard date format that you're going to find in MySQL. Now let's see what happens really quickly just for fun. Let's see if we do capital M. It looks like that's not going to work at all. Let's do lowercase Y and just formatted it to 2020. I think it took the first two numbers. It looks like, I don't know why it's doing that if I'm being honest. But if we keep it with the capital Y as we should, this looks perfect. This looks exactly like what we're trying to do. So you can mess around with it. It depends on how the data is formatted in your original column when it converts it to the string to date. And there's a lot of different stuff. You should just look up date formatting in MySQL, really interesting stuff. So we're going to update this date column to this, which is our new date column. Let's go ahead and do that. We're going to say update. You guys should be getting used to this by now. That's the whole point is getting used to doing these things. So we're going to set date equal to, and then we're going to put in this right here, the string to date. Go ahead and do this. And let's run it. Make sure it worked. 2,355 rows. It looked like it did every single one. But let's go ahead and get rid of this. And let's run it. And it looks like it worked perfectly. Now there were some nulls. It looks like, and that'll be something we have to look at later when we talk about nulls. But overall, I believe this looks proper. Now if we refresh this, to refresh, come down to the date, you'll notice it is still a text. It's date. It's called text. But now it's in the date format. Now that's really important. And maybe I should have done that earlier, if I'm being honest, tried to convert it to a date column. It wouldn't work. It would give us an error. You just have to trust me on that one. But now we can do it where we can change it to a date column. So let's do alter table. Now only do this, never ever do this on your raw table. Only do some things like a staging table. Because we're about to completely change the data type of the actual table. So we want to change the layoff staging to, and then we're going to come down here. I'm going to say modify column. Now what column are we modifying? It's this date column. There we go. And we want to change it to what data type? A date. And am I spelling this right? Yeah, I just need a semicolon here. I've never seen an error. I always get a, just look for the semicolons. So let's go and run this. And let's refresh. See if it worked. And the date was changed to a date, which is perfect. That's all we wanted to do. Just to make sure we were doing what, or we'll set ourselves up later in the future really well. Let's look at our table. All right. This is very good. So we fixed a few, just issues with the company. I believe something with the industry out of the cryptocurrency. We changed the country. I'm just going to go ahead and tell you right now, this one, we're not going to look at until we look at the nulls and whatnot in just a second. So we're not looking at that one yet. And then we have this extra column that we've done. So we've done a lot so far, but the next thing in the process, step one was removed duplicates up to with standardization. Step three is working with null and blank values. Now this is going to happen. You're going to have nulls and you're going to have a blank values in here. I did somewhere. It's just going to happen. And so we need to think about what we're going to do with that information, whether we want to make them all nulls, make them all blank, try to populate that data. Let's see what we're going to do. So let's start off with the total laid off. We'll just do where total underscore laid underscore off is null. So in order to look at the null, we say is null. Let's try equal to null. It's not going to give it to us. We have to say where it is null. So we have these values. These are completely null. And there's quite a few of them. But remember, this is also useful information. But if they have two nulls, that probably is pretty useless to us. That's something I think we'll take a look at in a little bit, actually, we'll say, and we may save this query, percentage laid off is null. So if they're both null, like these, these are all, I believe, fairly useless to us. These might be ones that we remove. So let's actually look at this. In step four, we look at removing rows and columns. But one thing we should take a look at, I remember this industry, let's do industry, do distinct. This industry had some missing values. And let's take a look at that. Okay, so we have a missing value and we have a null here. So let's look at this query. Let's say where industry is null, or the industry is equal to a blank like this. We'll select everything to run this. All right. So it looks like there are a few that are blank. Now, what we can try to do is see if any of these have one that's populated. Let's take Airbnb, for example, let's search for this really quickly. And this is 100%, you know, it's just helpful. It's really, really helpful to be able to populate data that is pop, populatable. Is that a word? Let's try it. So we'll say, select everything, I just want to do where, spell that right, where company is equal to, and let's do Airbnb. There we go. Let's run this. And it looks like we have this one right here. So for example, these, whether they have them or not, we're going to try to populate these. If this ballies or Carvana or Jewel had multiple layoffs, these ones should, if these ones aren't blank, if they have one that's not blank, we should be able to populate it. For example, not the one I was trying to do. If we look at Airbnb, this one has travel. So we know this is the travel industry. So we can populate this with travel. Again, we want this data to be the same. So if we're trying to look at, you know, what industries were impacted the most, this row isn't going to be affected or this row won't be in our output because it's blank. We want that to be traveled to represent the data properly. So we want to update it. So if this one has travel, we should be able to update this row with this travel right here. So let's see how we can write this. And let me give myself some rows right here. All right. Now what we're going to need to do is try to do a join here. So let's try writing out in a select statement, and then we'll just change it to an update if it works. So we're going to select everything. And we're going to do this from staging two, from staging two, and we'll call this st2. And then we'll join on itself, because what we're going to do is we're going to check in this table, does it have one that is blank and not blank? If so, update it with the non blank one. That's essentially in layman terms what we're trying to write, but writing it out could be a little bit more difficult. So we're going to join on itself. And we'll call this, let's actually call this table one t1 and t2, because they're the exact same table. And we'll do this on, and we're going to say t1.company is equal to t2.company. So the company has to be the same. That's important. And we probably should do the location is the same as well. Now we'll do and t1.location is equal to t2.location. I'm imagining, you know, there's another Airbnb in like South America somewhere that's called Airbnb, but you know, I'm just imagining a scenario, right, where we have to think about different use cases rather than just large companies. So those other ones, they may have ones that are in different locations. We don't want those. We don't want to change them if they're not the same. So these are the same. Now what we want to find is we're going to say, oops, we want to say where, and we'll do t1.industry is null. And then we want to check that t2.industry is not null. We'll say and t2.industry is not null. And let's just run this. Let's see if we get anything. So let's think this through because we got nothing in our output. We're selecting everything. We're joining on the company and the company and the location where t1.industry is null and t2.industry is not null. Let's just get rid of this for a second. I just want to see if this changes anything. It doesn't. And it's possible actually that instead of doing is null, we could do or, and I'm glad we're walking through this, we can do or is equal to blank. And let's try running this. There we go. Okay, so it looks like there's jewel, Carvana and Airbnb. These ones all have industries where it's null or blank and an industry is not null. So that's really good. Now, if we scroll over, see the industry here, this is our t1. This is our first table. If we scroll over, I bet we'll see the t2 industry where it's not null. Let's scroll over. And here's our industry. We have travel, transportation and consumer. So this worked exactly as we had hoped. I can even pull this up here just to show and I'll show you a little bit easier what that's doing. And we'll do t2.industry. This is kind of like what we're trying to do. So if it's blank, this one is going to be populated into here if there is one that is not blank. So that's essentially what we're going to do. Let's write the update statement and we're going to see if it works this, we have to translate this to an update statement. So we'll do update. And we're going to update this right here. So we'll say update t1. And then we'll do the join right there. And now we have to do a set statement. So we'll set the t1.industry equal to, and I'll just copy this t2.industry. I don't like writing things out. Then we say where. So we do this. It's like that. And it's out of semicolon. Let's confirm. So we're updating this table t1. We're joining on t2. Or the company is the exact same. We're setting t1.industry equal to t2.industry. So the t1 should be the blank one. So where the t1.industry is null or blank. And t2.industry is not null. Let's go ahead and run this semicolon. See if there were about three updated. Yep. Rematch zero was affected, though. Let's go take a look. We have to run this query. Looks like those are still null. Let's run this. That one is still blank. Now let me think here. I'm trying to think of why this didn't work. And I want to walk you through my thought process. It is possible that because these are blanks and not nulls that it's not working. And I will say that it's something I typically do where I set these blanks to nulls first. So let's actually try that and see if that changes anything. I'm just going to update this. I'm going to say set the industry equal to null. And we'll say where industry is equal to blanks. So we're just changing it to null where it's blank. Let's try this. Let's go back down here to our select statement. So these are all nulls. Okay. I think this is now going to work. Because now you can see on this side, there's only one option for it to populate it. Before there were those blanks, which I think was causing the issue. Let's get rid of this part because now we have no nulls. And now let's try running this. We're workshopping this on the fly, guys. Let's see. Three rows affected. Hey, oh, all right, let's go see if it worked. Let's run this query. And we have none. That's perfect. Let's look at Airbnb. Hey, all right, all right, ran into some issues. But we worked through it. We figured out the issue. And now it's working properly. And we can even come back up here to select everything. And it looks like Baileys is the only one that still has a null. Let's look up Baileys real quick. And we'll say we're company is like Baileys. Let's run this. Yeah. And there's only one. So there wasn't another row. All these other ones like Carvana and I remember the other Jewel and Airbnb. Those ones had an extra row. They did multiple layoffs. This one only did one layoff. So we don't have another populated row where it's not null. It can actually populate the null row. That's really all that happened. That's why that worked that way. So I'm really happy that worked. Awesome job, guys. I was starting to question myself. Do I even know how to use my SQL? I mean, I was really starting to question my abilities here. Take a look. I think that is all we're going to do for populating null values. Now, here's why things like total laid off, percentage laid off, funds raised, how are we going to populate that with the data that we have here? I don't believe we can. Now, we might be able to populate oops, we might be able to populate some of this if we had the company total, like if we had the original total before laid off, because then we could do calculations like oh, these companies went completely out of business. That's not good at 1%. That means 100% was laid off. But if we had, you know, the total, they had 50 employees and 100% were laid off, we could populate the total laid off. Whoops, did it again. We could populate the total laid off by saying if this is 50, 100% was laid off. That's 50 people were laid off. We don't have that data. So we can't go and populate it. I don't believe funds raised, we might be able to scrape some data from the web and populate this. But that's a totally different thing, not part of this project. So I think the data cleaning for the null values and blank values, I think that's going to be done. It's possible that the stage could be the same. And if you want to go check, you can, but we're going to keep checking along, because we want to remove columns and rows that we need to. Now, if you remember, we were looking at this before, did I save that query? Let's go look. Here we go. Bring this down to the bottom. These rows, let's really take a look at these and think about if this is going to be helped us. What we are trying to do with this data in the near future is we're not just trying to identify a company or a location that had layoffs. And maybe we are. Maybe we are trying to do that. But these have no layoffs and no percentage laid off. So in my opinion, I don't know if these laid off any at all. I believe that we can get rid of these. Now deleting data is a very interesting thing to do. You have to be confident. Am I 100% confident? No, not really. But I'm confident enough to know that what we're about to look at in the next one, we're going to be using these total laid off a lot, percentage laid off a lot when we're looking at, you know, actually querying the data and doing some exploratory data analysis. So we're going to use these a lot. I don't think these, I'm not even sure if these are accurate. I'm not even sure if they actually did have a layoff. It's saying they did, but it doesn't show if they laid off any. So can we delete this? Yes. Should we delete this? It's iffy. I'm not 100% if I'm being completely honest. And there's a lot of rows like that. This is, this could be like 100 or so. I mean, I could run a query and run it, but I don't want to, it's not a big deal. The point being, I don't think we need this information. So we're going to get rid of it. If nothing else, this just showed you can do it. So now we'll say delete and then we'll do from here. There we go. So now we're going to delete these rows. Let's try to select them again. And they are gone. So we deleted the ones where the total laid off was blank and the percentage laid off was blank. We just, I can't trust that data. I really can't. And let's go back down. I'm right here. semi colon. So I sometimes I have to walk myself through these things. All right. This run them. I mean, come on. We don't need that anymore. Let's get rid of it. So what we can do now it's a little bit different syntax. We want to drop a column from this table. So we have to do the alter table again. We're going to alter table lay off staging two. And then we're going to say, drop column and row underscore num. If we run this, then we run the table again, should be gone. And it is. So this is it. This is our finalized clean data. Now in the next project, we're going to be doing exploratory data analysis on this cleaned data. We're going to finding trends and patterns and running complex queries. It's going to be phenomenal. I'm super excited about it. And I love this data cleaning one. I made some mistakes. I'll be the first one to admit, but cleaning data is not always a straightforward thing. You know, you have to kind of mess around with it, figure it out. And, you know, that's what we did. Whoa, took a while. So just to recap, we removed duplicates. We standardized the data. We looked at the null values or blank values. Then we removed any columns or rows. So we did a lot. If you go back and you actually scroll through here and look at some of this code that we wrote, it's not super beginner stuff. So if you're following along with these things and you are getting this project, this is a fantastic project to put on your portfolio. I myself would put this project on my portfolio because it's a very, very relevant thing. So I hope this was helpful. I'm just going to keep scrolling while I talk. But I hope this was helpful. I hope you learned something. We did a lot of different things that we didn't even do in the lessons, which I like doing because you can't cover every single aspect of my SQL in lessons, right? Sometimes you just got to get in there, get into the nitty gritty, clean some data and you'll find or discover new things, try new things. And now we're getting to the bottom and awesome work. Awesome, awesome, awesome work. This is an A1 project. I think this should be in everyone's portfolio. If I don't see it in your portfolio and you send it to me, I'm going to say it's the garbage portfolio. So this is a good one. So with that being said, thank you guys so much for watching. If you made it all the way to the end, you're still listening to me. Awesome work. Really awesome work for real. You're just following along with a tutorial. That's what it feels like. But by the end of this, I just know you're learning a ton and you're trying new things and you're really pushing yourself beyond just simple tutorials. So trust me when I say this is not easy. Not everyone was able to make it to the end. So great work getting here. So I will see you guys in the next project when we actually explore this data and walk through a lot of different ways to do that. So thank you again for watching. If you like this, be sure to like and subscribe below. I put out tons of content about all this stuff and I absolutely love it. It is definitely one of my passions in life. So go ahead and do that and I will see you in the next video.