 Hello everybody. Today we're going to be cleaning data using pandas. Now there are literally hundreds of ways that you can clean data within pandas, but I'm going to show you some of the ones that I use a lot and ones that I think are really good to know when you are cleaning your data sets. So we're going to start by saying import pandas as pd and we're going to run that. And now we're going to import our file. So we're going to say data frames equal to pd. So that's pandas read underscore and we actually have this in an excel file. So we'll say read oops say read excel do an open parentheses and we'll do our and then we'll paste the path right here. And now we're just going to call that variable. So we'll call data frame and we'll actually read it in and look at the data. So let's scroll down here and let's take a look at this data frame or this excel file that we're reading in. So we're off the bat we have this customer ID that goes from 1001 all the way down to 1020. We have this first name and everything looks pretty good here except in this last name column looks like we have some errors we have some forward slashes some dots some null values. So definitely going to have to clean that up because we don't want that in the data. We have a phone number and it looks like we have a lot of different formats as well as nays not a number just lots of different stuff. So we're going to standardize that so clean it up and then standardize it to where it all looks the same. We also have address and it looks like on some of these we just have a street address but on some of the other ones we have like a street address and another location as well as a zip code and some of them. So we'll probably want to split those out. We have a paying customer which is yes and no and some of those are not the same so I have to standardize that. We have a do not contact kind of the same thing as the paying customer and we have this not useful column which we'll probably just want to get rid of. Okay so the scenario is that we got handed this list of names and we need to clean it up and hand it off to the people who are actually going to make these calls to this customer list. So they want all the data in here standardized and cleaned so that the people who are making those calls can just make those calls as quickly as possible but they also don't want columns and rows that aren't useful to them so things like this not useful column we're probably going to get rid of and then ones that say do not contact if it says yes we should not contact them we probably will want to get rid of those somehow. So that's a lot of what we're going to be doing to clean this data set. Normally the very first thing that I do when I'm working with a data set most of the time except very rare cases when you're actually supposed to have duplicates is I actually go and drop the duplicates from the data set completely. All you have to do for that is say df dot drop underscore duplicates. So they make it super easy for you let's just run it and up here is our original data set we have this 19 and 20 and those are obviously duplicates they have the exact same data it's just a duplicate row that we need to get rid of if we look right down here we no longer have that 20 we now just have one row of Anakin Skywalker and of course we want to save that so we're just going to say df is equal to and df so now it's going to save that to the data frame variable again and now when we run this our data frame now does not have any duplicates it's definitely one of the easier steps that we're going to look at things are going to get quite a bit more complicated as we go I'm starting out you know kind of simple so that we can kind of get a feel for it and then we'll start getting into the really tough stuff so the next thing that I want to do is remove any columns that we don't need I don't want to clean data that we're not going to use so if we're just looking through here you know they may need you know first name last name phone number for sure address might give them some information of where they're calling to or time zone so we want that this not useful column looks like a pretty good candidate to delete and it's very easy to do that we're going to go right down here and we're going to say df dot drop we'll do an open parentheses drop just means we are dropping that column and we can specify that by saying columns is equal to and then we'll paste in that column that we want to delete so let's run this and see what it looks like and it literally just drops that column exactly like we were talking about and no longer has that column again we want to save that we can always do in place equals true if you follow this tutorial series you can always do in place equals true and that'll save it as well but just for our workflow most of the time I'm going to assign it back to that variable just for keeping it the same really quickly I wanted to give a huge shout to the sponsor of this entire panda series and that is you to me you to me has some of the best courses at the best prices and it is no exception when it comes to pandas courses if you want to master pandas this is the course that I would recommend it's going to teach you just about everything you need to know about pandas so huge shout to you to me for sponsoring this panda series and let's get back to the video now let's kind of go column by column and see what we need to fix and we'll start on this left hand side this customer ID to me looks perfectly fine I'm not going to mess with it at all the first name at a glance also looks perfectly fine I don't see anything wrong with it visually which is a good thing although sometimes that can be deceiving and it can cause errors down the line but we're not going to assume that there are errors in here now let's look at this last name now the last name obviously I'm seeing some obvious things things that we talked about when we were first looking at this data set we have this forward slash which we definitely need to get rid of we have null values so not a number right here we have some periods as well as an underscore right here so all those things I think we should clean up and get rid of it so that when the person is making these calls you know it's all cleaned up for them so how are we going to do that we can actually do this in several different ways but let's just copy this last name the first one I'm going to show you is strip and write it kind of like this we'll say data frame and then we'll specify the column that we're working with because we don't want to make these changes or strip all of these values from everywhere we only want to do it on just this column if we do this and we don't specify the column name it will apply it to everywhere so if we're trying to do these yeah let's say buh buh bum these underscores maybe that would mess with something else in another column and we don't want that so we just want to specify just this last name so let's go last name dot string dot strip now what strip does and let's see if we can open this up really quickly we can't but what strip does I was just I was hitting shift tab in here to see if it could bring up you know some of the notes on it but what strip does is it takes either the left side or the right side well l strip takes from the left side r strip takes from the right side and strip takes from both but you can strip values off the left and the right hand side and we can specify those values now for what we're doing in this column we can just use strip because as you can see this forward slash these dots as well as this underscore are all on the far sides if there was a value like swan underscore son the strip wouldn't work at all because it's not on the outside of the value of the word so we can use strip I'll also show you how to use replace and replace is another really good option for things like this but let's start with strip and just see what it looks like and see if we can get what we need done so let's just run this for now see what happens so it looks like nothing has changed because again we're not specifying any specific value just by default it's only taking out white space so like spaces that shouldn't be there that's what it does by default now we can specify within this exactly what values we want to take out so let's go ahead and do that let's say left strip and let's try to take out these dots real quick so we're just going to do a parentheses dot dot dot now let's run this and see what it looks like for this one Potter it is now gone so those three dots were there before let's just show it so they were there and then when I ran it like this now they're gone that's what the L strip does it takes it only off the left hand side now we can also do a forward slash so we'll do something like this and you'll get rid of the white but as you can see now we aren't taking out these three dots so they're still there now is it possible to do something like this where we put these values inside of a list let's try it so we'll say just like this one two three let's run it and no it doesn't this L strip actually sits within the realm of regular expression so if you've ever worked with regular expression you know it gets very complicated very complex so you want to keep it kind of simple especially with these values we're just taking a few out so what we're going to do is we're going to do dot dot dot and we're taking out one by one now in order to save this because we want to save this we want to take out that value we don't just want to say data frame equals because that would be very bad what this would say is now this data frame is only equal to these values that we're seeing right here we want to only apply it to this column so we're going to go like this so now when we do it and then we call the entire data frame it's only applying this to this one column the last name column so let's run it and now when we go down to order right here it's cleaned up so we're going to do the same thing but for those other values and we'll do it just like this we'll do a forward slash and it's a left strip and then we'll do I'll do the left strip on this underscore to just to show you that it won't work and then we will go on from there so it's not pulling it because we're looking at the left hand side only we need to use our strip so now let's use our strip and now that looks perfect there's no underscore so that's how you can use strip for either the left side the right side or just strip by itself which covers both sides now I showed you all of that because I am going to show you a different way to do it and I apologize because I somewhat lied to you earlier um let's run this right here actually we're just going to pull it in like this we're going to remove the duplicates again bear with me we're going to drop that column and then now we're sitting with that data frame again with those exact same mistakes I just wanted to reset it for a second there is a way uh that you can do this and I just wanted to you know kind of show you how you can do it you can do this right here and we'll say so we're now again we're just looking at this column just this column and we're using strip and let's get rid of r because we want to do apply it to everywhere you can input all of those values individually and it will clean it up so let's say we want to get rid of numbers we'll do one two three then we could do the dot so that's going to be for our period or for our dot dot dot potter we could also do the underscore and we can do the forward slash so we put it all in one string right here now let's take a look at this we'll get rid of this really quickly now let's take a look and all of them were removed I showed you how to do it before because that's at least how my mind would think about it I think oh I can put it in a list and run it through this L strip or this right strip and it would work but that's not how strip works you have to kind of combine it all into one value so uh yes I deceived you I apologize but now when we call data frame and we assign it to that column so the last name column or assigning what we just did to this last name column everything should look perfect and it does so our customer ID first name last name are all cleaned up now we're gonna come to a much more difficult one this is probably if I'm being honest the hardest one I said we were gonna work up but this is probably the hardest one of the whole video working with phone numbers and look at all these different types of of formats I mean it is um it's not going to be fun and imagine you know there's 20 000 of these you can't just go and manually clean those up you need something to kind of automate that so that is what we're going to do so let's go right down here we'll copy the data frame and I'm going to pull it right here so now we need to clean up this phone number what we want is it all to look exactly the same unless it's blank and we'll keep it blank we don't want to populate that data but we want all of them to look exactly like this one and what we're going to do is right off the bat we're going to take all of the non-numeric values and just completely get rid of them strip it down to just the numbers so this one two three dash six four three or forward slash will just be the numbers same with these bars and these slashes and everything all of these will just be numeric then we'll go back and reformat it how we want to format it which will look exactly like this one um but we just want to do it for the entire column so let's go right up here and we're going to try and replace for the first time so let's do phone number do it just loops that's not what I wanted so we're going to do a bracket let's say phone number dot string dot replace just like we did before now we're going to use some regular expression in here and I'll kind of do a really high overview although I'm not going to dive super deep into the regular expression then we're going to do a parentheses and within there we're going to do a bracket um I can't remember what this is called is it called a carrot I think it's called a carrot but I'm just going to call it that it may not be correct but I think it's an upper arrow so it's an upper arrow a dash oops a dash z a dash z and then zero dash nine now at a super high level what that character that first thing is doing is saying we're going to return any character except and then we specify anything a to z a to z your upper or lower case and then actually I think this should be like this a to z uh and then zero to nine so any value like a b c one two three those are not going to be matched it's going to rematch all of them except these values and then we're going to replace them by saying comma and we're going to place them with nothing so this is just an empty string so literally we're taking everything that is not an a b c a one two three so a letter or a number we're replacing all of that and then we're replacing it with nothing so let's run this and see what it looks like and it looks like that worked properly now we do have this n a because we had an n dash a for I don't remember maybe that was creed breton um but it works for basically everything else we're going to go through the entire process and then at the end we'll remove any values we want them to just be completely null we don't want them to even see n a n and wonder what that is we just want it to be blank and we'll do that at the very end so now that we know that that worked let's assign it we'll do d f phone number is equal to and then we'll say data frame and this looks a lot more standardized than it did before already but now what we want to do is try to format this um and I've done this many many times I always use a lambda you can definitely use a for loop I just I don't do it that way myself so I'm going to show you how to do it using a lambda let's get rid of this and we're going to say d f phone number we've already done that I'm just going to get rid of it now we're going to say d f phone number then we're going to say dot apply we'll do an open parentheses and then this is where we're going to build out our lambda so we'll say lambda x colon now this is where we're going to kind of format it so what I want to do is I want to take the first three strings one two three then I want to add a slash and then the next three strings add a slash or a dash and then that be the value that's returned so it's not super difficult we're just going to do x then a bracket let me get rid of that an x and then a bracket and then we want the zero to three so it goes zero one two so zero one two it doesn't include the three it goes up to three so zero one two that's our first three values then we'll do plus and do a quote and do a dash so this is our first kind of sequence and I'm just going to copy this we'll do plus and instead of three we're we are going to start at three because that now it's inclusive so we're going to go from three and we're going to go all the way up to six so it should be three four five our next three values and then we have a dash and we'll copy this and we'll say plus and now we go from six all the way to ten now let's try running this and as you can see we get an error now I already know what the error is float object is not subscriptable which means we're trying to basically look at it like a string right now it's not a string it's actually a number so let me get rid of this for just a second I'm going to show you what it's talking about so right now we have values that are floats and values that are strings or not even a number so we have values that are strings or not a number so if we want to actually look through it like kind of like indexing if we want to do that they all have to be strings so we need to change this entire column into strings before we can apply this formatting now when I was creating this if I'm being honest my first thought when I was doing this was to do it like this string df phone number um let's just run that this is what the values look like um and I don't remember why or why it was doing this I can't I can't remember but I looked into it quite a bit and I was like oh I need to apply this string converting it to a string on each value not the entire row or not the entire column so how we can do that is actually fairly easy because we've already done a lot of the heavy lifting we're just going to copy this and we're gonna say x so string of x and again lambda is like a little anonymous function so you can do this by saying for um x in this uh column we could do a for loop and then say for every x it equals the string of x and then it changes it to a string but a lambda just does it a lot quicker um so we're going to say so let's do that really quickly and all of our values look exactly the same and that's how we want it so we're just going to copy this apply it good and now we're going to take this and we're going to run this again just ignore all my commented out stuff pretend I don't have that um so now when we run this it should work there we go now if we look at these numbers one two three dash five four five dash five four two one and it does that for every single one where there's values even when there's na n or na it's still adding those values but we expected that so let's apply it says equal to and then we'll look at the data frame and this looks almost exactly what we're hoping for we just need to get rid of these so this na n dash dash and this na dash we need to get rid of those and that is super easy to do we're just going to say so now that we've done it and we'll comment that out we'll say df and let's copy this ignore the messiness I do apologize for that it's very messy but if you're following along with me you get what we're doing so dgf phone number so only on the phone number say dot string dot replace no parentheses now we can specify this value so we want to take this exact value and replace it with nothing and let's just see if that does work it does it does now we have these na's and so let's actually I'll paste that right down here we're gonna do this is equal to and then we're just going to take this entire string put it right here and put this value as our what we're looking for and then replacing and then when we call that data frame it should work properly and it is perfectly cleaned so we have every single value all the exact same they don't have different characters or different you know formatting and we got rid of all the ones that we don't have or don't need all the ones that were just random values so this column is now completely cleaned up again definitely one of the more difficult ones ones that I've done a thousand times I've had to work with a lot of phone numbers and stuff like that this one does get very tricky especially if you have like a plus one which is like an area code that can get tricky as well but this is on a kind of a high level this is how you can do that and it's pretty neat how you can actually you know clean up and standardize those phone numbers so let's go right down here let's run it the next thing that we're going to look at is this address now let's just pretend that the people who are on the call center want all these separated into three different columns they can read it easier see what the zip code is where they live you know whatever they want it for let's just say we want to do that this is you know again for this use case it may not make sense but you have to do this I do this all the time um you need to split those columns now luckily all of these things are separated by a comma so we can specify that we're going to split on this column and then we'll be able to create three separate columns based off of this one column which is exactly what we want and we can name it as well and we can do that very easily by using this split so we're going to say df and we want to specify oh geez not again so we want to specify that we're looking at the address then we're going to say dot string dot split and we'll do an open parentheses now the very first value that we need to specify is what we're splitting on so we want to split on the comma so we want to specify that and then we need to specify how many values from left to right it should look for now we'll just start with one and then we'll go from there let's just see what this looks like so it doesn't really look like it did anything let's do two well let's go back to one and let say expand equals true when we expand it it's actually going to separate it I believe okay so we're expanding we're now we're only doing this with one comma so we're only looking at the very first comma and splitting it but in some of these well just in one there is an additional comma so we should do it up to two let's do this okay so now we have three columns if we just save it like this it's going to give us these zero one two these basically these indexed values for these columns and we don't want that we want to specify what these actually are and we can do that by saying df and let me just do is equal to we'll do bracket and then within there we're going to specify our list so we have three of them that we have so I'm going to do the first one this is the street address so we'll say street address the next one is and it's shire is not a state but these all are states I'm just going to say state and then the very last one that looks like a zip code so we'll say zip and we'll do under start code in fact I also want to do street underscore address um so what this is now going to do is these three columns are going to be applied to these three names and they'll basically be appended it doesn't replace the address we're not saying df address equals the df address we're not replacing it we're now creating different columns so let's run it and then let's also call it so they're right over here on this right hand side I couldn't see them at first but it did exactly what we needed it to do so now if we wanted to at the very end if we want to we're not going to we could just delete this address and keep the street address the state and the zip code another really common thing that you can do this happens often again with like first name last name we'll have alex freeberg but it's alex comma freeberg or alex space freeberg and you can separate those out into different columns now the next one that we want to look at is this paying customer and the paying customer and do not contact are very similar in the fact that it's yes no ny yes no ny and so let's go right on down here and we're gonna say df dot and we want to just replace these values as all yeses are all nos but just with the same formatting just to keep it consistent so let's make anything that's an n into a no anything that's a y into a yes I like it spelled out so let's change anything that's a yes into a y anything that's a no into an n that's usually how I do it just saves on data because it's less strings although it's you can be often very minimal um but let's specify the paying customer we'll suit say df bracket paying customer then we'll do dot string dot replace so now we're just going to look for those specific values so if it's a y a capital y then we'll say yes now let's run it and now we have no more y's we now just have yeses although now these are yes yeses okay we don't want to do that let's do if we're looking because it's taking it's literally looking up here and saying okay there's here's a y um let's change the that's changed that y into a yes so now it's doing yeses uh we don't want that so let's look for the yes and change into a y now when we run this that looks a lot better um so we'll do df paying customers equal to and then we'll copy this we'll do the exact same thing no and and then let's call it and now that entire column looks really good except for that value right there but i'm gonna leave that because i'm just going to apply it to the entire thing all at once to get rid of those at the end instead of just going column by column and then it's literally going to be the exact same thing so i'm not even gonna scroll down whoops i'm just going to put it right up here because this is the exact same thing i'm gonna save us all some time and when we run this this looks exactly like what we're looking for again some not a number of values but we can get rid of that in just a second by doing our place over the entire data frame and that is basically the end of cleaning up individual columns now let's go right down here we're gonna say df dot string dot replace and then we'll first do these values oops so we'll do replace let me do that there we go and replace that with nothing and let's just see what it looks like oops data frame object has no value string well that's because we were looking at columns before yeah i think i just need to get rid of this string we're not looking at we're just doing it across the entire data frame now let's try that okay that worked appropriately and we'll just say data frame is equal to and then we'll copy this and we'll do the n a n as well and we'll do and now when we do this it is not going to replace these because these aren't actually a value because we're looking for that string we actually need to use and i i completely forgot this i'm not gonna lie to you um let's get rid of this to get rid of those values because it's literally not a number there it is technically empty um i forgot we can do um or we could not even specify it we'll do df dot fill na so we're gonna fill these values if there's nothing in them we're gonna fill it and we're gonna say blank and when we run that every value that doesn't have something in it is going to show up blank even over here where we only had a few all of them throughout the data frame if it doesn't have a value it is now blank so let's apply that and we'll run this and now all of our cleaning we're actually cleaning up the individual columns is completely done we've removed columns we've split columns we've formatted and cleaned up phone numbers we've also taken values off of first name or this last name column and then we formatted and just kind of standardized paying customer and do not contact now they also asked us to only give them a list of phone numbers that they can call so if we take a look some of these do not contacts are why which means we cannot contact them and then there are some that don't even have phone numbers so we don't want to give the people the call center numbers that are people who don't have numbers so we want to remove those now there's a few different ways that we can do this but let's start with and we'll just go by do this do not contact it seems like the most obvious one now if it's blank we want to give them a call we only want to not call them if they've specifically said we cannot call them so if it's why we're not going to call them so what we need to do it's not anything like this we probably need to loop through this column and then look at each row that has a value of this and drop that entire row and we probably will need to do that based off this index instead of doing it based off just this column that may not make sense but let's actually let's actually start writing it so we'll do 4x in and we need to look at our index so we're just going to do let's do in df.index and we'll do a colon enter and then we want to look at these indexes how do we look at these indexes we use lock that's going to be df.loc and then we need to look at the value which is this x right here so each time it looks at the index it's looking at the value but we want to look at the value of this column do not contact I don't know if I copied this before let me copy it we only want to look at the value in this one column if we didn't it would look at a different value so we don't want that so we're looking at just that value if it's equal to y so if this value is equal to y then we want to drop it so we actually need to say if so if this value x in this column is equal to y then we want to do df.drop and then we'll say x and I think we have to say in place equals true here otherwise it won't take effect um otherwise I have to say like df is equal to df.yeah and I don't I don't want to start messing with that let's just do in place equals true um and let's see if that works I I can't remember if this is going to work or not invalid syntax okay nea colon and now let's try to run this okay okay yeah if we look at our index we can already tell that there are ones missing the one the one is missing the three is missing uh let's see and the 18 is missing so we already got rid of those values and you can you can see that there's no y's in here anymore which is really good we can if we want to and we probably should we should probably populate that um really quickly um let me just go up here really quick I'll copy this we probably should populate that and I didn't plan on doing this so um if it's blank oops it's blank give it an n and we want to attribute it to do not contact do not contact whoops let's see if that works and we probably need to do dot string let's just see if it works so if it's blank do okay I don't know why it's giving us a n n maybe there's maybe need to strip this or something uh okay never mind let's not do that but now we basically need to do the exact same thing for this phone number um because if it's blank we don't want them calling it um so we can copy this entire thing right down here and but now we're looking at phone number so now we're looking just at the values within phone number and we only want to look at if it's blank so if it literally has no value we want to get rid of it let's run this and see if it works again it should good and now our list is getting much smaller so you can see in our index a lot of um those rows were removed and okay good actually this worked itself out because these all have ends um so right now we're sitting really good everything looks really um standardized cleaned everything looks great I might drop this address if you want to you can drop this address but besides that this is all looking really good this paying customer doesn't uh the yes and no's aren't really anything um now we could and we probably should before we hand this off to the client or the customer call this we probably should reset this index because they might be confused as why there's numbers missing or you know they might use this index um to show how many people they've called her I don't know something like that so let's go right down here we're gonna say df dot and then we'll do reset underscore index and let's just see what this looks like um it does work but as you can tell it didn't get rid of that index completely actually took the index and saved that original one we do not need to save that whoops let's put it right in here now we're just gonna do drop equals true and when we do that it just completely resets it drops the original index and gives us a new index and that is what we want let's do df equals and this is our final product now one thing that I I you definitely could have done here um and I made this a little probably more complicated than it needed to be um that was just how my brain was working at the time when I'm you know typing this out we could have done df dot drop and a which is literally going to look at these null values before we couldn't do that with this one because these aren't we're not looking at na we're looking at y's so we couldn't do that but because we're looking at null values we could have also done drop and a um and done subset is equal to and then done it just on this phone number and then done like this done in place equals true so we could have also done this and then said df equals um I can't I mean I can run it it's just not going to do anything I can run it on the different column but that'll mess everything up but this is another way you can do it and I'll just save it in case you want to I'll say another way to drop null values there you go and that'll just be a note for us in the future but this is our final product it looks a lot different than when we first started I mean we had mistakes here completely different formatting in the phone number different address everything that we just talked about and this looks just a lot lot better and you can tell why it's really important to do this process because again we're working on a very small dataset I purposely you know created this dataset with these mistakes because you know when you're looking at data that has tens of thousands a hundred thousands a million rows these are all things that are going to be applied to much larger scale and you won't be able to as easily see them um you'll have to do some exploratory data analysis to find these mistakes and then you're going to need to clean the data or doing it at the same time when you're exploring the data so you'll clean it up as you go but these are a lot of the ways that I clean data a lot of the things that you can do to make your data just a lot more standardized a lot more um visually better and then it really helps later on with visualizations and your you know actual data analysis so I hope that that was helpful I know that this was a long video I'm sure it was but I hope that you got something out of this you learned some of the techniques on how to actually clean data in pandas if you like this video be sure to like and subscribe check out all my other videos on pandas as well as python and I will see you in the next video