Thank you. Today I'm going to be talking about data cleansing with pandas. Data cleansing is not everyone's favorite task: it can be very manual, very time-consuming, and even somewhat tedious. But the tools in pandas give you a really powerful framework for cleaning data, and I think once you learn those tools and accelerate your data cleansing productivity, you might even start to have a little fun. So I'm going to run through some of the fundamentals of working with data in pandas and we'll see how those can help you be more effective. The first thing I'm going to talk about is the DataFrame object and how you navigate it in pandas. Then we'll talk about slicing, splitting, and joining DataFrames. After that we'll get into the different ways you can do text transformations with pandas, and we'll talk a little bit about missing values and the various ways you can fill them in and deal with them. Then we'll look at merging DataFrames when you want to join two different tables together, some of the options you have there, and techniques for bringing data together. And finally we'll look at grouping by values in a DataFrame and working with duplicates. So, getting started with the basics: there are a number of different ways pandas lets you get data into a DataFrame, which is the fundamental object. There's a read_csv function, which you can use to bring in a flat file; that's probably a common way you'll get data into pandas. I've created some fake data here and put it into this variable df, and you can see in our very first item we've used the head method, which shows us the first five rows of the DataFrame. You'll see throughout the examples that I've tacked on the head method just to make things a little easier to see. So that's just going to show us the first five rows. But there are a couple of things to point out about the DataFrame.
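The talk's dataset isn't available, so here is a minimal sketch of the read_csv and head pattern described above, using a few lines of made-up CSV text in place of the flat file:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the flat file used in the talk
csv_data = """account_name,city,state,classification
Arnold Group,Albany,NY,b
Medina-McCullough,Trenton,NJ,a
Smith LLC,Hartford,CT,b
"""

# read_csv loads a flat file (or any file-like object) into a DataFrame
df = pd.read_csv(io.StringIO(csv_data))

# head() shows the first five rows by default, a quick way to eyeball the data
print(df.head())
```

In practice you would pass a file path, e.g. `pd.read_csv("customers.csv")`; the StringIO wrapper is only there to make the sketch self-contained.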
You can see that across the top, in bold, we have our column names; that is our column index. And on the left-hand side we have a row index, where we see zero through four at the top of our DataFrame. Those two indexes give us a ton of power to slice the DataFrame and look at things in different ways. The index and the columns are both attributes on the DataFrame object: there's a RangeIndex, which is our indexer for the rows, and there's a columns attribute, which is an Index object holding the column names. The other thing I wanted to point out is the last order date column. We have date values in there, but we also have NaN, and that is the null value representation in pandas, which is going to give us a lot of power for dealing with null values. That'll come up later, but I just wanted to note it. At the bottom you can see we've sliced our DataFrame two ways using the .loc indexer: first by rows and then by columns. So we have rows 10 through 15, and then we've taken just the city and classification columns. That's the basics of what's going on with the DataFrame object. Now we'll talk a little bit more about slicing, splitting, and joining. One thing to know about pandas is that the DataFrame is a two-dimensional indexed table. The other key object in pandas is the Series, which is just an indexed array. Any time I want to look at a column of my DataFrame as a Series, I can simply put the column name in brackets after the DataFrame name. Here you can see the classification values, the first few values with their index. One of the really useful methods we have on the Series object is value_counts.
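A small sketch of the index, columns, and .loc slicing ideas just described, again on made-up data since the original isn't available (note that .loc is label-based and inclusive on both ends of a row slice):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Albany", "Trenton", "Hartford", "Newark", "Boston"],
    "classification": ["b", "a", "b", "c", "b"],
    "last_order_date": ["2023-01-05", np.nan, "2023-02-10", np.nan, "2023-03-01"],
})

# The row index and column index are attributes on the DataFrame
print(df.index)    # RangeIndex for the rows
print(df.columns)  # Index object holding the column names

# .loc slices rows first, then columns; labels 1 through 3 are inclusive
subset = df.loc[1:3, ["city", "classification"]]

# Bracketing a column name gives that column back as a Series
classification = df["classification"]
print(subset)
```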
So you can see I've taken the value counts of my classification column, and that gives me all the unique values and the number of occurrences; we can see that the value b occurs 509 times. Another really useful method on a Series is unique, which gives us back an array of all the unique values. In this example, we're looking at all the different states that are in this DataFrame. A key concept when we're indexing data in pandas is the idea of a Boolean mask, or Boolean series. There are a number of different ways we can generate a series of Boolean values, and we can then use it as a mask, or indexer, on our DataFrame. Down here I've set the variable class_b_mask equal to a comparison, where I've simply compared a pandas Series to another value, just the string b. That variable, as you can see below, is a series of True and False values. Where that becomes very powerful is when we use it to index the rows of our DataFrame. Here I have the DataFrame sliced by class_b_mask, and you can see at the top that only rows where the classification is b are shown. We can also use the invert operator on our series: by adding it in below, we get only the rows where classification is not b. So those are the fundamentals of defining masks and slicing DataFrames with them. There are a number of really powerful methods for creating these Boolean series. The first one we'll talk about is isnull. As I mentioned at the very beginning, pandas uses the np.nan object to mark missing values, and with isnull we can find out where those missing values are. You can see here we're chaining value_counts onto our series so we can see how many are True and how many are False.
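The mask-building moves from this section can be sketched in a few lines; the data is invented, but the mechanics (value_counts, unique, comparison masks, the ~ invert operator, and isnull) follow the talk:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "classification": ["b", "a", "b", "b", "c"],
    "state": ["NY", "NJ", "NY", "CT", "MA"],
    "last_order_date": ["2023-01-05", np.nan, "2023-02-10", np.nan, "2023-03-01"],
})

# value_counts: each unique value and how often it occurs
counts = df["classification"].value_counts()

# unique: an array of the distinct values
states = df["state"].unique()

# A Boolean mask: compare a Series to a value, then use it to index rows
class_b_mask = df["classification"] == "b"
class_b = df[class_b_mask]   # only rows where classification is "b"
not_b = df[~class_b_mask]    # the invert operator gives everything else

# isnull builds a mask of missing values; value_counts summarizes it
null_dates = df["last_order_date"].isnull()
print(null_dates.value_counts())
```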
So looking at this, we know that 344 of the values in the last order date column are null. There's also a notnull method, which is just the inverse. Pandas also has a really powerful set of string methods on a Series, under the .str accessor. You can see here we have str.contains, and that takes a regular expression, so it's very powerful. I've just put in a simple word, but what this gives us is that there are 54 cases in the account name column that contain the word Group. Another method on a Series for defining a mask that's really useful is isin. Here you can see we've taken the state column and used the isin method with a list of states: New York, Connecticut, and New Jersey. And there we can see that 46 rows have one of those state values. And we can combine these Boolean series with logical operators. Below I've said, let's look at rows where the account name contains Group and the last order date is not null, and we can see there are only 32 rows where both of those things are true. And finally, on the splitting topic, a really valuable pattern for cleaning data is to split on some mask and create two separate DataFrames, apply different treatments to each, and then bring them back together. This is a very simple example: we've created a DataFrame called tri_state_df, which is anywhere we have New York, Connecticut, or New Jersey as our state, and then outside_tri_state, where we've applied the invert operator to the same mask, which gives us everything outside of those three values. Then, as a simple example, we've created a new column called weighted customer value. For local customers, maybe we want that to be a little higher, so we multiplied by a higher multiplier than outside the tri-state area. And then you can see down below we've sorted the values with the sort_values method.
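Here's a compact sketch of str.contains, isin, combining masks with &, and the split-treat-recombine pattern; the account names, multipliers, and column names are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "account_name": ["Arnold Group", "Smith LLC", "Jones Group", "Lee Inc"],
    "state": ["NY", "MA", "CT", "NJ"],
    "customer_value": [100.0, 200.0, 150.0, 120.0],
})

# str.contains takes a regular expression; here just a literal word
group_mask = df["account_name"].str.contains("Group")

# isin tests membership against a list of values
tri_state_mask = df["state"].isin(["NY", "CT", "NJ"])

# Masks combine with & (and), | (or), ~ (not)
both = df[group_mask & tri_state_mask]

# Split on the mask, apply a different treatment to each piece, recombine
tri = df[tri_state_mask].copy()
outside = df[~tri_state_mask].copy()
tri["weighted_customer_value"] = tri["customer_value"] * 1.5
outside["weighted_customer_value"] = outside["customer_value"] * 1.0
combined = pd.concat([tri, outside]).sort_values(
    "weighted_customer_value", ascending=False)
```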
We can specify the column we want to sort on, so I've sorted by weighted customer value with the highest at the top, and we can see the higher multiplier applied for Connecticut and New Jersey here. We're going to move on now to looking at how we actually transform text. There are a number of transformations in the Series .str methods. As one very simple example, we can do str.upper, and that's going to capitalize our account names; you can see they're all capitalized there. There's also str.replace. Again, this takes a regular expression; I've done a very simple replace here and just replaced a dash with a space, and you can see Medina-McCullough now has a space instead of a dash. So the string methods are very powerful, but a lot of times we want to define our own function that does something a little different. For that, we can use the apply method on a Series, which applies the function to every value. I created this very simple function, customer_code, that basically just breaks a string into words, takes the first letter of each word, capitalizes it, and sticks them back together. So you can see we've created the customer code AG down here, which corresponds to Arnold Group. A lot of times we'll want to do something a little more complicated and create a transformation based on multiple columns, not just on one Series. The DataFrame also has an apply method, and we can define a function that operates on a row. This is basically the same function, except that instead of taking a string value, it now takes a row of pandas data, and we can reference the values in specific columns right in the function. So now we say row["account name"].split up here, and then below we add in the dash and row["state"] and tack it on the end. So now you can see our new customer code.
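A minimal sketch of the text transformations above; the function names and sample rows are made up, but customer_code mirrors the first-letter-of-each-word function described in the talk, and the row-wise version shows the DataFrame apply with axis=1:

```python
import pandas as pd

df = pd.DataFrame({
    "account_name": ["Arnold Group", "Medina-McCullough"],
    "state": ["NY", "NJ"],
})

# Built-in string methods live on the .str accessor
upper = df["account_name"].str.upper()
no_dash = df["account_name"].str.replace("-", " ", regex=True)

# A custom function applied to every value in a Series
def customer_code(name):
    # first letter of each word, upper-cased and joined back together
    return "".join(word[0].upper() for word in name.split())

df["customer_code"] = df["account_name"].apply(customer_code)

# The same idea as a row-wise function on the DataFrame (axis=1 = per row),
# so it can pull from multiple columns at once
def customer_code_with_state(row):
    code = "".join(w[0].upper() for w in row["account_name"].split())
    return code + "-" + row["state"]

df["customer_code_state"] = df.apply(customer_code_with_state, axis=1)
```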
We've got the same code, but now the state value from a different column is also included. It's a very simple example, but it's a powerful construct that can be used to do all kinds of data transformations. And I'll just note that when you use the apply method on a DataFrame, by default it applies column-wise; that's why you see the axis=1 argument added there, which tells the apply method to apply row-wise rather than column-wise. Another method for doing transformations that's very useful in data cleansing is creating a mapping and then mapping values over. I defined a very simple dictionary here that just breaks six states into different regions, and then we can use the map method on a Series. It will basically look up each value as a key in the dictionary and give us back the corresponding value, so here we're getting regions back from state values. And we can very easily define a new column in the DataFrame using that mapping. The next thing we'll talk about is working with missing values. Here we've created a column called weeks since last order, and some of those values are null, as we can see. Pandas has a fillna method on Series, and we could put a string value in there if we wanted to and just say: if it's null, we'll say "never". We can also calculate the value we want to put in there, and in a lot of contexts this might make sense: we take the max of the weeks since last order column and fill all the null values with that maximum value. The next topic is merging DataFrames. Here we have a new table introduced, called id_df, and you can see it just has an account ID and an account name. Pandas has a merge function that we can use to bring the id_df table together with our df table. So you can see this pd.merge statement: basically we enter the two DataFrames that we're merging as the first two arguments, and I have sliced the first DataFrame.
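The map and fillna steps can be sketched like this; the region dictionary and column names are invented, but the two fillna variants (a constant string versus a value computed from the column) match the approaches described:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "state": ["NY", "CT", "NJ", "MA"],
    "weeks_since_last_order": [4.0, np.nan, 12.0, np.nan],
})

# map looks up each value as a key in the dictionary
region_map = {"NY": "tri-state", "CT": "tri-state",
              "NJ": "tri-state", "MA": "new-england"}
df["region"] = df["state"].map(region_map)

# fillna can take a constant, such as a label for "no orders yet"...
as_never = df["weeks_since_last_order"].fillna("never")

# ...or a value computed from the column itself, here the column max
max_weeks = df["weeks_since_last_order"].max()
df["weeks_filled"] = df["weeks_since_last_order"].fillna(max_weeks)
```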
A lot of times when you're doing a merge you may have a really large DataFrame, but you don't want all of its columns in your merge result, so you can slice it right there in the statement; I've selected out just the account name and city columns. Then you can define which columns you want to join on. If the name is the same in both DataFrames you can just use on, but if they're different you use left_on and right_on. So you can see here we're saying that for df we're joining left_on the account name column, and for id_df we're joining right_on the name column. And you can see now we've got our account ID lined up with the city, or whatever other data from df we'd want to combine with it. One thing to note here: you might wonder how this match is working, what kind of join is being used. If you don't specify, pandas does an inner join. You can see when you look at the length of the result that it's only 441 rows, whereas our DataFrame is just over 1,000 rows. In a lot of cases that's not what we want; we want to keep all the original rows and supplement them. For that, we can specify the how parameter on our merge. What I've done here is the same merge, but now I've added how="left". This does a left join, and we keep all the rows in our DataFrame. We can see that now we actually have almost 1,100 rows. And you might say: we only had 1,015, so why do we now have 1,098? What that indicates is that there must have been duplicate values in our id_df. When you're merging, ideally we'd like to merge on an ID that's unique in both DataFrames, but a lot of times that's not possible and we're trying to do some kind of messy entity resolution. With this, we can look at where values are duplicated and break things down further from there. And one thing we might want to do is look at all the rows where we weren't able to match, where we weren't able to assign an account ID.
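A small sketch of the merge mechanics just described, with tiny invented tables in place of df and id_df; it shows the default inner join, the how="left" variant, and finding unmatched rows via a null account ID:

```python
import pandas as pd

df = pd.DataFrame({
    "account name": ["Arnold Group", "Smith LLC", "Lee Inc"],
    "city": ["Albany", "Boston", "Newark"],
})
id_df = pd.DataFrame({
    "name": ["Arnold Group", "Smith LLC"],
    "account id": [101, 102],
})

# The key columns have different names, so use left_on / right_on.
# The default join is inner: unmatched rows simply drop out.
inner = pd.merge(df[["account name", "city"]], id_df,
                 left_on="account name", right_on="name")

# how="left" keeps every row from the left frame; misses become NaN
left = pd.merge(df, id_df, left_on="account name",
                right_on="name", how="left")

# Rows we could not match: the account id is null
unmatched = left[left["account id"].isnull()]
```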
And so what we can do is slice our DataFrame where that account ID is null, and you can see here a list of account names for which we weren't able to match an account ID from our other table. We can again use a tactic similar to the one we used earlier: identify the matched and the not matched, and split those into two separate DataFrames. Then we might kick the unmatched rows out to a CSV file and work on that part manually. And that is another strategy: frequently you can do some work outside of Python, some manual work, or maybe you have some resources that can help you there, and then you can bring it back into Python and rejoin the data. The last topic we're going to talk about is grouping by values and dealing with duplicates. So we've got a new DataFrame that we're dealing with here, called order_df, and it's very simple: we just have an account name, an order amount, an order date, and a product. If we want to think about what's duplicated in this DataFrame, pandas has a duplicated method on Series. So we take the account name column, look at what's duplicated there, and get another Boolean mask. We can take value_counts of that, and we can see that we have 1,650 values that are duplicated. Now, one thing to know about the duplicated method is that by default it marks the first in a set of duplicates as not duplicated, and then the rest as duplicated. That's controlled by an argument called keep, which defaults to "first". You can also set it to "last", which does the opposite, or you can set it to False, as I've done in the next statement. When you say .duplicated(keep=False), what that's going to do is flag anything that's part of a duplicate group as True. So that's often a helpful tool. If we want to use our strategy again of splitting a DataFrame into separate parts, we could split it into the part that has duplicates and the part that doesn't have duplicates.
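The duplicated behavior and the keep argument can be sketched on a four-row example; the data is made up, but the keep defaults and the split into duplicated/non-duplicated parts follow the talk:

```python
import pandas as pd

order_df = pd.DataFrame({
    "account_name": ["Arnold Group", "Arnold Group", "Smith LLC", "Lee Inc"],
    "product": ["A", "B", "A", "A"],
})

# By default (keep="first") the first of each duplicate set is marked False
first_kept = order_df["account_name"].duplicated()

# keep=False flags every member of a duplicate group as True
all_flagged = order_df["account_name"].duplicated(keep=False)

# Split into the part with duplicates and the part without
dupes = order_df[all_flagged]
uniques = order_df[~all_flagged]
```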
And that might be helpful in working through our cleansing problem. We can also use the duplicated method on a DataFrame; we've been looking at it on a Series, but it exists on DataFrames too. By default it looks for every single column in a row being part of a duplicate group, but we can define a subset of columns to consider for duplication. Here I've defined the subset as account name and product, so we'll only get the rows where both the account name and the product are part of a duplicate group. And you can see the number of duplicates identified that way is lower, which makes sense. So that's the duplicated method. Another really useful method when we're dealing with duplicates is groupby. What groupby gives us is a GroupBy object based on a column that we provide, or you can pass multiple columns in a list and group by all of them. And there are methods on the GroupBy object that are really powerful. So in this first example, we've grouped by account name and then used the filter method, which lets us filter on attributes of each group. I've used an anonymous function here, and we're saying: filter where the length of the group's DataFrame is exactly two. So the only results we're going to get back in this DataFrame are account names with exactly two orders on them, essentially. Another really useful function on groupby is agg. That allows us to do aggregation: we can pass the agg method a dictionary that maps a column name to the function we want to aggregate with. So here we're saying we want to aggregate order amount with sum, so this groups by account and gives us the total amounts. And then finally, we can also use an apply function against a GroupBy object.
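Here is a sketch of DataFrame.duplicated with a subset, plus groupby with filter and agg; the order data is invented, but the calls mirror what the talk describes:

```python
import pandas as pd

order_df = pd.DataFrame({
    "account_name": ["Arnold Group", "Arnold Group", "Arnold Group",
                     "Smith LLC", "Smith LLC", "Lee Inc"],
    "product": ["A", "A", "B", "A", "B", "A"],
    "order_amount": [100, 150, 200, 50, 75, 25],
})

# duplicated on a DataFrame, restricted to a subset of columns:
# only rows where BOTH account_name and product repeat are flagged
pair_dupes = order_df.duplicated(subset=["account_name", "product"],
                                 keep=False)

# filter keeps whole groups that satisfy a condition on the group
two_orders = order_df.groupby("account_name").filter(lambda g: len(g) == 2)

# agg with a dict mapping column name -> aggregation function
totals = order_df.groupby("account_name").agg({"order_amount": "sum"})
```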
So with that, we basically get a DataFrame for each duplicate group, and then we can carry out various operations against it. I define a dedupe_orders function here: it sums up all the orders in the group, counts the sales, then identifies the most recent purchase date and takes the product that aligns with that row. And then it only gives us back groups that have at least one product B value in them. So that is the end of the talk, and I think we're out of time. Thanks everyone for joining.
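The talk's dedupe_orders function isn't shown, so this is a hypothetical reconstruction from the description: one summary row per account with the order total, order count, and the product on the most recent order, followed by a filter to the described product B condition:

```python
import pandas as pd

order_df = pd.DataFrame({
    "account_name": ["Arnold Group", "Arnold Group", "Smith LLC"],
    "order_amount": [100.0, 200.0, 50.0],
    "order_date": pd.to_datetime(["2023-01-01", "2023-03-01", "2023-02-01"]),
    "product": ["A", "B", "A"],
})

def dedupe_orders(group):
    # the row of the most recent order in this duplicate group
    latest = group.loc[group["order_date"].idxmax()]
    return pd.Series({
        "total_amount": group["order_amount"].sum(),
        "order_count": len(group),
        "latest_product": latest["product"],
    })

# apply runs the function once per group and stacks the results
summary = order_df.groupby("account_name").apply(dedupe_orders)

# keep only accounts whose most recent product is "B"
b_only = summary[summary["latest_product"] == "B"]
```

The filtering step here is applied after the groupby apply rather than inside the function; the talk's exact condition ("at least one product B") may differ slightly from this latest-product check.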