Hello everyone, excited to be here. Good evening to you, or for me, good morning. I'm excited to talk about pandas. I've written a couple of books on pandas: I wrote the Pandas 1.x Cookbook, and I've done some machine learning books and some other Python books. I run a company called MetaSnake that does corporate training and consulting, and hopefully I've seen some of you in classes before. If you're interested in these concepts, follow me; I'm doing some more hands-on courses on them. What we're going to talk about is loading our data, looking at the types of data, doing chaining operations, mutating data, using the apply method, and aggregation. This is not a beginner's talk on pandas; I'm going to assume you have some knowledge of pandas. This is going to be best practices: things I've learned from using pandas for a long time, teaching pandas to thousands of people, writing books on pandas, and seeing a lot of bad recommendations out there. Sort of my rant to help people. So I'm just going to load my libraries here and clear my cells. Okay, I'm loading pandas. I'm using pandas 1.3, but you don't have to; most of what I'm going to talk about applies to basically any version of pandas released in the past four years or so. I'm using a dataset from fueleconomy.gov, a US website that, for basically every car sold in the US, tracks some metrics: miles per gallon, information about the engine, that sort of thing. If I scroll down here, you can see there are 41,000 rows and 83 columns, so a pretty big dataset. I'm just going to show some steps and processes that I use.
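A minimal sketch of that loading-and-selecting step (the frame and the extra column here are tiny stand-ins for the real fueleconomy.gov download):

```python
import pandas as pd

# Hypothetical stand-in for the 41,000-row, 83-column fueleconomy.gov data;
# in practice you'd pd.read_csv the download and keep only the columns you need.
raw = pd.DataFrame({
    'city08': [19, 9, 23],
    'highway08': [25, 14, 33],
    'comb08': [21, 11, 27],
    'make': ['Toyota', 'Ford', 'Honda'],
    'year': [2019, 2019, 2020],
    'extra_col': [1, 2, 3],   # one of the many columns we drop
})

cols = ['city08', 'highway08', 'comb08', 'make', 'year']
autos = raw[cols]
print(autos.dtypes)
```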
I'm going to pull out a certain number of these columns: city mileage, the combined highway and city mileage, the highway mileage, the number of cylinders, the engine displacement, the type of drive, an engine description, a fuel cost, make, model, transmission type, the range, when it was created, and the year. So a bunch of different types here, and I'll show some practices that I like to do. One thing you might want to do is look at the dtypes, and you can see that we've got integers, floats, and objects. Generally these objects are going to be string types, but they could be numeric types or date types as well. Let's look at how much memory we're using. I'm going to use deep=True to make sure that, because I have string types, I'm capturing the memory the strings use, and let's just sum that up. So we've got about 19 megs of data. Now I'm just going to go through cleaning these up a little bit. I said we're going to look at the types, so let's select the integer columns and call the describe method on them to look at the ranges. You can see that, for example, combined goes up to 136 and highway goes up to 124; we've got some that go up to 7,000 or so. Different ranges there. I actually wouldn't write this like this; I recommend using this chaining style. What I do is put a parenthesis at the front. That tells Python this is a parenthetical (like doing addition before multiplication in math), but it also lets me escape Python's indentation rules, so I can put each of these operations on its own line. This is the same as the code above, just written on multiple lines, which lets me very easily see what's going on. Let me just show you here; I can sort of walk through this. Here's my original DataFrame.
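The chained select-and-describe, plus the deep=True memory check, might look like this on a toy frame (column names are illustrative):

```python
import pandas as pd

# Toy stand-in for the autos frame; 'make' is a string/object column.
autos = pd.DataFrame({
    'city08': [19, 9, 23, 17],
    'highway08': [25, 14, 33, 24],
    'comb08': [21, 11, 27, 20],
    'make': ['Toyota', 'Ford', 'Honda', 'Mazda'],
})

# deep=True counts the actual bytes of the Python string objects too.
mem_bytes = autos.memory_usage(deep=True).sum()

# Chained version of "select the integer columns, then describe":
# the outer parentheses let us put one operation per line.
summary = (
    autos
    .select_dtypes('integer')
    .describe()
)
print(mem_bytes)
print(summary.loc['max'])
```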
I'm going to pull out these columns, select a few that are integers, and describe those columns. So, for example, maybe I want to change some of these column types. By default pandas uses int64. If I use the NumPy library I can see that np.int8, which is an 8-bit (1-byte) integer, goes up to 127, so I could probably use that for highway08. If we look at int16, we can see that goes up to 32,000, and so for most of these I can update them. So I've got an operation here: I'm going to use the astype method to convert highway08 to int8 and city08 and combined to int16, and let's do a describe on these new columns. That looks good. You can look at the maximum and minimum values here as well and see that I can certainly use those types without any loss of information. Okay, and if I want to look at all the integer types, I can do select_dtypes with 'integer' as a string and look at all the integer columns in there, and those look good after I've applied this update. Let's look at the memory usage now: I'm down to 18 megs, so we saved a little bit just by changing a few of those types. Same thing for floats: I can look at the float types, and I've got cylinders, the number of cylinders, and engine displacement. It looks like cylinders is integer-ish. Let's do a describe on that, and you can see that the quantiles, the minimum, the maximum, and the 25th, 50th, and 75th percentiles are all integers.
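The downcasting step can be sketched like this; np.iinfo confirms the ranges before converting (toy data, illustrative column names):

```python
import numpy as np
import pandas as pd

# np.iinfo reports the range a NumPy integer dtype can hold,
# so we can check a column fits before downcasting.
print(np.iinfo(np.int8).max)    # 127
print(np.iinfo(np.int16).max)   # 32767

autos = pd.DataFrame({
    'city08': [19, 9, 137],
    'highway08': [25, 14, 124],
    'comb08': [21, 11, 136],
})

# highway08 tops out at 124, so int8 is safe; the others need int16.
shrunk = autos.astype({'highway08': 'int8',
                       'city08': 'int16',
                       'comb08': 'int16'})
print(shrunk.dtypes)
```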
This is actually an integer column, but it had missing values in it (you can see there are 206 missing values using value_counts there), and pandas doesn't support missing values in lowercase int64 types, so it's stored as float. I'm going to query that and look at where those values are missing, and if you look at those rows we see things like Altra EV, RAV4 EV, so a lot of these are electric vehicles. So maybe, if cylinders is missing, rather than saying it's missing we say it's zero, and then convert that to an int8. Similarly with displacement: I've got some missing values in there, and I'm just going to fill those in with zero as well, and we can summarize that to make sure it looks okay afterwards. So let's convert our displacement to float16 instead of float64. You can again use np.finfo to look at the ranges of your floating-point types, and if we do that, our data looks okay there. We can sum up our memory usage, and we're now down to 17 megs by converting those to int8s and smaller floats. Let's look at object types. For object types, again, generally those are strings. The columns I've got here are drive, engine description, make, model, transmission, and created-on. It looks like drive is probably categorical, engine description looks somewhat freeform, make is categorical, model is categorical but with a large degree of cardinality, transmission actually looks like two columns, and created-on looks like a date. So let's dig in and look at those. The first thing I like to do is apply the value_counts method to my categorical columns, and there you go: you can see that drive is indeed categorical.
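The fill-with-zero-then-downcast idea for cylinders and displacement might be sketched as (toy frame, assumed column names):

```python
import numpy as np
import pandas as pd

# NaN in an integer-ish column forces float64 in plain pandas dtypes.
autos = pd.DataFrame({
    'cylinders': [4.0, 6.0, np.nan, 8.0],
    'displ': [2.0, 3.5, np.nan, 5.0],
})

# Missing cylinders are mostly electric vehicles, so treat them as 0,
# then downcast; these small values fit easily in int8 / float16.
cleaned = autos.assign(
    cylinders=autos.cylinders.fillna(0).astype('int8'),
    displ=autos.displ.fillna(0).astype('float16'),
)
print(cleaned.dtypes)
```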
Let's look at where drive is missing. I set dropna=False here, and you can see there are eleven hundred values missing. I'll just use a query to look at where they're missing, and it looks like a lot of these, again, are electric vehicles; just scrolling through them, a lot of electric vehicles. So maybe what I do is fill in the missing drive values with 'Other' and then convert the column to a category. I'm also going to convert make to a category, and let's look at our memory usage: it's now down to 12 megs. So we started at 19 and we're down to 12, and we've still got a bit to go. Let's look at transmission. Transmission looks like a column with two pieces of information: whether something's manual or automatic, and the number of speeds. So I'm going to split it apart into two columns. I'll make an automatic column, which is just whether the value contains the string 'Auto', and a speeds column, where I'll use a regular expression to extract the number; if there are missing values I'll fill them with 20 and convert the column to an int8. Let's look at our memory usage now: we're down to 10 megs, almost half as much as we were using originally. Let's convert our dates. We do have that created-on column, so I'm going to try to_datetime. Now I get, not really an error, but a warning saying that the time zone EST is identified but not understood. It turns out that EST isn't a time zone, it's an abbreviation, and the warning says to pass the tzinfos argument to correctly make it time-zone aware. It turns out that pandas doesn't really expose that at this level, so I've got a little bit more to do on that.
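The transmission split and the drive/make category conversions described above might look roughly like this (toy data; filling missing speeds with 20 follows the talk):

```python
import pandas as pd

autos = pd.DataFrame({
    'trany': ['Automatic 4-spd', 'Manual 5-spd', 'Automatic (S6)', None],
    'drive': ['Front-Wheel Drive', None, 'Rear-Wheel Drive', None],
})

cleaned = autos.assign(
    # one column: is it an automatic?
    automatic=autos.trany.str.contains('Auto').fillna(False),
    # another column: pull the first run of digits out of the string,
    # fill missing with 20, downcast to int8
    speeds=(autos.trany.str.extract(r'(\d+)', expand=False)
            .fillna('20').astype('int8')),
    # missing drive -> 'Other', then categorical to save memory
    drive=autos.drive.fillna('Other').astype('category'),
)
print(cleaned[['automatic', 'speeds', 'drive']])
```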
Here's the issue: it's got these ESTs and EDTs in there, and pandas will convert it, but it's going to give that warning. So I have to jump through a few hoops to get this to go. I'm going to replace the EDT and EST abbreviations with their UTC offsets, convert that to UTC, and then, given that this is all Eastern time, convert to the New York time zone (America/New_York is a proper IANA time zone). Let's run that, and you can see that when I run it I don't get any warnings. I'm also pulling off the engine description, and we'll look at that as well. It turns out that handling time zones and dates is kind of a pain; probably my next series of talks is going to be on time zone handling in pandas, since it's not super straightforward. But let's look at handling this FFS, this engine description. You can see that this is more freeform: it's got parentheses and non-printing characters and weird spacing in here. So this looks like freeform data, which is the bane of a data scientist or anyone analyzing data. What I'm going to do is just pull out whether 'FFS' is in there, as a boolean, and then drop the original column. If you look here, there are a lot that have FFS in them.
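The time-zone workaround might be sketched like this: swap the abbreviations for fixed UTC offsets, parse as UTC, then convert to the IANA zone (the date strings here are simplified stand-ins for the real created-on values):

```python
import pandas as pd

created = pd.Series(['2013-01-01 00:00:00 EST',
                     '2013-07-01 00:00:00 EDT'])

# 'EST'/'EDT' are abbreviations, not IANA time zones, so substitute the
# fixed offsets, parse everything as UTC, then convert to America/New_York.
stamps = (created
    .str.replace('EST', '-05:00', regex=False)
    .str.replace('EDT', '-04:00', regex=False)
    .pipe(pd.to_datetime, utc=True)
    .dt.tz_convert('America/New_York')
)
print(stamps)
```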
And so maybe that FFS flag is some sort of indicator. At the end I'm going to have this function called tweak; let's run it and make sure it works, and it looks like it does. This is just going to clean up our columns, and that looks pretty good. Now, that's the first part: getting types right. Once you've got types right, you can save memory, but also: if you have string types you can do string manipulation, if you have categorical types you can do categorical manipulation, and if you have dates you can do date manipulation. So it's really nice to get your types into the right format. Part two is chaining. I find a lot of people aren't used to this style; it's also called flow programming. Rather than making intermediate variables, we leverage the fact that in pandas almost every operation returns a new object, and we just keep working on that. I've got a hint here: if you can't find a way to chain, you can use the pipe method. pipe is a method in pandas that takes a function; that function receives the current value of the DataFrame or Series, and you can return whatever you want from it (generally I'll return a DataFrame). This is my little function, and I like to do this when I'm using pandas: I'll take a raw dataset, go through a process similar to what I just showed, and then make a function called tweak-whatever-my-data-is and put it right at the top of my notebook. So I start with my raw data, and then I have another function that will clean it up. When I come back to my notebook, I run the cell to load my raw data, run the cell to clean it up, and I'm good to go. The other thing: if you look at this, it reads like a recipe, right? I'm going to take my autos DataFrame, pull out these columns, and assign these columns: cylinders, displacement, etc.
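The tweak-function pattern might look like this minimal sketch (the column names and cleanup steps are illustrative, not the talk's exact function):

```python
import pandas as pd

def tweak_autos(raw):
    # Return a cleaned-up copy of the raw frame: select columns,
    # fill missing values, downcast types, all in one chain.
    return (raw
        [['city08', 'highway08', 'make']]
        .assign(city08=raw.city08.fillna(0))
        .astype({'city08': 'int16', 'highway08': 'int8'})
    )

# Hypothetical raw data with a junk column and a missing value.
raw = pd.DataFrame({'city08': [19.0, None], 'highway08': [25, 33],
                    'make': ['Toyota', 'Tesla'], 'junk': [1, 2]})
autos = tweak_autos(raw)
print(autos.dtypes)
```

Keeping the raw load and the tweak call in the first two cells of the notebook means rerunning two cells gets you back to clean data.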
I'm going to convert these types, and I'm going to drop these two columns as well. So this reads just like a recipe: step by step, what you're going to do. It makes your code easy to read. Now compare and contrast that code, those easy step-by-step operations, with something like this. This is what I normally see in Medium posts or wherever. This isn't doing everything the chain does, but it's sort of what people do: they make these intermediate variables and then do some operations. Does it work? It might, but if I run it here you'll see there are a bunch of warnings, and this is sort of the bane of people who use pandas everywhere: the SettingWithCopyWarning. No one really knows what this is, so they throw a bunch of .copy() calls into their code to get around it. The whole reason people try this style is to avoid making copies, but they end up making copies anyway. So I don't recommend this style. Note that it's longer and not as readable; your colleagues won't like it as much, and you won't like it as much when you come back to it. So I highly encourage you not to use it. Now, some people claim that the chained style isn't easy to debug or understand, but you can actually debug it really easily.
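Inspecting mid-chain state with pipe can be sketched like this (get_var and the saved dict are hypothetical helpers, not pandas built-ins):

```python
import pandas as pd

saved = {}

def get_var(df, name):
    # stash the intermediate frame so we can inspect it after the chain runs
    saved[name] = df
    return df

autos = pd.DataFrame({'city08': [19, 9, 23],
                      'make': ['Toyota', 'Ford', 'Honda']})

result = (autos
    .assign(lper100km=235 / autos.city08)
    .pipe(get_var, 'd3')       # capture the state mid-chain
    .query('city08 > 10')
)
print(saved['d3'].shape, result.shape)
```

You can equally pipe in IPython's display to print the intermediate frame instead of saving it.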
I don't buy that claim; after using pandas for many years, it's not the case. If you want to, like I said before, you can just comment these lines out and walk through each step to see what's going on as you run it. Alternatively, you can do things like I'm showing here and leverage pipe. If I want to see the intermediate value after the first step in my recipe, I can pipe in this function I made up here, get_var, which just makes a variable called d3. Let's run this, and if I look down here, d3 is the intermediate value, so I can do things like that if I want an intermediate variable. Alternatively, I can pipe in a function that just calls display (the IPython display function that shows something in the notebook). If you run this, you'll see that it prints out the intermediate state right there, and the output is right below it. So you can tack these pipe operations onto the chain to either make a variable or show the contents, and that makes it really easy to debug and see what's going on. And again, I do recommend commenting out and stepping through your code that way. The next part of the talk is: don't mutate. When I show this to people, they say, "Well Matt, you're wasting all this memory: you're not using inplace, you're not using index assignment, so you're wasting memory." It turns out that maybe I am, but that's how pandas works, and Jeff Reback, a core developer of pandas, says the inplace operation rarely does something in place. You think you're saving memory, but you're not: inplace usually makes a copy under the covers and then swaps the copy in for the other object. So you're not really saving any more memory than you would with chaining. There's actually an issue, which I've linked to, to remove and
deprecate inplace. So, in summary, if you do mutate (and again, I see a lot of people saying you need to): you don't really get performance benefits from it, you can't chain (so your code's going to be ugly), and you're going to get all these silly SettingWithCopyWarnings that you'll work around by making copies. You're actually making it worse for yourself. You'll note that if you use the style I show, with assign and without copies, you'll never run into a copy issue at all. Okay, the next part: don't use apply if you can avoid it. Here's my autos dataset; I'm just going to make this autos2 here. Now, this is very US-centric: we talk about things like miles per gallon, but most people around the world use liters per 100 kilometers. If I wanted to convert that city mileage to liters per 100 kilometers, I could do something like this with a Python function that does the conversion for me, using the apply method. However, I could also use a vectorized broadcast operation to do the same thing: 235 divided by the series. And if you look at these, they give the same result. The difference is that apply has to go through Python and the other one does not, and if you look at the timing, when I ran these on my machine earlier this morning, apply was around 50 times slower. That's because it has to deserialize, push things into Python, and go back; right now it looks like it's even slower. So, around a 50x performance difference for a numeric operation. Avoid apply if you're doing math. Now I'm going to do some string manipulation. Maybe you want to ask whether something is an American car, right?
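The apply-versus-broadcast comparison for the mileage conversion might look like:

```python
import pandas as pd

city = pd.Series([19, 9, 23], name='city08')

# Slow: runs a Python function once per element.
slow = city.apply(lambda mpg: 235 / mpg)

# Fast: one vectorized broadcast over the whole Series.
fast = 235 / city

print(fast.round(1).tolist())
```

Both give identical results; on a 41,000-row column the broadcast version is dramatically faster because the loop happens in compiled code.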
So here are car makes that are American, and let's use apply for that, versus isin, which is a method on a Series. When you run this, you'll see that the difference isn't quite as drastic, and the reason is that string operations in pandas aren't vectorized. So while I say avoid apply, I think you can make an exception: if you're doing string manipulation, apply might be fine. Now, note that make here is actually categorical, and so you do get some benefit from the categorical operation, because it doesn't have to run on every single string. I'm just going to make an autos3 here where I convert make back to a string, and you'll see that when I do these operations on plain strings, there is a performance penalty. In this case, apply goes from being almost as fast to a lot slower once we convert make back to a string. These other examples just show some other ways to do this: you can use a where operation, or np.select, or np.where, which are basically the pandas and NumPy versions of if statements. Again, I'd recommend those for numeric operations; for string manipulation you might be fine with apply, because string operations aren't vectorized in pandas. Okay, the next part: master aggregations. If you're not familiar with aggregations, they're sort of like pivot tables in Excel. I'm going to take my autos DataFrame and group it by year; you'll see this returns a lazy object, and then I'm going to take the mean of that. What that does is put the year into the index and give me the mean of every numeric column. This is kind of cool. Literally this could be written as one line of code; I've written it as three because it's easier to understand, but with one line of code I have pretty big powers here.
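The isin membership test and the np.select alternative mentioned above can be sketched as (the list of American makes is illustrative):

```python
import numpy as np
import pandas as pd

make = pd.Series(['Ford', 'Toyota', 'Chevrolet', 'BMW'])
american = ['Ford', 'Chevrolet', 'Dodge', 'Tesla']  # illustrative list

# .isin instead of .apply for a membership test
is_us = make.isin(american)

# np.select / np.where as a vectorized if/else
origin = np.select([make.isin(american)], ['US'], default='Other')
print(is_us.tolist(), list(origin))
```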
Now, this groupby stuff can be confusing. Let's do another one. I'm going to group by year, pull out the combined mileage and the speeds column, and then take the mean of that. We can also do it the other way: group by year, take the mean first, and then pull off combined and speeds afterwards. If you change the order, you'll get the same result, but your timing might differ: if we pull off combined and speeds after the aggregation, we're aggregating all the numeric columns; the other way, we're aggregating just those two. So the order can make a difference depending on how many columns you want. I'm going to set some styling here, because I'm going to add some plots. What I'd like to do now is visualize this pretty quickly. Again, here's my groupby (it's lazy); let's take the mean, and I'm going to use pandas to plot it. It turns out that when you plot in pandas, the default is a line plot that plots each column against the index, so the index becomes the x-axis and each column becomes its own line. So it's pretty easy to make a little line plot. You can see that my styling uses a font that's missing on this machine, but I'll just ignore that pink warning. There's a visualization that makes it very easy to see that combined mileage has been going up in recent years; speeds are going up slightly as well, but not as much. Now, once I've got this little chain, if I want the median instead of the mean, I just swap those methods out. There's the median; I can look at the 30th-percentile quantile, or the standard deviation, or the variance. This chaining style makes it very easy to just try things out and see what's going on, right?
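That groupby recipe might be sketched like this on toy data (swap .mean() for .median(), .quantile(.3), or .std() to get the other summaries):

```python
import pandas as pd

autos = pd.DataFrame({
    'year': [2018, 2018, 2019, 2019],
    'comb08': [20, 30, 25, 35],
    'speeds': [6, 8, 6, 10],
})

yearly = (autos
    .groupby('year')          # lazy GroupBy object
    [['comb08', 'speeds']]    # aggregate only the columns we need
    .mean()
)
print(yearly)
# appending .plot() here would draw one line per column
# with 'year' (the index) on the x-axis
```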
And if I want the original data, I just comment out that plot and I'm back to my data right there. Now, in this case, I'm going to group by two things, year and country, and take the mean of that. And it says country is not defined, so let me find my country function; I forgot to run that cell. Okay, so country is just a function that says, is it in the US or is it in another country? Let's go back down to where we were. What I'm going to do is make a new column called country, then group by year and country, and take the mean. You see that this is what's called a hierarchical or multi-index: the index has two values in it. If you've used pivot tables, that should be somewhat familiar. What I'm going to do now is my aggregation by those two columns, but with multiple aggregation functions: the minimum value, the mean value, and the second-to-last value, which is a user-defined aggregation, just showing that you can write your own. And there we go: now we've got not only a multi-index but hierarchical columns as well. You can go crazy with that if you want. So one thing you might want to do is try to plot that, and then you get this plot.
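The two-key aggregation, the user-defined function, and the unstack-plus-rolling smoothing might be sketched together like this (toy data, illustrative values):

```python
import pandas as pd

autos = pd.DataFrame({
    'year': [2018, 2018, 2019, 2019, 2020, 2020],
    'country': ['US', 'Other'] * 3,
    'city08': [18, 22, 20, 26, 21, 28],
})

def second_to_last(ser):
    # user-defined aggregation (falls back to the last value
    # when a group has only one row)
    return ser.iloc[-2] if len(ser) > 1 else ser.iloc[-1]

# Two grouping keys plus several aggregations gives a hierarchical
# row index AND hierarchical columns.
agg = (autos
    .groupby(['year', 'country'])
    .agg(['min', 'mean', second_to_last])
)

# unstack moves 'country' from the index into the columns, so each
# country becomes its own line when plotted; rolling(2) smooths it.
wide = (autos
    .groupby(['year', 'country'])
    .city08
    .mean()
    .unstack()
)
smooth = wide.rolling(2).mean()
print(agg)
print(smooth)
```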
It's a little bit silly, because it's trying to plot each of these columns against an x-axis that is now a tuple of year and country, so it's a little weird. What you can do is learn to leverage this thing called unstack. Let me show you this code. Here's what I had before, and now I'm going to unstack it. What this does is take the innermost index level, the country, and stick it up into the columns, so now I just have year in the index. And after I do that, I can plot the city08 columns. Let me walk through this: there's my unstack; I'm going to pull the city08 column, which has two sub-columns below it, US and Other, and plot that. Now I can easily visualize what's going on with mean city mileage for the US and other countries over the past 20 years or so, and I can use this little matplotlib loop to stick the legend out to the side so it's not on top of my plot. Again, you could write this as one line of code, but I'm splitting it out to make it clear to see. And if I want to smooth this out, I can just stick in a rolling; a lot of people do that with time-series data, taking a rolling average. I just say rolling(3), take three segments, and take the mean of those to aggregate, and that smooths it out: you can see this one's a little more jagged, this one's a bit smoother. Okay, so in summary: correct types will save space; they enable math if you have string types that need to be converted to numbers; and they enable things like date functionality. I'm a huge proponent of chaining: if you don't chain, you should start. It's going to make your code more readable, it's going to remove bugs, and it's going to make debugging easier; I showed all of those things here. Also, don't mutate in pandas; there's usually no point. Embrace chaining.
I realize that a lot of people don't use chaining, but if you look at the core developers, they actually recommend this style as well. Again, apply is slow for math operations. I see a lot of people saying "use apply, use apply"; don't use apply unless it's a last resort. Apply works fine for string manipulation, but for math you generally want something else, otherwise you're going to have slow operations. And lastly, aggregations are powerful. I get it if you haven't used them a lot; things like unstack and hierarchical multi-indexes can be confusing. My advice is to look at examples and start playing with them until they make sense. If you found this interesting and you like these sorts of things, follow me on Twitter: dunder m harrison. I tend to tweet a lot about Python and data science, because that's what I do for my work: I teach people how to leverage Python and analyze data. And like I said before, I'm going to have an upcoming course where we'll work on these sorts of things with your own data, so if you're interested in that idiomatic pandas workshop, again, follow me on Twitter. Okay, let's see, there are some questions here. Aditha asks: "I think I saw something along the lines of inplace not being the best way. I use inplace to reduce the code and not assign to the same DataFrame. Is there a better way to do that without inplace?" Yeah, you should use chaining. If you use chaining, like you saw in my example above, Aditha, you'll have fewer lines of code compared to the non-chained version, you won't have any bugs, and it'll be easier to read, especially if you make one of those tweak functions: you'll be able to come back to your notebook and use it very easily. Okay, other questions? We've got about a minute left, I think. Okay, well, I'm not seeing anything here.
If you want to reach out to me, I'll be in the speaker lounge for a little bit, or you can reach out on Twitter; happy to continue the conversation there. Thanks everyone for letting me join you. It's been a pleasure. Have a great conference, and I hope in the future to actually be at PyCon India in person. Bye everyone, have a great day.