I'm Nathan, and we're here today to talk about doing math with pandas. If that sentence makes no sense to you, don't worry. You can go check out our previous videos to get up to speed on what we're going to be doing here today. Basically, we're going to be cleaning up data and doing some simple statistics with a pandas data frame. Don't worry, it's going to be a lot easier than the math behind us. Pandas will help us out.

All right, so first things first: we're going to import pandas as pd, like we have in previous videos, and then we're going to load a data frame and call it df, from a file which we saved in a previous video. When we instantiated our data frame originally, we created a large empty data frame to make things faster and a little bit simpler. The result of that is we ended up with a lot of empty rows, rows with NaN values in them. So let's get rid of those first with a df.dropna. So this is "drop not-a-number", essentially. And inside of here we're going to do an inline... all right, inplace=True. This is just so we don't have to write df equals this; otherwise it'll return a new data frame with those rows dropped rather than changing df itself.

So from here, we're just going to look at some basic statistics of our data frame. So let's try a df.mean... I'm sorry, we'll select a column from our data frame. You can see we have city, price, lat, long. So why don't we do df.price, and then we'll do a .mean with parentheses. And there you can see that the mean of our data set is $878 for an iPhone, a used iPhone. Yeah, so if you don't remember, this is for an iPhone off of Craigslist. So, not new, not recent. So there's probably some incorrect data in here. I agree. If we do a df.price.median(), we get the actual middle value of all of our data. Our data set is around 8,000 data points, so this is probably more accurate at this point.
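A minimal sketch of the steps so far, assuming a small made-up data frame in place of the file saved in the earlier video (the cities, prices, and coordinates below are hypothetical, not the real Craigslist data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the data frame loaded from the earlier video's file.
# The last two rows mimic the leftover empty rows full of NaN values.
df = pd.DataFrame({
    "city": ["Austin", "Boise", "Denver", "Harrisonburg", None, None],
    "price": [150.0, 180.0, 200.0, 5401234.0, np.nan, np.nan],
    "lat": [30.27, 43.62, 39.74, 38.45, np.nan, np.nan],
    "long": [-97.74, -116.21, -104.99, -78.87, np.nan, np.nan],
})

# Drop every row that contains a missing value, modifying df itself
# instead of returning a new data frame.
df.dropna(inplace=True)

# The mean is pulled far upward by the one extreme price;
# the median stays near the typical listings.
mean_price = df.price.mean()
median_price = df.price.median()
print(mean_price, median_price)
```

With one phone-number-sized price in only four rows, the mean lands in the millions while the median stays between the ordinary listings, which is the same effect the real data shows on a smaller scale.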
So you can really see how an outlier value can really skew your mean. So let's go and find those outliers. What we're going to do next is sort our values by price, and we're just going to look at the city and the price values at this point, just to clean up what it looks like. We don't really care about the latitude and longitude at this point. So that's what the first brackets are for: city and price. And then we're going to do a .sort_values. Inside of here, we're sorting it by price. By default, this does it in ascending order, and we just want the top values at this point, so we'll do a .tail. Let's grab the last 30. What do you say? Yeah. So we'll go ahead and hit that and, look at that. You can see we have that 530 here and, whoa, suddenly we're up into the digits, all the way up into the seven digits here. Geez. Harrisonburg, Virginia: there's a five million dollar iPhone. That's... yeah, seven digits. So that sounds an awful lot like... that's a phone number. Okay. So of course, it's Craigslist. Other people also put in 12345 and 111111. Most likely these values are not quite accurate, and so we're just going to drop all the values that are extreme in this case.

As you can see, it jumps from 650 to a thousand dollars. How about a thousand? Yeah, cut off everything above a thousand. Okay, so we're just going to reassign this data frame to itself by sub-selecting on df.price, only values less than a thousand. You can put any boolean statement in here with your data frame to get any sort of subset, and we're just setting it back to df. So, our new data frame. Let's see what the mean... yeah, let's see that.
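The sort, tail, and boolean filter just described can be sketched like this, again on made-up rows (all values are hypothetical). The numeric_only=True argument is my addition: recent pandas versions may raise an error when asked for the mean of a text column such as city.

```python
import pandas as pd

# Hypothetical rows standing in for the Craigslist listings.
df = pd.DataFrame({
    "city": ["Austin", "Boise", "Denver", "Harrisonburg", "Tulsa"],
    "price": [150, 650, 50, 5401234, 12345],
    "lat": [30.27, 43.62, 39.74, 38.45, 36.15],
    "long": [-97.74, -116.21, -104.99, -78.87, -95.99],
})

# Keep only city and price, sort ascending by price, and take the last rows
# to see the most expensive listings.
top = df[["city", "price"]].sort_values("price").tail(3)
print(top)

# Boolean mask: keep only rows priced under 1000, assigned back to df.
df = df[df.price < 1000]

# Mean of every numeric column versus the mean of the price column alone.
column_means = df.mean(numeric_only=True)
price_mean = df.price.mean()
print(column_means)
print(price_mean)
```

Swapping .tail(3) for .head(3) on the same sorted frame shows the cheapest listings instead.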
Oh, I should probably just sub-select the price column. But you can see, if you use the mean method on the whole data frame, you get a mean value for each column in that data frame. But there you go: df.price.mean(). The mean of that is 179. And just before we go too much farther, let's also look at the head of our sorted values, just to make sure we don't have any extremely low values. So, okay... yeah, I'll just copy what I had before. We're just going to do a head instead; this will give us the low-end values. There we go. $50 is their starting point. A lot of $50 phones. Probably reasonable, since there are so many at this point, and they're old phones; some of them might be cracked and damaged and everything. That's reasonable. Yeah, I think we're good on that front. Yep.

So we looked at our mean, and that was 179, so let's also look at our median and see if that changed much. Also 180. So that didn't change, because we only dropped about 20 values out of our 8,000 or so. And just to prove to ourselves that we did actually get rid of those, we'll do another tail of our sorted prices. Hit that, and sure enough, we lost all our extreme values. Wow, great.

So just to do some other statistics: one thing that lots of people love is standard deviations. Some people might not consider it love, but you might, so that's why we're here. The way to do this in pandas is df.price, and we'll do a .std. No, not that type of STD; std as in standard deviation. Hit that, and you can see our standard deviation for the price column is 62. If you know much about standard deviations with Gaussian distributions, three standard deviations should get you close to most of your values. I believe it's about 98, 99 percent. Okay, so we're going to display those two bounds. To do that, we've just got to start with our mean. Okay, but our price's mean, right?
I don't need those latitude and longitude means, that's true. And then we'll minus three times df.price.std(). Oops. There we go. We'll copy this to get our upper bound. That's our lower bound, since we're subtracting three standard deviations. Sure enough, we get minus 7 and 366 for our bounds at three standard deviations. So that's fairly reasonable. That sounds reasonable, yeah.

So, I don't know, to me, doing the math is nice, but it's also nice to visualize. So we'll show you a little bit of how to do that here. Pandas does have some simple plots and everything. Yeah. So to be able to do that, we're going to introduce another bit of IPython or Jupyter magic, and this one is matplotlib inline. You start it off with that percent sign and do a matplotlib inline, and this will display any matplotlib plots that you make from your data frame right inline in the notebook. And this gives you a bit of underlying knowledge about pandas: the plots in pandas are actually created with matplotlib. Good point. Yeah, good point.

So let's just do a simple df.price, and we're just going to do a histogram. The method for that is .hist. You don't have to specify any arguments or keyword arguments. We'll just hit that, and it spits out a nice-looking matplotlib histogram. Each of these bars is the quantity of how many values fall in that range. And as you can see, our bounds that we got before were minus seven dollars to 366, I believe. It looks like most of our values actually fall in there. That's about right. You know, these blocks are a little bit coarse for my liking. One thing you can do is specify the number of bins, so we're going to do bins=50.
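As a rough sketch of the three-standard-deviation bounds and of what the histogram bars count, here is the same arithmetic on a short made-up price series (the values are hypothetical). The bin counting uses numpy.histogram directly so it runs without a plotting backend; that substitution is mine, not what the video types. For a true Gaussian, about 99.7 percent of values fall within three standard deviations of the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned-up prices.
prices = pd.Series([50, 100, 150, 200, 200, 200, 250, 300], name="price")

mean = prices.mean()
std = prices.std()          # pandas uses the sample standard deviation (ddof=1)

lower = mean - 3 * std      # lower bound: mean minus three standard deviations
upper = mean + 3 * std      # upper bound: mean plus three standard deviations

# Fraction of prices that fall inside the two bounds.
inside = prices.between(lower, upper).mean()
print(lower, upper, inside)

# Count how many prices fall in each of 5 equal-width bins. These counts are
# what the bars of a histogram show; in a notebook with matplotlib available,
# prices.hist(bins=5) would draw them.
counts, edges = np.histogram(prices, bins=5)
print(counts)
print(edges)
```

Raising the bins argument splits the same price range into narrower intervals, which is why the plot with more bins shows finer detail.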
We'll try that. Hit that, and you can see that gives a little bit more character to the data. You can see, for instance, that a lot of people put their iPhone on the market for two hundred dollars. So those are just a few ways to look at the data.

So with that, we're going to save our cleaned-up data. As we showed you last time, you can just do a quick df.to_csv and then put your file name in there. We'll just call this df_clean.csv and hit enter. Now we have our data for next time. Next time, we're going to actually look at plotting this on the US. So, great. Thanks for joining us. Be sure to hit subscribe if you enjoyed our video, and we'll see you next time.
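A small sketch of the save step, assuming a two-row made-up data frame and a temporary directory so that nothing is left on disk (both are my choices for illustration; the video simply writes df_clean.csv to the working directory):

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned-up data frame.
df = pd.DataFrame({"city": ["Austin", "Boise"], "price": [150, 200]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "df_clean.csv")

    # Write the data frame to CSV; by default the index is written
    # as the first column.
    df.to_csv(path)

    # Read it back, telling pandas that the first column is the index.
    reloaded = pd.read_csv(path, index_col=0)

print(reloaded)
```

Reading the file back with index_col=0 avoids ending up with an extra unnamed column holding the old index.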