In this notebook we're going to do some summary statistics, some of the most basic things that you can do in data science. We're going to import some data and we're just going to describe it, and that really is one of the first things you do when you start analyzing data: once you've got the data in Python and you've cleaned it up, you try to understand the story, the information that's hidden inside of that data. We are in notebook 05, descriptive statistics. As I just mentioned, it's very difficult for human beings to stare at, for instance, a big spreadsheet file or a massive amount of images or audio waveforms, no matter what our data in data science is, and just get the information that's hidden in it. We have to tease out that knowledge, that information, that story the data wants to tell us, and one of the first things we do is simply summarize the data. That's what descriptive statistics, or summary statistics, is all about.
So let's look at the packages we're going to use in this notebook. We're going to import pandas: import pandas as pd, with pd as the namespace abbreviation, so we've imported that namespace with all its functionality into Python. Then a new one, SciPy, scientific Python, which expands even further on what NumPy can do. It has a module inside it called stats, and in stats you get a bunch of statistical functions. I'm going to say from scipy import stats, so I'm not using a namespace abbreviation; if we want to use the functions in that stats module we'll have to say stats dot. From google.colab we're going to import the drive function, because our spreadsheet is in Google Drive; this is specific to Google Colab, and with our data being in Google Drive we have to use that drive function. And then, as always, I use the %load_ext magic command just to print my spreadsheet files, my data frames, out to the screen nice and neatly. Next up we're going to mount our drive; remember that's drive.mount, and we pass an argument that is a piece of string, so it goes inside quotation marks: forward slash gdrive. I showed you in the previous notebook the folder structure where this file is saved, with the data sub-folder right at the end. I've got to put all of that as a string after the %cd (change directory) magic command, because I want to change to that sub-folder at the end of the path, since that's where my spreadsheet files are. So I'm going to run this cell; remember that's going to open a new login for me, I have to re-log into my Google Drive, give it the permissions it requires, and then copy the security key it gives me into the little block that appears here. And there you go, it's happened. You can also see the %ls magic command there; that just shows me all the files inside that folder, and the one we're going to use in this section is the client_data.csv file.
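As a rough sketch, the setup cell looks something like this. The Colab-specific lines are left as comments because they only run inside Google Colab, and the exact Drive path and data-table extension name are assumptions rather than something you can run locally:

```python
# Sketch of the notebook's setup cell. The Colab-only lines are
# commented out; they work only inside Google Colab, and the exact
# folder path is whatever sub-folder in your Drive holds the data.
import pandas as pd      # dataframes: read_csv, describe, etc.
from scipy import stats  # statistical functions such as stats.gmean

# from google.colab import drive     # Colab-only: access Google Drive
# drive.mount('/gdrive')             # opens the login / security-key prompt
# %load_ext google.colab.data_table  # (assumed) pretty dataframe printing
# %cd /gdrive/...                    # change into the data sub-folder
# %ls                                # list files, incl. client_data.csv

print(pd.__version__)
```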
Once again I'm using the read_csv function from the pandas package: pd, my namespace abbreviation, so pd.read_csv, and because I've already changed directory into that folder I can reference the CSV file directly. We're going to assign that to a computer variable called df to hold our data frame object. Let's have a look at the data frame object just to make sure it imported properly, and there we see we have first_name and last_name. It's very important to have those underscores when you design these, so that you don't have illegal characters and can just use the dot notation when referring to a column. We also see age, number of children, email address (all of this is fictitious data), job title, home loan, more than one vehicle, financial literature review, savings, investments, etc., so some economic data pertaining to specific customers or clients. Let's examine this data frame object as we always do, by calling some of the attributes we've seen before. The first one is the shape attribute, which tells us how many rows of data we have, so how many subjects or how many observations; those are all synonyms for rows. There are a thousand rows and eleven variables. If I want to see the names of those eleven variables, remember that's the columns attribute, so df.columns, and that shows me all the variables we're dealing with in this data set. Every row is an observation or subject; every column is a very specific variable with a very specific data type. So let's look at those data types, at least what pandas thinks they are. first_name is an object, which is a string, which is a categorical variable, and so is last_name. age is a 64-bit integer, children is a 64-bit integer, email is an object, in other words a categorical variable or string, and so is job title. Then home loan, it sees those as bool values, Boolean, true or false values.
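The real client_data.csv sits in the course's Google Drive, so as a stand-in here is a tiny made-up frame (column names taken from the lesson, values invented for illustration) showing the shape, columns and dtypes attributes in action:

```python
import pandas as pd

# client_data.csv lives in the course's Google Drive, so this sketch
# builds a tiny stand-in frame; column names come from the lesson,
# the values themselves are invented for illustration.
df = pd.DataFrame({
    "first_name": ["Ann", "Ben", "Cara", "Dan"],
    "last_name": ["Smith", "Jones", "Lee", "Kim"],
    "age": [34, 51, 42, 65],
    "children": [2, 3, 0, 3],
    "home_loan": [True, False, True, False],
    "more_than_one_vehicle": [False, True, True, False],
    "savings": [12000.0, 35000.5, 8000.0, 61000.0],
    "invest": [2, 5, 2, 1],
})

print(df.shape)    # (rows, columns): here (4, 8); in the lesson (1000, 11)
print(df.columns)  # the variable names
print(df.dtypes)   # what pandas thinks each column's data type is
```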
more_than_one_vehicle it also sees as true or false values. And if we look up there, indeed home loan and more than one vehicle were captured as true or false, so pandas correctly identified those as Boolean values, and you can see the rest there. The first thing we're going to do is just counting, and we're going to refer to these counts as frequencies: how many times does each of the sample space elements in a variable occur? We've seen the unique method before, so I'm going to say df.invest; remember that gives me back just that column as a pandas Series. Now that I have that column as a Series object I can call the unique method on it, and that shows me the sample space elements inside the invest column. I see three, one, two, five, four, and remember, that's the order in which it encountered them going from top to bottom. A much more useful method when it comes to frequencies is value_counts. Once again I say df.invest, which gives me a pandas Series of the invest column, and then I call the value_counts method. Not only does that give me back the unique elements (you see there two, five, four, one, three), but it gives me the frequency of each of them, so you can see how many times the value two appeared in that column, then five, four, one, three. It sorts by frequency in descending order, so the sample space element in my invest variable that occurred most commonly was two, and it occurred 210 times. So that's a frequency. We can also do relative frequency, where we divide by the sum total; remember, there were a thousand observations, so I just divide each count by a thousand.
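A minimal sketch of unique versus value_counts, using a short made-up invest column rather than the real thousand-row one:

```python
import pandas as pd

# A short made-up stand-in for the invest column.
invest = pd.Series([2, 5, 2, 1, 3, 2, 5, 4, 2, 5])

print(invest.unique())          # sample space elements, in order encountered

counts = invest.value_counts()  # frequency of each element, descending
print(counts)                   # the most common element sits in the top row
```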
Dividing by the total gives me a proportion or fraction for each element instead of the absolute count, which is the frequency; a fraction of the whole is what we call a relative frequency. All I have to do is use one of the keyword arguments of the value_counts method, and that's normalize, with a z, since this is all American spelling. So df.invest again gives me the invest column in pandas Series format, I call the value_counts method on it, and I pass the argument normalize set to True. Remember, for the actual Python keywords True and False it's uppercase T and uppercase F. If I do that, I now see the fractions, and the fractions, the proportions, must always sum to 1.0. If you really wanted that as a percentage (I don't quite see why, but we can), just do some broadcasting where we multiply each value by a hundred, and then we get percentages: 21%, 20.8%, 20.1%, etc. So a little exercise for you there if you're interested: stop the video and do that for yourself. More interesting, though, are grouped frequencies. How do we do grouping? Grouping occurs by the sample space elements of a categorical variable: you take a categorical variable, it has a sample space of unique values, we group by those, and then we might count the frequency of some other categorical variable. What we have here in the first instance is what we call a contingency table, using home loan, which is a categorical variable, and more than one vehicle, which is also a categorical variable. We're going to pass those two categorical variables in Series format, using df. followed by the name of each column.
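The normalize keyword and the broadcasting trick look like this on the same made-up invest column; the proportions sum to 1.0, and multiplying by 100 turns them into percentages:

```python
import pandas as pd

invest = pd.Series([2, 5, 2, 1, 3, 2, 5, 4, 2, 5])  # made-up stand-in column

rel = invest.value_counts(normalize=True)  # relative frequencies (proportions)
print(rel)
print(rel.sum())   # the proportions always sum to 1.0
print(rel * 100)   # broadcasting: each proportion becomes a percentage
```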
That gives me back, remember, pandas Series objects, and I pass those two to the crosstab function: pd for pandas, so pd.crosstab. Look what it does: it gives me this nice little table. Home loan, False and True, goes down the side, and the second variable, more than one vehicle, False and True, goes across the top. If we look at no home loan and not more than one vehicle, there were 268; a home loan without more than one vehicle, 239; then 244 for more than one vehicle with no home loan; and right at the bottom, 246 for more than one vehicle with a home loan. So that's a two-by-two contingency table, two rows and two columns, and that's what you see there with those four values. Now, measures of central tendency. We've talked about how to count things, frequency and relative frequency, and as the course goes on we'll see many more examples of those; remember, this is just an introduction. So let's do some point estimates of central tendency. What we do there is take a variable and all its values and calculate a single value that is representative of the whole, and that's what a mean or average is: take a bunch of numerical variable values and express a single value that represents all of them. The first one is the arithmetic mean, or the average, and remember how you calculate it: you just sum up all the values and divide by how many there are. So let's see what the average or mean age was of all the customers. I can just say df.age, which gives me just that column as a pandas Series object, and then I call the mean method on that, not passing any arguments, and that gives me the mean age in our data set: 44.739 years. Now, what is the mean value of customer savings? Remember, I could just say df.savings.
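A small sketch of pd.crosstab on two made-up Boolean columns; the counts differ from the video's 268, 239, 244 and 246 because the data here is invented:

```python
import pandas as pd

# Two made-up Boolean columns standing in for the real ones.
df = pd.DataFrame({
    "home_loan":             [True, False, True, False, True, False],
    "more_than_one_vehicle": [False, True, True, False, True, False],
})

# 2x2 contingency table: home_loan down the side, vehicles across the top.
table = pd.crosstab(df.home_loan, df.more_than_one_vehicle)
print(table)
```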
savings but I'm just reminding you here if you do have illegal characters like spaces in your column names or your variable names you can always put them inside of quotation marks inside of square brackets so df and then savings and then dot mean and that's going to print out for 30,136.49 as far as the average is concerned now let's do something much more interesting I want to know the age the average age of only the customers that have a home loan and remember we do that with conditionals and I'm using a bit of a shortcut here and then I'm not using dot loc or dot i lock I'm just going for it directly df and then all the rows that I'm interested in let's go down the home loan column and look for all the values set to true so equals equals true so that's the only ones that are going to be selected from that give me the age column and of that age column give me the mean so that's how we would construct that that's going to give me the mean or the average age only of those customers that have a home loan so you can also do the mean average or the mean age of the customers without a home loan so that equals equals false as far as that conditional is concerned and then instead of putting age like that just showing you a little bit of a variation I can just say dot age because age contains no illegal characters something a bit more interesting what if I wanted to do those two steps all in one and here is a really a method that you're going to use a lot the group by method I'm saying df dot group by and then inside of parentheses because this is a method I'm going to pass one of my columns inside of quotation marks this time so home underscore loan group by whatever you find in the home loan and then the age and then the mean so let me show you what it looks like and it'll make a little bit more sense so in the home loan it found two sample space elements false and true and then it gives me the mean age of each of those so the customers without a home loan the average 
For those with a home loan, the average age was 44. That groupby is going to become something very useful to us, and we're going to use it a lot. Remember there's also the geometric mean, which is calculated a bit differently and comes from the stats module, so I can't call it as a method on a pandas Series object. I say stats.gmean, for geometric mean, and pass it my pandas Series df.age, and that gives me the geometric mean: multiply all the values together and take the nth root, where n is the number of observations. Next, the median. The median puts all our numerical values in ascending or descending order; if there's an odd number of values it takes the one right in the middle, for which half of the values are less and half are more, and if there's an even number it takes the middle two and averages them. That works very well when our data is very skewed. When data is very skewed we have outliers, and outliers always pull the average towards them, so that value is no longer a proper representative. Remember, that's what we're trying to do with point estimates: find one value that properly represents all the values, and in that instance the mean might not, so we have the median. Let's have a look at this: what is the median savings of customers older than 50? Again we use a conditional: df, and then df.age greater than 50, so it goes row by row down the age column, and from that we want the savings column (I could also just have said .savings), then .median, which calculates the median for me. So that's the median savings of the participants, or subjects, or customers, that were older than 50 years of age. Now, remember how we put conditionals together, with and and or. What we're looking at here is where the home loan is True and the more than one vehicle is True as well.
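Here is a sketch of groupby, the geometric mean from scipy.stats, and a conditional median, again on invented values:

```python
import pandas as pd
from scipy import stats

# Made-up values purely for illustration.
df = pd.DataFrame({
    "age":       [30, 40, 50, 60, 55],
    "savings":   [10.0, 20.0, 30.0, 40.0, 25.0],
    "home_loan": [True, True, False, False, True],
})

# groupby: split on the sample space of a categorical column,
# then aggregate within each group.
print(df.groupby("home_loan").age.mean())

# The geometric mean comes from scipy.stats, not from pandas.
print(stats.gmean(df.age))  # nth root of the product of the n values

# Median savings of the customers older than 50.
print(df[df.age > 50].savings.median())
```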
For those rows, give me the median of the savings; I'm referring you back to the previous notebook, where we looked at how to concatenate conditionals together with and and or. In the next one we say either the home loan is True or the more than one vehicle is True, and then give me the savings column and the median of that savings column. Excellent, and there's a little exercise and solution set for you. Now we come to the mode. The mode is for when we don't have continuous numerical variables; remember, a continuous numerical variable can have many decimal places, and we just truncate those. Maybe the apparatus we use in the lab can only give us a certain number of decimal places, but in actual fact the decimal places go far beyond that, and we don't do a mode for that. Mode means the value that occurs most commonly, so we really go for either discrete data or categorical data, anything that has fixed sample space elements, and if there's one sample space element that occurs most commonly, that's the mode. Sometimes two of them share the equal highest frequency, and we call that a bimodal variable; you also get trimodal and even more, multimodal. Anyway, let's look at the number of children: I say df.children, then value_counts, and remember, it sorts in descending order, so it sees that three children was the most common. You might think children is really numerical, but you can't have a fraction of a child, so it doesn't really help to express the average number of children; you see that done all the time, but it's a difficult concept, because what is 2.18 of a child? So for any kind of discrete data I tend to rather use the mode. Next up we're going to look at measures of dispersion. Now that we know we can take a variable and all its values and calculate a single value that's representative of the whole, we also need to find out how spread out the data is.
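The and/or conditionals use & and | in pandas, each condition in its own parentheses; a sketch with made-up values, ending with the mode:

```python
import pandas as pd

# Made-up values purely for illustration.
df = pd.DataFrame({
    "savings":               [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "home_loan":             [True, True, False, False, True, False],
    "more_than_one_vehicle": [True, False, True, False, True, False],
    "children":              [3, 2, 3, 1, 3, 2],
})

# AND: both conditions must hold; each goes in its own parentheses,
# joined with & for element-wise "and".
both = df[(df.home_loan == True) & (df.more_than_one_vehicle == True)]
print(both.savings.median())

# OR: either condition may hold, joined with | for element-wise "or".
either = df[(df.home_loan == True) | (df.more_than_one_vehicle == True)]
print(either.savings.median())

# The mode: the most frequent element of a discrete variable.
print(df.children.value_counts())  # descending, so the mode is the top row
print(df.children.mode())          # pandas also has a mode method
```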
Those are the measures of dispersion. The first ones we're going to talk about are the standard deviation and the variance, and they actually refer to the same thing: you calculate the variance first, then take the square root of the variance to get the standard deviation. So what is the standard deviation? Well, you can think of it as the average difference between each value and the mean for that variable. I take the difference between each value and the mean, sum all of those up, and divide by how many there are, so that's an average. The problem is, some values are less than the mean and some are more, so we get negatives and positives, and if we sum those up we get zero; that follows from the way the mean is calculated, and it doesn't help us. So what we do is square each difference, and remember, when you square a negative value it becomes positive, so all our differences are now positive. If you add up all those squared differences and divide by how many there are, you get the variance, and you take the square root of that to get back to the standard deviation, which you can think of as that average difference between each value and the mean for that specific variable. It's very easy to do, and I'm just showing you one example here: we group by home loan, so by the sample space elements of that categorical variable, then take the age column and take the standard deviation, and that's the std method. If you look up here there's also a .var method, so it's either .std or .var for standard deviation and variance, and you'll see the two standard deviations there. The range is the difference between the minimum and maximum values.
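A sketch of .std, .var and the grouped standard deviation on invented values; the variance is exactly the square of the standard deviation:

```python
import pandas as pd

# Made-up values purely for illustration.
df = pd.DataFrame({
    "age":       [30, 40, 50, 60],
    "home_loan": [True, True, False, False],
})

print(df.age.std())  # standard deviation: square root of the variance
print(df.age.var())  # variance

# Grouped: one standard deviation per home_loan sample space element.
print(df.groupby("home_loan").age.std())
```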
Remember how we get min and max: df.age gives me only the values for the age column in Series format, and then I call the .min method on that, so the minimum age is 25, as we can see, and the maximum age is 65. The range is the difference between them, so I take the maximum minus the minimum, and that gives me the range, 40. When it comes to age, of animals, of organisms, of human beings, I like to express the range; many times in reports you'll see the standard deviation, but that doesn't tell me that much. I want to know what the minimum and maximum ages were, because if I want to infer the results of your lab to my lab it's kind of neat to know the youngest and the oldest age. That brings us to the quantiles. Remember the median: it chopped all my continuous numerical values in half; I put them in ascending or descending order and chose a value for which half are less and half are more. But I needn't only chop them into halves; I can also chop them into quarters, and for that we get the first quartile, the second quartile, and the third quartile. The first quartile is a value for which a quarter of my values are less than it and three quarters are more; the second quartile is the median in the middle; and for the third quartile, three quarters of the values are less than it and one quarter more. Just for completeness' sake, we actually also have the zeroth quartile and the fourth quartile, which are the minimum and maximum values. The way we do that is with the quantile method, as you can see there. So for age, again that's df.age, then quantile, and inside a Python list I pass the values 0.25, 0.5, 0.75, so we actually put them in as fractions, and that gives me those values.
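Min, max, range and the quartiles can be sketched like this (a made-up age Series whose min and max happen to match the video's 25 and 65):

```python
import pandas as pd

# A made-up age Series; its min and max happen to match the lesson's 25 and 65.
age = pd.Series([25, 35, 45, 55, 65])

print(age.min())              # youngest
print(age.max())              # oldest
print(age.max() - age.min())  # the range

# Quartiles: pass the fractions to quantile() inside a Python list.
print(age.quantile([0.25, 0.5, 0.75]))
```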
As far as age is concerned, we see that a quarter of customers were younger than 35, half were younger than 45, and three quarters were younger than 55, or you can obviously read it in the opposite direction: a quarter were older than 55. But I can express any of those fractions, so here we go for the 95th; now we call them percentiles. We say df, and this time I group by the Trues and Falses in home loan, take the age column, and ask for the 95th percentile, which covers 95 percent of the participants. We see 63 for both, so for those with and without a home loan, 95 percent were younger than 63, or we could say 5 percent were older than 63. The interquartile range is the last one I want to talk to you about, and it becomes very important when we look for outliers in our data, or when we want to create box-and-whisker plots; that's coming up in the next notebook, a very exciting notebook on visualizing data. All I'm going to do is subtract the value of the first quartile from the value of the third quartile; it's that difference between the third quartile and the first quartile that gives us the interquartile range. So I say df.age.quantile(0.75) minus df.age.quantile(0.25), and that gives me an interquartile range of 21. So there you go: we're just starting to tease out the story the data is trying to tell us by summarizing it in some way, either calculating a single value that's representative of the whole, or giving some idea of how spread out the values are.
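Finally, a sketch of the grouped 95th percentile and the interquartile range on invented ages:

```python
import pandas as pd

# Made-up ages, alternating home_loan values, purely for illustration.
df = pd.DataFrame({
    "age":       [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    "home_loan": [True, False] * 5,
})

# 95th percentile of age within each home_loan group.
print(df.groupby("home_loan").age.quantile(0.95))

# Interquartile range: third quartile minus first quartile.
iqr = df.age.quantile(0.75) - df.age.quantile(0.25)
print(iqr)
```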