So we finally get to do some analysis: descriptive statistics, or summary statistics. What you're going to see is that the data hides the information that we require. It hides a story, and we have to tease that story, that information, out. You're also going to hear me repeat myself: human beings are very bad at looking at rows and columns of data and making any kind of sense of it. We have to summarize it in some way. So in this video we're going to open our notebook, connect to our data, import it, inspect it, and then start summarizing it to try to get some information from it, to get a feel for what's going on. That is our first step; we need that understanding of the data before we start to properly analyze it. Here we are in our Google Colab notebook: descriptive statistics, or summary statistics. What is this all about? Well, human beings are very bad at staring at rows and columns of data and making any sense of it. We can't just stare at a spreadsheet file, and maybe that spreadsheet even goes off the screen, so you have to scroll left and right and up and down just to see all those numbers. There is a story hidden inside those data values. There is information there. We asked a research question, and the answer is locked away in the data, and it is our duty to bring that hidden information, that story, out of the data. We start that journey by summarizing the values that we see in our dataset. That's called summary statistics or descriptive statistics. It's the first thing we do when we start to analyze data, to get a sense of what is going on. We summarize all those values: we replace them with one or two values that represent all of them, and those one or two representative values give us a sense of what's really going on. 
So let's start off by just looking at the libraries that we are going to import. You see pandas is there; SciPy, with from scipy import stats, that's a new one; and then the Google Drive import, and the data table import just to show the tables in a nice format. So let's connect. You see right up there, remember, we're going to hit the connect button, or when we run the first code cell it's going to connect anyway. Let's give our internet connection and Google on the other side a chance to connect, and we see everything is connected, so we can start. So pandas, we know almost everything we need to know about right now; we've done pandas, so there's import pandas as pd. And then there's SciPy, that's Scientific Python. That's a fantastic library. It's got a module called stats, and it contains lots and lots of statistical functions, almost everything we are going to need. So from scipy import stats; let's do that. We have to connect to our Google Drive, so from google.colab we import drive, and then, as I said, our data table import so that we have nice tables to print to the screen. Let's import this data and see what it looks like. We're going to connect to our Google Drive. Remember, we've imported drive now. It has this mount function, so drive.mount, and it's always '/gdrive' as a string; you see the quotation marks there. Then we cd into the gdrive, and then we cd again into My Drive, and I'm leaving that space there: if there's a space, I've got to escape the space with a little backslash. Remember, you don't need to do any of that if you put the path inside quotation marks. Also, I've shown you how to import all these files from GitHub, and if you save them in a different folder structure on your Drive, this will look different for you, how you get to this data folder. Then the ls at the end just lists whatever is in this data folder. We've seen how to do this, so let's just run that. 
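The imports described above can be sketched like this; the Google Drive lines only work inside Colab, so they are shown as comments, and the exact path depends on where you saved the data folder:

```python
import pandas as pd      # data frames
from scipy import stats  # statistical functions, e.g. stats.gmean

# Colab-only steps, shown here as comments:
# from google.colab import drive
# drive.mount('/gdrive')
# %cd '/gdrive/My Drive/...'  # path to wherever your data folder lives
# %ls
```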
Now remember, we've got to log on. So I'm going to log on; you know how to do that by now. And we see all the files that are in this folder called data, and we are going to use this data.csv file here. I'm going to call it df, that's my variable name, and you know by now we're using this read_csv function inside of pandas to import this dataset. There we go, let's print it to the screen. We can just say df and, because we imported this Google Colab-specific formatting, we can see the data, and we've seen it before. It's this mock little study of ours, where patients were either in an active or a placebo group, and we've checked their cholesterol before and after, and we see what change there's been in their cholesterol. It's all just random data, there's nothing specific to it. It's not very realistic, but it gives us something to work with, and we don't have actual patient data on here. Let's examine this data frame. It's always important to check that what you've imported is what you expect. The first thing we're going to do is call the shape attribute, or shape property, on our data frame, and that tells us how many rows and how many columns we have. So 200 rows, and if our data is tidy, that means every row is one participant in our study: all their data is on one row. And there are 13 columns, so we've got 13 statistical variables. They'll each be of some statistical data type, which we've discussed, and we've got 13 of those. We can look at these 13 by calling the .columns attribute or property. We see we have name, date of birth, age, their vocation or the job they do, smoke, something about their smoking, their heart rate, systolic blood pressure, cholesterol before, their lipids, the survey question that they were asked, cholesterol after, what the difference was between the two, and what group they were in. And lastly, let's have a quick look at the data types in our data frame. 
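Since data.csv lives on the instructor's Google Drive, here is a minimal sketch with a made-up stand-in data frame, just to show the shape and columns inspection (the real file has 200 rows and 13 columns):

```python
import pandas as pd

# A tiny made-up stand-in for data.csv
df = pd.DataFrame({
    "Age": [45, 62, 30, 75],
    "Smoke": [0, 1, 2, 0],
    "HR": [72, 81, 65, 78],
    "Group": ["active", "control", "active", "control"],
})

print(df.shape)          # (rows, columns): (4, 4)
print(list(df.columns))  # the variable names
```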
Are the data types what we expect them to be? Name is indeed an object; that's nominal categorical. Date of birth is a bit wrong, because we have an object there and that should be a datetime object, so that's a bit of a problem. Age is a 64-bit integer (int64); that's a numerical variable, that's fine. Vocation is an object; that looks good. Smoke is an integer, and we might think there's a problem there. We saw it in the previous video: the values, the sample space elements, were zero, one and two. Those look like numbers, but they represent a nominal categorical variable, so there's a problem there. Heart rate and systolic blood pressure look fine. Cholesterol before is a decimal value, so float64, that's fine. The lipids, that's fine. The survey question, they could choose an answer between one and five, and that is now represented as numbers, but we know it's an ordinal categorical variable; those are not real numbers, so we have to watch out for that one. Cholesterol after looks fine. Delta looks fine. And the group, once again, is a nominal categorical variable, so that looks fine. One of the first summary statistics that we're going to do is just to count certain things, and the best data types to count, of course, are categorical variables. So let's look at counting. Let's start off with frequencies. We have a group column; we've seen that before. Patients were either in the placebo group or they were in the treatment group. So let's just see how the data was collected by looking at the sample space elements: all the different elements that could be chosen from at random for each and every patient, or participant, in our study. So if I say df.group, group being one of the columns, that's going to return for me, remember, a pandas series. A series has certain methods and certain attributes or properties, and we're going to call the .unique method. 
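As a sketch of the two type problems mentioned above, here is how a date-of-birth column imported as object can be converted to a proper datetime, and how a 0/1/2 smoke column can be marked as categorical. The column names here are made up for illustration, not necessarily the notebook's:

```python
import pandas as pd

df = pd.DataFrame({
    "DOB": ["1980-05-01", "1962-11-23"],  # imports as plain object (strings)
    "Smoke": [0, 1],                      # imports as int64
})

# Dates become proper datetime64 values
df["DOB"] = pd.to_datetime(df["DOB"])
# Smoke only *encodes* a nominal categorical variable, so mark it as such
df["Smoke"] = df["Smoke"].astype("category")

print(df.dtypes)
```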
And if we do that, we see there were only two sample space elements for that categorical variable: active and control. So that's a nominal categorical variable; we can't put these in a natural order, so it is nominal. Now we want to know how many of our participants were in the active group and how many were in the control group. We want to count; that's the frequency that we're talking about. To do that, we first call up the pandas series df.group, and then we use this method value_counts. It's a method, in other words a type of function, so it's going to have the open and close parentheses; we're not passing any arguments there. So value_counts is going to look down that pandas series, which is our group column, see these two unique values, and count how many times each of them occurs. And our data set is neatly balanced: we had 100 participants in the control group and 100 participants in the active group. Now we might want to express it as a frequency, 100 in each group, but we might also want a relative frequency. A relative frequency is something we express as part of a whole. So you can well imagine that if the whole here is one, half of the participants would be in the control group and the other half in the active group. To get those relative frequencies, we just use the normalize argument and set it to True. So it's still value_counts, and we just say normalize=True. There we have 0.5 and 0.5, just as we suspected; this is relative to the whole now. We can also express this as a percentage, of course, by making use of broadcasting. Broadcasting just means that if I multiply this series by some number, or add to it, the operation is applied to each individual value. So if I multiply by 100, I get the percentages. 
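A minimal sketch of unique, value_counts and the normalize argument, using a made-up group column balanced like the one in the video:

```python
import pandas as pd

group = pd.Series(["active"] * 100 + ["control"] * 100, name="Group")

print(group.unique())                     # sample space: ['active' 'control']
freq = group.value_counts()               # absolute frequencies: 100 and 100
rel = group.value_counts(normalize=True)  # relative frequencies: 0.5 and 0.5
pct = rel * 100                           # broadcasting turns them into percentages
```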
And in this case it's very simple to see that 50% of people are in the control group and 50% are in the active group. So that is a frequency and a relative frequency. Let's just do a little exercise. I want you to go down the smoke column. It indicates whether the participants never smoked, captured as a zero (and you can see that is why pandas at the moment thinks it's 64-bit integers, but it's really a nominal categorical variable, because we're just using these numbers to represent something more real), whether they are current smokers, that's a one, or whether they have smoked before, a two. So I want you to calculate the frequency with which each element appears. I hope you tried that. Let's have a look at the solution. I'm just going to call df.smoke, which gives me the pandas series for that column, and I'm going to use the value_counts method, and that shows us 88 people are non-smokers, 85 are smokers and 27 are ex-smokers in our group. And, just as an extra little bonus, if we want to express this as a fraction and then times a hundred, which gives us a percentage, and I say ascending=True, then we get this in ascending order of the percentage. So the lowest was 13.5%, those were the ex-smokers, 42.5% were smokers and 44% are non-smokers. We could also say ascending=False there, and then we'd get descending order. Grouped frequencies are the second thing that I want to show you. We might not only want to count down a certain pandas series; we might want to count a categorical variable by splitting the participants up into groups by the sample space elements of some other categorical variable. And look at this: I want to know here, we see df.survey and df.group. So df.survey, remember, there was a survey question. If you watched the video on pandas before, you'll know they could choose one, two, three, four or five. 
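The exercise solution can be sketched like this, with a made-up smoke column that has the same counts as the video's (88 never, 85 current, 27 ex):

```python
import pandas as pd

smoke = pd.Series([0] * 88 + [1] * 85 + [2] * 27, name="Smoke")

counts = smoke.value_counts()  # 88 non-smokers, 85 smokers, 27 ex-smokers
# Percentages in ascending order: 13.5 (ex), 42.5 (current), 44.0 (never)
pct = smoke.value_counts(normalize=True, ascending=True) * 100
print(pct)
```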
I want to count that, but I want to split it up as well by the group. That's called cross tabulation. Pandas has a function called crosstab, and into crosstab I'm going to pass two pandas series, and look at what the results are. The survey was the first argument in this crosstab function, so that goes down the first column: patients, or participants, could choose one, two, three, four or five. And the second one goes across the columns here: active and control. So in the active group, 21 participants chose one in the survey, and 17 in the control group chose one. In the active group, 18 chose two, and 32 in the control group chose two. You can see we're still counting here, but we've broken it up by that second argument that we give there: we've broken it up into the sample space elements of this nominal categorical variable, and we see active and control. If you swap those two around, you're going to see one, two, three, four, five across the top, and active and control down the side. Try that for yourself; you'll see the table just transposes, is what we call it. So put them in whatever order makes sense to you, but you can see here what effect the order is going to have. So we can do these grouped frequencies: that is where we group the counts by the sample space elements of another categorical variable. If we had a numerical variable, we'd have to create a categorical variable out of it by making bins, and then we could do these grouped frequencies; we're counting by certain groups. Now that we know a little bit about frequencies and relative frequencies, let's start discussing measures of central tendency, otherwise known as point estimates. And I'm pretty sure everyone is quite aware of what these are. 
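A sketch of pd.crosstab with two made-up series (the numbers here are illustrative, not the video's):

```python
import pandas as pd

survey = pd.Series([1, 2, 1, 2, 1, 2], name="Survey")
group = pd.Series(["active", "control"] * 3, name="Group")

# First argument goes down the rows, second across the columns;
# swapping the two arguments transposes the table.
tab = pd.crosstab(survey, group)
print(tab)
```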
Now these are single values that are representative of the whole for numerical variables, the first two at least; the third is a point estimate that is of course a bit different. The one most of us know about is the mean, or the average. Its proper term is the arithmetic mean, and that is where we add all the values and divide by how many there are, so that we get this average. So let's look at the mean or average age of all the participants. If I just choose that column, df.age, that gives me the pandas series of that column, and it has a .mean method. You might remember, right in the beginning, we changed this to a NumPy array first and then used the .mean method of the NumPy array, but I told you a pandas series like this also has a .mean method. So if we do that for this pandas series, we see that the mean or average age of all 200 participants was 53.07 years. Let's look at the average heart rate. I've put this in because I wanted to remind you of the different notation to get a pandas series, and perhaps the more proper way to do it: df, then square brackets, and the column name 'HR' in there, instead of the dot notation. So yes, I could also have said df.HR, and that would also work; this is just to remind you of the different notation. We see that the average heart rate of our 200 participants was about 74, call it 75 if we round up, beats per minute. Now, what if we wanted to get a bit more fancy? We want more information from our data, and we're only interested in the average age of the smokers. Remember how we do that? If you can't, this will jog your memory. You can watch that video again, or you can just start using it: the more you write this code yourself, the more you'll remember how to do it, and it'll become second nature. 
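Both notations for the column mean, sketched on made-up values:

```python
import pandas as pd

df = pd.DataFrame({"Age": [50, 56, 53], "HR": [70, 80, 75]})

mean_age = df.Age.mean()   # dot (attribute) notation
mean_hr = df["HR"].mean()  # bracket notation, works for any column name
```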
So I'm going to call df, the data frame, and inside square brackets I'm going to pass this conditional: df.smoke, that's the smoke column as a pandas series, == 1. So go down case by case, and only if the smoke value is one, in other words the smokers, do we include that row in what we're trying to achieve. From that result we want the age column, so we pass that in square brackets and single quotes, and then I have the ages of only the people who are smokers, and I call the .mean method on that. So if we look at our smokers, they are about 56 years old on average. Let's contrast that, because now we're starting to learn something about the data. Let's look at all the people who've never smoked, the non-smokers. That is indicated by zero, so our conditional here would be df.smoke == 0, and from there we want the age. And if we run that, we see they're only 50 years old on average. So there's definitely a difference between the 50 and the 56, but is it of statistical significance? How do we test that? What kind of test do we do? And what about the third group, the ex-smokers, do we have to include them? We're certainly going to get to all of these things: how we know what tests to do, and how we decide whether there is a statistically significant difference between 50 and 56. Now I want to introduce you to the groupby method, because going through each and every one of those sample space elements one by one, especially if there are more of them, like the survey column which has five, is going to take you a long time. This is a much easier way to do it: the groupby method. I'm going to call df, my data frame, and a data frame has a method called groupby, and you tell it by which statistical variable, which of the columns, you want to group. 
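The two conditional means can be sketched like this; the made-up ages are chosen so the answers match the video's 56 and 50:

```python
import pandas as pd

df = pd.DataFrame({
    "Smoke": [1, 0, 1, 0],   # 1 = current smoker, 0 = never smoked
    "Age": [58, 48, 54, 52],
})

smokers_mean = df[df.Smoke == 1]["Age"].mean()  # (58 + 54) / 2 = 56
never_mean = df[df.Smoke == 0]["Age"].mean()    # (48 + 52) / 2 = 50
```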
So I'm going to say group by smoke. It's going to make three separate groups: the zeros, the ones and the twos. Once it's done that grouping, I'm interested in the ages, and then I want the mean, so the .mean method. If you look at this, the way it's constructed makes so much sense. And now we can see: 50 years old on average for the non-smokers, 56 for the current smokers and 53 for the ex-smokers. So do you think there's a statistically significant difference here? I'll show you how to test that in a later video. Very, very easy to do. I want to remind you that there are more arguments to this .mean method. For instance, if there are some participants for whom the age was not collected, you could say skipna=True, but I think that is set to True by default: if there are empty values, they are just ignored. And sometimes, as I mentioned right in the beginning, it's very weird how data is collected: instead of 43, someone is going to write forty-three out in words. So you could also use the argument numeric_only=True; if you do that, all the values that are not numbers will also just be ignored. That's it for the mean. Play around with the data and see what other mean values you can calculate. The next mean I just want to make you aware of, we don't use it that often, is the geometric mean. That is where we multiply all the values and then take the nth root, where n is the number of cases you have. So here we would have to take the 200th root. We all know the square root, that's just the second root, but we'd have to take the 200th root after multiplying all the ages. And that is available as a function called gmean, geometric mean, but it's in the stats module of SciPy. So, from scipy we did import stats. 
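The groupby version, as a sketch on made-up data, including one missing value that the mean skips by default (skipna):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Smoke": [0, 0, 1, 1, 2, 2],
    "Age": [48.0, 52.0, 58.0, 54.0, 53.0, np.nan],  # one age not collected
})

# One mean per smoking group, in a single call; the NaN is ignored
by_smoke = df.groupby("Smoke")["Age"].mean()
print(by_smoke)
```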
So now we can just say stats. and the function that we're interested in, gmean, and what do we pass? We pass the pandas series, which is the age, df.age, and now we get the geometric mean. As I say, it's not something we use often, but I wanted to show you that it exists. Now we move on to the median. It won't make much sense to us yet why you would use the median, but many people would know that we use the mean if the data is nicely distributed as a bell-shaped curve: most of the values are bunched in the middle, and as we get away from the mean, fewer and fewer cases occur. That's the familiar bell-shaped curve. But we don't always find that distribution of values. Sometimes all the values are bunched at the lower end or at the upper end, and sometimes we have values bunched in the middle but with outliers far away. Under those circumstances we use the median instead, and we'll talk all about that in a later video, where I show you how to decide when the values are so skewed that you can't use the mean. That also means there's a whole set of statistical tests you can't use, and we have to go on to other types of tests; that's when the median becomes very important. For now, let's just see how to do the median, and the one that I've got here is the median heart rate of all the patients who are 50 years or older. We know how to do that with a conditional: we call df, and then df.age, which is our pandas series for that column, greater than 50; from that I want the heart rate, and then the .median method. As simple as that. And by the way, what is the median? Well, you put all the values in ascending or descending order and you take the middle one. That's very nice if you have an odd number of values: if you have five values, you place them in ascending or descending order and you take the third one. 
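A sketch of stats.gmean and the conditional median, on made-up values:

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({"Age": [40, 55, 60, 52], "HR": [70, 74, 80, 78]})

# Geometric mean: multiply the n values, take the n-th root
gm = stats.gmean(df.Age)

# Median heart rate of the participants older than 50
med_hr = df[df.Age > 50]["HR"].median()  # middle of 74, 78, 80
```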
The third one will have two values lower than it and two values higher than it; easy. If it's an even number, say you had four numbers, you take the middle two and you take the mean of those two. And if you take that mean, think about it, two values will be lower than it and two values will be higher than it. So the median is quite easy. Here's a little exercise for you. You can stare at it before you go get yourself a drink or take a break, but do come back. Calculate the median age of the participants who smoke and have a heart rate of more than 70. Now that's a bit of a contrived question, but you can well imagine that if you do research, there are very specific participants whose data you want to draw out, so this is exactly the kind of thing you're going to do. First of all, just a reminder of all the columns that we have: we do have smoke there, and we have heart rate. So think about how to construct this. Good, are you back? Let's have a look at the solution. I'm going to use .loc, remember? The .loc, the location indexer. And because I want both of these things to be true, I'm going to use the ampersand, and I'm putting each of my conditionals inside parentheses. So it's df.smoke == 1, only include those, and df.HR > 70; that's exactly what my question said. And because we're using .loc, we are using this row comma column idea: it's going to go down all the rows that match, comma, the column that we're interested in, which is the age column. Then we call the .median method on that, because that's what we want. And we can see the median age of patients who had a heart rate of more than 70 and who were smokers was 58. Very simple. Now we get to the measure of central tendency, or point estimate, that we can use for discrete data or for categorical data. 
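The .loc solution as a sketch, with made-up values chosen so the answer matches the video's 58:

```python
import pandas as pd

df = pd.DataFrame({
    "Smoke": [1, 1, 1, 0, 1],
    "HR": [72, 75, 80, 90, 68],
    "Age": [56, 58, 60, 50, 62],
})

# Both conditions in parentheses, combined with &; rows first, then the column
med = df.loc[(df.Smoke == 1) & (df.HR > 70), "Age"].median()  # median of 56, 58, 60
```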
We can also use it for numerical data, but remember, with continuous numerical data we only capture a certain accuracy, because that's all our devices can measure; in reality those values go on for infinitely many decimal places. So the mode is really for discrete or categorical data, and it's just the value that occurs most commonly for that variable. Pandas series do have a .mode method, but value_counts shows us more, so let's just use value_counts. If we want to see which of our three categories of smokers occurred most commonly, we can just look at this: by default we have ascending=False, but you could also set it to True, and the sort argument is also set to True by default, so it's going to sort them. We see 88, 85 and 27: 88 were non-smokers, 85 were current smokers and 27 were ex-smokers. So the mode of smoke was zero; non-smoker is the mode of the smoke variable. Let's move on to measures of dispersion, or measures of spread. We've had point estimates now, where we represent something by a single value, just a single value. But we might also be interested in how spread out the data is, and that's measures of dispersion. Let's have a look at that. The first one we're going to talk about is actually two things, standard deviation and variance, because they go hand in hand. We can think of the standard deviation as roughly the average difference between each value and the mean for that variable. So we have the mean sitting somewhere in the middle, and we look at the difference between a specific value and the mean; we just subtract one from the other. We're going to get either a positive value or a negative value, depending on whether that value is larger or smaller than the mean, and of course we're going to have both of those scenarios. And if we add all of those up, the negatives and the positives, by design we get back to zero. 
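The mode via value_counts, sketched on a made-up smoke column with the video's counts; the first index of the sorted counts is the mode, and pandas' own .mode method gives the same answer:

```python
import pandas as pd

smoke = pd.Series([0] * 88 + [1] * 85 + [2] * 27)

mode_via_counts = smoke.value_counts().index[0]  # most frequent value: 0
mode_direct = smoke.mode()[0]                    # same answer from .mode
```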
So we have to do something to calculate this average, because as I said, if the standard deviation is this average difference, we've got to add all these differences and divide by how many there are. (We actually divide by how many there are minus one, n − 1, because we're dealing with a sample, but that's not important for us right now.) So how do we get every value to be positive? Well, one way in mathematics to do that, of course, is to square, because any value squared is positive. So we square all the differences, and then we divide by how many there are, and that gives us the variance; that's what we call the variance. But if you think about squaring it: if the unit was years, you're now talking about years squared. What is a year squared? What kind of unit is that? To get back to the units we are dealing with, we just take the square root of this value again, and that turns the variance into the standard deviation. So let's have a look at the standard deviation. We're again going to calculate the standard deviation of the ages, but I'm going to use this groupby method of a data frame: df.groupby, by the three groups of smokers please, and I want the standard deviation of the ages. Now, we remember what the averages were for those groups, and now we have the standard deviations, and we see they're actually quite close to each other. So now we can express the mean plus or minus the standard deviation: the mean or average plus or minus about 12 years, and for group two it's about 13. That plus or minus is one standard deviation away from the mean. Let's have a quick look at the variance then: if you square those values, 12 squared is about 144, and that's how we get to these numbers. The variance really is just the square of the standard deviation, or equivalently, the standard deviation is the square root of the variance. Let's have a look at the range. 
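Standard deviation and variance per group, sketched on made-up ages; note that pandas uses the sample formulas (dividing by n − 1), and the variance is the standard deviation squared:

```python
import pandas as pd

df = pd.DataFrame({
    "Smoke": [0, 0, 0, 1, 1, 1],
    "Age": [40, 50, 60, 45, 55, 65],
})

sd = df.groupby("Smoke")["Age"].std()   # sample standard deviation per group
var = df.groupby("Smoke")["Age"].var()  # sample variance per group
# For each group here: deviations -10, 0, 10 -> var = 200 / 2 = 100, sd = 10
```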
The range is very simply the difference between the minimum and the maximum value. So we call df.age, which once again gives us our pandas series, and we call .min for the minimum and, very simply, .max for the maximum. There's really nothing to it: the range is just the difference between the maximum and the minimum, so I take the maximum and subtract the minimum, and that gives me the range of that variable. The youngest participant was 30, the oldest was 75, and that gives us a range of 45. Quantiles are the last ones we're going to look at. Remember we said that for the median we put all the values in order and split them right down the middle; well, we needn't split down the middle. If we think of all the values as representing 100% of the whole, I can split at 25%, so that a quarter of the values are less than this value and three quarters are more than it. As long as I put the values in ascending or descending order, I can take any percentage split I want, from 0% to 100%. The 100th percentile, of course, will be the maximum, because 100% of the values are less than or equal to the maximum, and the 0th percentile will be the minimum, because no values are less than the minimum. But we can be as particular as we want: say we want the 90th percentile. We call these percentiles if we express them as percentages, but we can also express them as fractions, and then we talk about quantiles. And there is a very well-used set of quantiles: the quartiles. So we have the first quartile, the second quartile and the third quartile. The first quartile, as the name suggests, is the 25th percentile, so 25%, or 0.25, of the values are less than it and three quarters are more than it. The second quartile is nothing other than the median, because it just divides the data set into two. 
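Minimum, maximum and range, sketched on made-up ages that reproduce the video's 30, 75 and 45:

```python
import pandas as pd

age = pd.Series([30, 53, 61, 75])

youngest = age.min()               # 30
oldest = age.max()                 # 75
age_range = age.max() - age.min()  # 45
```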
So half of the values will be less than it and half will be more than it; and the third quartile, I think you get the picture. So how do we do this? Well, we use the .quantile method. Again here I've got age, and I say .quantile and pass, as a list, all the quantiles that I want. These are in fact the quartiles, because 0.25 is the lower quartile, 0.5 is just the median, the second quartile, and 0.75, or 75%, is the third quartile. And there we go, I get these values. The age of 43 means a quarter of the people are younger than 43 and three quarters are older. 54 was our median, we saw that, and 64 is our third quartile. We can also get very specific: say we want the 95th percentile. For quantile we only pass a single value there, 0.95; I want to group by the smokers and see what the 95th percentile of age is for the three different smoking groups, and we see 72, 73 and 71.7. As simple as that. You'll also see, many times, that if we do express something as a median, we don't want the mean plus or minus the standard deviation, we actually want the interquartile range, and in those instances we just subtract the first quartile value from the third quartile value. So it's df.age.quantile(0.75) minus df.age.quantile(0.25); I'm just subtracting those two from each other. And I hope I've really whetted your appetite and shown you just how easy it is to do summary statistics. Many papers that you'll read only summarize the data: it'll be a simple study, just a case series or a case-control series, and we just want to describe the data, describe the kinds of things that we are seeing. I've shown you almost all of the descriptive statistics that you are going to use in your own analysis. So go through this notebook, go through this video again, and start typing and seeing what happens. 
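Quartiles and the interquartile range, sketched on the values 1 to 100 so the answers are easy to check (pandas interpolates linearly between ranked values by default):

```python
import pandas as pd

age = pd.Series(range(1, 101))  # 1, 2, ..., 100

quartiles = age.quantile([0.25, 0.5, 0.75])    # 25.75, 50.5, 75.25
iqr = age.quantile(0.75) - age.quantile(0.25)  # interquartile range: 49.5
```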
And then if it doesn't work, if you get stuck, read the pandas documentation or just Google the problem; you are going to find the answer. That's the beauty of Python: there's such a huge community and there's so much help out there. Or simply leave a comment down below if you've got a question, and I'll answer it for you. Most importantly, get your hands on a data set, download this one, and start typing some code. I hope you enjoy it.