We now understand enough about the deeper levels of statistics that we can start getting some stuff done. In this lecture, I want to talk about measures of central tendency and measures of dispersion. Let's get going.

We're going to, as always, import our style sheet. There we go: it changes to a nice gray background, and everything looks nice. What do we want to import? I'm going to import pandas as pd, numerical Python (numpy) as np, our old friend matplotlib.pyplot as plt, seaborn as sns, and filterwarnings from warnings. I'm going to use my magic command to plot my graphs right inside of this notebook, this web page. Just playing around a bit, I'm changing the style of seaborn. You can, as I say, look up seaborn on Google or in your favorite search engine and see how you can play around with line widths, figure sizes, font scales, the type of background, etc. You can draw some beautiful graphs. With filterwarnings, I just want to ignore warnings, as per usual.

Let's start off with measures of central tendency. We see the mean, median, and mode. Let's import some of our data. I'm going to use read_csv; remember, read_csv is going to put it inside of a data frame.

Let's go: mean, median, and mode. Now, you can have a string of values, and the two of us, me and you, want to have a discussion. Every time we talk about our data, we can't repeatedly tell each other ten thousand data values: "Remember our patient group so-and-so? They had white cell counts of 10 point..." It's impossible to do. We want a compact way of giving each other some representative value, a value that represents the whole data set. The mean is one of those; the median and the mode are two others. So that is a single value that represents all of our data.

Now, the mean is the easiest one. I can ask anyone what the mean, the average value, is. Take the numbers four, five, and six. There's four. There's five.
There's six. You can add all of them and divide by how many there are, and you'll be left with five. So the mean is a very simple way of getting a single value to represent all your others.

The median is a bit more difficult to understand, but sometimes much more useful. It is simply the value for which half of your values are more than that value and the other half of your data set are less. Once again, let's use four, five, and six. Five would also be the median, because of the other values, half lie below five, which is four, and half lie above five, which is six. You only count the values, irrespective of what they are. If I had four, five, and ten million, five would still be the median, because one value is less than five and one value is more than five.

The problem comes in if you have an even number of values. Look at nine, eleven, twelve, thirteen, fourteen, sixteen. Now there are six values. Quickly, you'll discover that the point of division is between twelve and thirteen. So what you do then is take those middle two that you discovered and take their mean: twelve plus thirteen is twenty-five, divided by two is twelve point five. If I were to put in twelve point five there, three values would be less than twelve point five and three values would be more than twelve point five. So that value needn't appear in your data set. It is just a value for which half of the values in your data set are less and half are more.

It's a very useful thing to have, because you might have a data set like this: six values around nine, ten, and eleven, and then thirty and forty-three. Now those six are all very close to each other, but the two others are quite far away from them. Would it really be good to represent this data set with an average, a mean? It's going to be skewed towards the thirty and the forty-three, and you can ask yourself: is that value that we're going to have really
representative of this data set? It might be much better just to give a median in this instance.

There are some other subtle uses of it as well. We're going to talk about the modified Alvarado score for patients with suspected appendicitis. You're going to get a score for your patient, and these are integers: they're going to score three, four, five, six, whatever. If you have a bunch of these patients, is it really fair to suggest that those scores have a certain mean? A mean is much more geared towards arithmetical, numerical values, where you do some arithmetic with those values. If you have integer values like this from a scoring system, it's much better to use a median and say, look, half of the values are more and half are less. It's more representative of the patients, half of which were obviously sicker, clinically at least, than the others. Those are two useful examples of using a median.

Mode is the last one, and that simply tells us which value occurs most often. If you have a look at my data set there, 9, 11, 12, 13, 14, 16, there's no mode, because each value occurs only once. The mode is the value that occurs twice, or at least more often than the others. If there were two such values, you'd have a bimodal data set; then trimodal, multimodal, depending on how many values there were that occurred most often, irrespective of what they were.

Let's look at measures of dispersion. Now, we can represent a whole data set with a single value. But think about this.
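Before going further, a quick aside to make the measures of central tendency concrete. This is a minimal sketch using Python's standard statistics module, not code from the lecture notebook. The helper name modes and the skewed and bimodal example lists are made up for illustration.

```python
from collections import Counter
import statistics


def modes(values):
    """Return the most frequent value(s); an empty list if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once, so there is no mode
    return [value for value, count in counts.items() if count == highest]


small = [4, 5, 6]
with_outlier = [4, 5, 10_000_000]
even_count = [9, 11, 12, 13, 14, 16]

print(statistics.mean(small))           # add them up, divide by how many there are
print(statistics.median(small))
print(statistics.median(with_outlier))  # the size of the largest value does not matter
print(statistics.median(even_count))    # mean of the two middle values, 12 and 13
print(modes(even_count))                # no value repeats

# Hypothetical skewed data (values made up): two large values pull the mean up,
# while the median stays with the bulk of the data.
skewed = [9, 10, 10, 11, 30, 43]
print(statistics.mean(skewed), statistics.median(skewed))

# Hypothetical bimodal data: 2 and 3 both occur twice.
bimodal = [1, 2, 2, 3, 3]
print(modes(bimodal))
```

The mode helper is written by hand because the lecture treats a data set with no repeated value as having no mode, and this sketch makes that case explicit by returning an empty list.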
I have these values: 9, 10, and 11. Their median is 10, and the average, or mean, is 10. And I have 2, 10, and 18. They also have a median of 10 and a mean of 10. So if we talk to each other about these two data sets, we say, "Remember that data set? Mean of 10, median of 10. And that other data set? Mean of 10, median of 10." But really those are two completely different data sets. We can't just use a single value to represent this data. There's another type of value that we want to add, just to give each other some idea of how spread out that data was.

Now, there are various measures of dispersion, in other words, ways of telling ourselves how far spread the data was. The first one is simply the range. This works very well, for instance, when you just want to tell people what the age range was in your data set. You take the minimum value and the maximum value, you subtract the minimum from the maximum, and that is your range. Very simple.

The ones that we really want to deal with are variance and standard deviation. What are variance and standard deviation? It's much easier to talk about standard deviation. Variance is the one that you calculate, but it is just the square of the standard deviation; or, the standard deviation is the square root of the variance. And why do we have this square issue?

Let me show you on this little graph. Don't worry about this code. All I want you to know is that there's our white cell count, and we have this list: 10.1, 12.4, 13.1, 14.6, 9.9, 10.3, 11.1, 12.9, 10.9, 12.7. I can clearly work out an average for those, but I can also plot them on a line, and that's the code I've written here. So don't worry about it. Let's just execute that, and there we go. In green, there was our 9.9, probably down here, the one just over 10, and this one, 14.6 or whatever it was, is there. I can put them all on a line, all the green dots there. And I can also ask it to draw the average one there.
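The numbers in this part can be checked with a short sketch. This is not the plotting code from the notebook, only a numerical check written with Python's standard library. The function name summarise is made up, and the fourth white cell count is taken as 14.6, which is an assumption because the recording is unclear on that value. The population versions are used here, which divide by the number of values; pandas' .std() divides by one less than that by default, so it reports a slightly larger number.

```python
import math
import statistics

# Same centre, different spread.
narrow = [9, 10, 11]
wide = [2, 10, 18]


def summarise(values):
    """Return the centre and spread of a list of numbers (population versions)."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "variance": statistics.pvariance(values),
        "std": statistics.pstdev(values),
    }


narrow_summary = summarise(narrow)
wide_summary = summarise(wide)
print(narrow_summary)
print(wide_summary)

# White cell counts; the fourth value is assumed to be 14.6.
white_cell_count = [10.1, 12.4, 13.1, 14.6, 9.9, 10.3, 11.1, 12.9, 10.9, 12.7]

mean_wcc = sum(white_cell_count) / len(white_cell_count)
# Distance of each value from the mean, squared so that negatives do not cancel.
squared_distances = [(value - mean_wcc) ** 2 for value in white_cell_count]
variance_wcc = sum(squared_distances) / len(white_cell_count)
std_wcc = math.sqrt(variance_wcc)  # back to the original units

print(round(mean_wcc, 2), round(variance_wcc, 2), round(std_wcc, 2))
```

Both small data sets report a mean and median of 10, but the range and standard deviation of the second are far larger, which is the point of adding a measure of dispersion.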
That's where the mean is. So what is standard deviation? Standard deviation, very cleverly, does the following. What is the distance from this spot to that spot? There's the distance. So all I'm going to do is take 12 and subtract from that 11.1, whatever those two values were, and there's one distance. I'm going to take the next distance, and the next distance, and the next: the distance from here to all the points on both sides. I'm going to add up all those distances and divide by how many there are, and that'll give me the average distance my whole data was away from this spot. That is the standard deviation.

Now, of course, in this direction it'll be a negative distance and in this direction a positive distance. We don't want that. But remember, if you have a value of negative 2 and you square it, negative 2 times negative 2 is positive 4. That's where the variance comes in: we take all of these distances and square them, we add them all up, we divide by how many there are (that is the variance), and we take the square root of that so that we are back to an actual distance. That's one standard deviation. It tells us the average distance all our values are away from the mean.

So remember we had an example of nine, ten, and eleven. Those would be very close to each other, so our standard deviation would be very small. But if we had the other example, I can't remember what the values were, the spread in that data was quite far apart, so the standard deviation would be much larger.

So in the end we're going to have an average, and then we're going to have a standard deviation of so much. But then we can also work out what two standard deviations would be, what three standard deviations would be, and for that standard deviation
we can actually get a value. We can say, well, three standard deviations would be there. So there are the two blue blocks; that would be one standard deviation. Two standard deviations would be about out here, so it will be just under 15, etc. And the further out you go, the higher the likelihood is that you contain all your data. So standard deviation is just an average distance away from one average value, away from the mean.

Lastly, well, almost lastly: quartiles and percentiles. This works a little bit like medians. In other words, you're also going to take all your values, and you can work out the quartiles. "Quartile" stands for quarter, four. So you're going to divide all your values into four equally sized groups, irrespective of what the values are. You just want an equal number of values in each little group. You get the zeroth quartile, which is the minimum value, and the fourth quartile, which is the maximum value. That makes the second quartile, obviously, the median, and then you get the first and third quartiles in between. And just as the median had a value, each of those quartiles will have a value.

Percentiles are even more. Quartiles divide your data set into four; percentiles divide it into a hundred little boxes. Now you can be much more specific, but the 25th percentile would be the same as the first quartile, the 50th percentile, which is the median, would be the same as the second quartile, and the hundredth percentile would be the same as the fourth quartile, which would be the same as the maximum.

Okay, those are all quite easy. You can easily ask your computer program to do it. Remember, we imported our data. Let's just run our little mock data set there. We're just displaying, remember, .head().
We're displaying the first three values, and if we say .describe(), just for the ages: there were 150 patients, with a mean age of 30 and a standard deviation of 10. We can work out the range, because it will be the maximum, 67, minus the minimum, 18. But you also see there the 25th percentile, which is the first quartile; the 50th percentile, or the second quartile, or the median; the third quartile, the 75th percentile; and then the maximum, containing then all our values. You can play around with these. You can look at length of stay as well: median, standard deviation, minimum, etc.

So, just to recap: you'll have one value to represent all of your values, and you'll have another value that just gives you an idea of how spread out your data is. Excellent.
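The .head() and .describe() steps, together with the quartiles and percentiles, can be sketched roughly as follows. This is not the lecture's data file: the DataFrame is randomly generated mock data, and the column names Age and LengthOfStay, the seed, and the value ranges are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical mock data: 150 patients with made-up ages and lengths of stay.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "Age": rng.integers(18, 68, size=150),          # ages from 18 to 67
    "LengthOfStay": rng.integers(1, 15, size=150),  # days, made up
})

print(df.head(3))  # first three rows

# count, mean, std, min, 25%, 50%, 75%, max for the ages
summary = df["Age"].describe()
print(summary)

# Range: maximum minus minimum.
age_range = summary["max"] - summary["min"]
print(age_range)

# Quartiles 0 to 4 expressed as quantiles between 0 and 1.
quartiles = df["Age"].quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(quartiles)

# The 25th, 50th, and 75th percentiles are the first, second, and third quartiles.
percentiles = np.percentile(df["Age"], [25, 50, 75])
print(percentiles)
```

The same calls work on the other column, for example df["LengthOfStay"].describe().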