Well, I'm back. And so are you. Welcome to chapter 3.3, percentiles and box and whisker plots. It's Monica Wahee, Library College Lecturer. Here's what we're going to talk about, and what you're going to learn. At the end of this lecture, the student should be able to explain what a percentile means, describe what the interquartile range is and how to calculate it, explain the steps to making a box and whisker plot, and state how a box and whisker plot helps a person evaluate the distribution of the data.

So let's get started. You know, whenever we talk about a box and whisker plot, I think of some cute little animal with all those whiskers. I'll explain what the whiskers really are later. I mean, not on the animal, but on the box and whisker plot. So what are we going to go over? We're going to go over percentiles and explain what those are. Then we're going to talk about quartiles. Sounds a little similar, it's got the "tiles" in it. Well, you'll understand why they're similar. Then we're going to compute quartiles. And then finally, we're going to do the box and whisker plot. All right, so let's go.

So, percentiles. We're going to have a flashback. Okay, you're not going to like this little part, because it's going to remind you of standardized tests. Maybe not all of you have been subjected to this, but most of us have. If you've gone to high school in the US, you probably had to deal with standardized tests. Just remember, we're only talking about quantitative data, right? If you take a standardized test or a non-standardized test, you usually get points. Points are numerical, so that's quantitative data. I remember I used to take the standardized tests, and I'd be, you know, showing my friends what I got, because they'd send you that thing in the mail. Now, I learned pretty early on that it mattered who all was in the pool of people taking the test with you, right?
So if you're taking the test with a lot of stupid people, it's easier to get a higher percentile. Because what percentile means is, for example, if you test at the 77th percentile, it means you did better than 77% of the people taking the test. And a lot of those standardized tests didn't care how many points you got. What they cared about was what percentile you were at. So different batches of people would have different scores. And if you got lucky and got a lot of stupid people in your batch, then your score would be higher than theirs, and so would your percentile. So it didn't really matter what your absolute score was, it just mattered what your percentile was.

So just to sort of remind you, if somebody had come up to me in high school and said, "Oh, I got the 77th percentile," what I'd say is, "Okay, if only 100 people had taken the test, you'd have done better than 77 of them." Well, of course, we were all braggy-braggy. You know, I was always in like the 95th or the 97th or the 98th. And it happened so often, I wondered if it was really true. But what I realized is that there were so many people in the pool. Because, you know, I was in public high school in Minnesota, and they were pooling together all the public high schools in Minnesota, ninth grade, you know, and I was pooled with them in 10th grade or whatever. And when you're taking, like, nursing examinations, sometimes they'll do that, they'll put you on a percentile. So I try to tell people, you know, strategize, try to take them when only stupid people are taking them, which of course makes no sense. How can you tell when stupid people are taking it, right? You don't even know who's taking it. But really, that's what a percentile is. It's the percentage of people that you did better than. If you're at the 77th percentile, then you did better than 77%.

Okay, so here are just some rules about percentiles. First of all, you know how I gave the example of the 77th percentile.
Well, the rule is you have to have one between 1 and 99. Like, you can't have the negative second percentile or the 105th percentile. So that's the first rule. Then whatever number you pick, like I was saying, that percent of the values fall below the cutoff, and 100 minus that number percent of the values fall above it.

So here, we'll give an example. 20 people take a test, just 20, right? Let's say there's a maximum score of five on the test. The 25th percentile means that 25% of the scores will fall below whatever score that is, and 75% will fall above that score. So let's say it's an easy test. And let's say out of my 20 people, 12 get a four, which is almost the total, right? And the remaining eight get a five. So everybody gets either a four or a five. Well, then the 25th percentile, or the score that cuts off the bottom five tests, will be a four, just because this was an easy test and the first 12 people got a four and the remaining eight got a five. So even the 50th percentile would technically be at a four, right? Now, this would all come out differently if it were a hard test and most people got a score below three. Then the percentiles would be shifted down. I just tell you that so you can keep in mind the difference between the actual score and the percentile. The percentile just tells you that this percent of people got a score lower than yours. It doesn't actually say what your score was, right? So that's what you want to remember as we go through percentiles.

Okay, now we're going to talk about quartiles and also the interquartile range. Remember the "tile" thing. So this relates to percentiles. I put a little quarter up there. Quartiles are a specific set of percentiles. And you'll see why I put the little quarter up there. It's because there are technically four quartiles.
It's just that the top quartile doesn't count, because it's like the 100% one, and remember, percentiles can only go up to 99, like I was just showing you. So we calculate the first, second, and third quartiles. The 25th percentile is the first quartile. The 50th percentile, which is also known as the median, which you're already good at, right? That's the second quartile. And then the third quartile is the 75th percentile. So those are your quartiles: 25th, 50th, and 75th, and technically 100th, but we never say that, because it only goes up to 99.

And these are actually not that hard to calculate by hand. So here's how you do it, sort of an overview. Step one, you order the data from smallest to largest. Remember, we have quantitative data, so you can sort them. And this is feeling very median, right? Well, guess what, step two is you find the median, because the median is also the second quartile, which is also the 50th percentile. So you already know how to do this, right? Because you could already do steps one and two. Now, this is the harder part, the new part. Step three is where you find the median of the lower half of the data. Wherever you put your median, you pretend that's the end, you look at the smaller values, and you find the median of those. That would be the first quartile, or the 25th percentile. Then finally, step four, which you probably guessed: you find where your median was, you look at the upper half of the data between the median and the maximum, and you make a median out of that part of the data. That's your third quartile, or the 75th percentile. Okay, and I'll show you an example of us doing that, but this is an overview of the steps.

Now, remember the range from before, what the range was?
Yeah, you remember, that's where we had the maximum minus the minimum, right? And I told you, you have to actually work out the equation and tell me what number you get. And that's the range. Well, we have something new and improved in this lecture: the interquartile range. Okay, so you already know about quartiles, we were just talking about them. "Inter" sort of means like between, right? So once you have the third quartile and the first quartile, you can calculate the interquartile range, or IQR for short. So if you see IQR on here, just remember that's interquartile range. It's the third quartile minus the first quartile. And again, I'll show you an example. This is just an overview.

Okay, here's the example I promised. On the right side of the slide, you will see a sample of data I collected. I went to ahd.com, that's American Hospital Directory. It provides publicly available information about American hospitals. So I went in and took a random sample of 11 Massachusetts hospitals. There are a lot more, so I took a random sample. And what I did was write down how many beds each of those hospitals had. Because if a hospital has several hundred beds, it's considered kind of a big hospital, and if it has fewer than 100 beds, it's considered a smaller hospital. So I wrote all those numbers down. And then I already did step one of making our quartiles, which is to order the data from smallest to largest. So you'll see on the right side of the slide, my smallest hospital had only 41 beds, and my largest hospital had 364 beds, and see, I put all of them in order there on the right. So we already did step one.

So let's go on to step two. Step two is to find the median, and that's quartile two, or the 50th percentile. Now you're already good at that, right? And so we have 11 hospitals.
So we know that the sixth one in the row is going to be the median, you know, because it's an odd number of hospitals that I drew. And so the sixth one, we'll circle it. That's the 50th percentile, or the median. So we already got quartile two. It's funny that you have to start with quartile two, but that's what you have to do. Now I just color coded these so you could kind of remember what's going on as we do the other steps. 126 is the median. That's kind of not on anybody's side: it's not on the low side and it's not on the high side. The orange ones are considered below the median, and the blue ones are considered above the median. I just color coded them so you can keep track of what's going on in the next slides.

Okay, now we're going to do the 25th percentile for step three. The goal is to find the median of the lower half of the data. So now you see why I color coded it: now we're pretending just the orange ones exist, and we are just finding the median of those. And we're not counting that 126, because that's already been used. And so now we find that 90 is the 25th percentile. How do you remember that it's not the 75th, that it's not the third quartile? It's because it's a low one. 25 is a low number, and 75 is a higher number. So you go to the lower part of the data, you find the median of that, and that's going to be your 25th percentile. In our case, that's 90.

Then, you probably guessed it, you go to the blue ones, the upper half, and you get the median out of that. And so ours is 254. That's our 75th percentile. So what we just did is calculate our quartiles: we have our 50th percentile, our 25th percentile, and our 75th percentile. That's what I meant by that overview slide, and this is an example of how you would do it.

And of course, I have to give a shout out to the IQR, which is the interquartile range. Remember, you just learned that. So that's the 75th percentile minus the 25th percentile.
So in our case, that's going to be 254 minus 90, which equals 164. That is your IQR. So if I gave you a test and asked you what the IQR is for these data, you can't just put 254 minus 90, you actually have to work it out and put 164. So there you go. That's our quartile example.

So I just wanted to step back and give you some philosophical points on what happens with Q1 and Q3, depending on how many data points you have. Remember, the first step of this is always to put them in order from smallest to largest. So let's pretend I had only drawn the first six values of my hospitals. See how on the slide I put the position of each number, which is 1, 2, 3, 4, 5, 6, above the example numbers. So let's say I was going to do the median on that. What I'd have to do is take 90 plus 97 and divide by two, which is 93.5. But then the next question is, what do we do for Q1 and Q3? Well, in the example of having six values, the 90 and the 97 are mushed together for the median, so they don't get used up. They get reused when looking at the bottom and the top halves of the data. So when we go to do Q1 here, we would actually count that 90 in there. In fact, Q1 would be 74, because that's the median of the three numbers below the median, right below that line. And then Q3 would be 121, because we actually count the 97 in there. So in other words, when you have six values and the median is made by mushing together two values, like taking the average of those two values, those two values get to double dip. The bottom one gets to be in the bottom half, and the top one gets to be in the top half, when calculating Q1 and Q3.

Now, what if we had seven values instead of six? Okay. So I just expanded and pretended we had seven hospitals. And you'll see that I have seven positions there.
Well, this is a little like the one we did together with the 11 values, where the median was clearly one of the values. In this case, it's 97. So that 97 does not get reused in the bottom and the top. You'll notice that Q1 is the middle number of the bottom three, and Q3 is the middle number of the top three. That's what happens when you have seven values. And it also happens when you have 11 values, like I demonstrated with those hospitals.

But it's not super predictable, because what if you had eight values? We suddenly see it gets a little complicated. So how will we do this? Well, see, the first four are between 41 and 97, and the top four are between 121 and 155. To make our median, we'd have to take the mean of 97 and 121. But remember, they don't get used up. The 97 gets to double dip and be part of the calculation for Q1, and the 121 gets to double dip and be part of the calculation for Q3. But even with this double dipping, if you go down, you'll see that there are four numbers to contend with for Q1. So to get Q1, you actually have to mush together, or take an average of, 74 and 90. And if you go up to the upper part of the data, to get Q3 you're going to have to take an average of 126 and 142, the ones in position six and position seven. So if you're unlucky enough to get eight values, you realize you're going to have to make your median by averaging two numbers, your Q1 by averaging two numbers, and your Q3 the same way. So it's not super predictable what's going to happen. You just have to pay a lot of attention. Just remember, if your median is made out of two numbers averaged, those numbers get to double dip in the downstairs and the upstairs when calculating Q1 and Q3.
If instead your median is just one number, because you have an odd number of values, then that guy has to just stay there and does not double dip in the Q1 and Q3 calculations.

So we can see another example of this. This is nine values, right? Now remember, when I had 11 values, it was like having seven values: I had this median, and it was really clear, like we have here. And even for the medians of the top and the bottom of the data, it was an odd number, so it was easy to figure out. Well, you see here, in this case, our median is the fifth value, and that's 121. So 121 does not double dip anywhere, right? So when we go to calculate Q1, we only have four values, because we're not counting the 121. And then we're stuck with taking an average of the second and third values to get Q1. Same thing upstairs here: 142 and 155 are the two middle numbers of our four numbers at the top, and we have to take an average of those to get Q3.

So I guess this is just my long way of saying you've got to be really careful what you're doing. First, make sure you've gotten the median. Then figure out whether that median is the kind where you're just circling a value, or a median that came out of an average. If it's a median that came out of an average, just know that those numbers are going to double dip in Q1 and Q3. And if it's a median you got because you had an odd number of data points, so it was just the one in the middle, that one doesn't get to double dip. Okay, enough double dipping, I'm getting hungry. When I go to that roller coaster, I'm going to get a double dip ice cream cone.

Okay, we're going to move on to the box and whisker plot, which is kind of like your percentiles getting graphed, right? So let's go back to our ingredients. We already created our box plot ingredients.
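Before the plot itself, here is a minimal Python sketch of the median-of-halves procedure just walked through, including the "double dip" rule. The function name `quartiles` is made up for illustration, and the list `beds` is pieced together from the bed counts quoted in the examples above; the tenth of the 11 ordered values is never read out in the lecture, so the sketch only uses the first nine. Statistical software often uses other quartile conventions, so its answers may differ slightly from this hand method.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the hand method from the lecture.

    Odd count: the single middle value is the median and is left out
    of both halves. Even count: the two middle values are averaged for
    the median, and each one stays in its own half (the "double dip").
    """
    data = sorted(values)                 # step 1: order smallest to largest
    n = len(data)
    half = n // 2
    lower = data[:half]                   # values below the median position
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

# First nine ordered bed counts, reconstructed from the numbers quoted above.
beds = [41, 74, 90, 97, 121, 126, 142, 155, 254]

for count in (6, 7, 8, 9):
    q1, q2, q3 = quartiles(beds[:count])
    print(f"{count} values: Q1={q1}, median={q2}, Q3={q3}")
```

With six values this gives Q1 of 74 and Q3 of 121, and with seven values the median is 97, matching the numbers worked out by hand above.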
In fact, that's why I trickily went through those quartiles first, because now we've created our ingredients to make a box plot. So I just sort of summarized what we have on the left side of the slide (say that 50 times). Hospital beds was what we were counting. The smallest hospital had only 41 beds. This is a little easier because I put it in order: Q1 was 90, the median, Q2, was 126 (you know what I mean by these Qs, I mean quartile, right?), then Q3 is 254, and the maximum was 364. Okay, so let's make a box plot. And you remember what the data looks like, on the right side of the slide.

Okay, well, now I'm going to walk you through how you would make this box plot. First you draw this thing. Well, how do you know what to draw? I usually just draw a vertical line and put a zero at the bottom. And then I cheat. I go look at the maximum and go, oh, I wonder where that is. See, our maximum was 364, so I just made 400 the top. If our maximum had been something like, you know, I think Massachusetts General Hospital has something like 600 or 800 beds. If we had gotten that one in there and that was our maximum, I would maybe go up to 900. Whatever is a little bit above the maximum, that's what I put at the top. So this was 364, so I put 400. Then I divided it in half, see where the 200 is, I just kind of threw that in there. And then I divided between the 200 and the 400 in half and put the 300. So you can just kind of eyeball this and draw it out that way if you want.

Okay, so I got this thing set up. And here we go, here's the first thing: we're going to draw in Q1, or quartile one. On the left side of the slide, you'll see I circled that; that's 90. On the right side of the slide, I made this horizontal line. Now how wide do you make that line?
Well, look at how it's proportioned to that up-and-down graph thing I made, you know, with the numbers. You probably don't want it too wide, but you don't want it too skinny. This is just about right, like Goldilocks, just right. Okay, so you just make this horizontal line at Q1. That's the first step. Now you make a copy of that same line, parallel, and you make it at Q3. I hope you're not lost. If you look at that 100, 200, 300, 400, you know, Q1 is 90, so it's about 10 under 100. That's how I knew where to position that lower one. And then 254, that's a little bit higher than halfway between 200 and 300. So that's where I roughly knew how to position this one. It's not perfect. If you do it in statistical software, it plots it out and it's perfect. But for demonstration purposes, that's what I'm doing. Okay, so now what we've done is put in Q1 and Q3, these horizontal lines that are parallel.

All right, here's the next step. We connect them. Hence the box. The box gets made when you just connect them. Now I put a little circle on the right side of the slide because I wanted to make sure you saw what's going on there. That's where we put in Q2, or the median. The median is 126. See where 100 is? It's up a little bit from there. And we make that line parallel too. You see how I made Q1 and Q3, connected the box, and then did the median? I think this is the easiest order to do it in when you're drawing it by hand and you're not the statistical software. That way, this box is all nice, and then your median fits, and everything looks nice.

But we're not done yet. We've got the whiskers. You're probably wondering this whole time, what is this whisker thing? Well, you just figured out what the box is. The whiskers are the markers for the minimum and the maximum. So you'll see the minimum's at 41.
And then we have a whisker at 41. So why is it called a whisker? Well, it's smaller. I don't know why it's called a whisker, but it's different from the other lines, because it's smaller. I guess that's a reason, maybe. But notice how it's almost half the size. Sometimes they're really, really small. But it's tiny. And you want to position it in the middle, like you don't want it off to the side or anything. And you also want these parallel. You'll notice the maximum's up there, way high, at 364. So I just did both of these on the same slide. So you draw on the whiskers, and then you can probably guess the last step: you connect the whiskers to the box. So good job, there, you went and did it. You made a box plot.

And now let's look at the interquartile range. Remember how you calculated this? You took Q3 minus Q1. Well, that means this boxy thing is 164 beds long, right? So that's where your IQR is. This is a visual of your IQR. So very good. We did our box plot. We did our interquartile range. And you are probably wondering, why did we just do this? I'll explain.

So why do we do this? Well, one of the main things we do is look at the distribution of the data. I know, I know, you guys learned how to do a histogram already, and you're good at a stem and leaf. Those are other ways of looking at the distribution. And if you make a histogram of these data, well, I mean, these are only 11. But you know, if you get a pile of data and you make a histogram and a stem and leaf, you'll find that those images agree with the box plot. And you're probably thinking, well, how do they agree? Well, if you look on the right side of the slide, I'm just giving you an example.
So, skewed right. If you had skewed right data, and you knew it because you made a histogram and saw a skewed right distribution, and you took the same data and made a box plot, it would be kind of like that skewed right one that we just did, where the top whisker would be really high, and that thing connecting the whisker to the box would be really long, whereas the one on the bottom is short. As you can see, skewed left is the opposite, right? The bottom one is long, and the top one is short. If you have a normal distribution, remember, that's symmetrical, that's that mound-shaped distribution. And if you have a larger spread, in other words, a bigger standard deviation, a bigger variance, then you're going to see a box that's really big like that. But if you have a smaller spread and it's a normal distribution, you're going to see a box that looks like this.

And you're probably wondering, where are you getting these shapes? Well, I'll show you on the last slide here as we wrap up with the conclusion. It's because if you fly over a roller coaster, like, see this roller coaster? This roller coaster is skewed right. That would make sense, right, because you want to go up steeply and then go down really fast. And see how the box plot for the roller coaster looks. You've got the part where you start going up really fast. That's kind of near the median and kind of near the 25th percentile. And the part where you're just getting on and it's slowly going, that's like the bottom whisker. And then you go up and you come down, and it's a long tail, which is good, I guess, if you design roller coasters. And that long tail is that right skew. So that's what I mean, if in your mind you're going, how is she getting this histogram and this box plot?
This is kind of how I'm doing it. I'm saying, well, if you flew over the histogram, or the roller coaster, you might see the shape of a box plot.

So in conclusion, we talked about percentiles in general, like the 77th percentile and what that all means. Then we focused in on quartiles, which are a specific set of percentiles. And then we calculated the quartiles. The reason we did that is because we first needed them in order to make the interquartile range. And finally, we needed those quartiles in order to make and interpret a box and whisker plot. Okay, this isn't the roller coaster I'm going to, but I'm going to one, and I guarantee you it is skewed right.
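To tie the "make and interpret" part together, here is a rough Python sketch that takes the five ingredients from the hospital example (41, 90, 126, 254, 364), lays them out on a 0 to 400 axis as a one-line text box plot, and then reads the skew off the whisker lengths. The function names `ascii_boxplot` and `whisker_skew` are invented for this illustration, the plot is drawn sideways rather than up and down as on the slide, and the skew check is only a crude rule of thumb based on the "long top whisker means skewed right" idea from the lecture, not a formal test.

```python
def ascii_boxplot(minimum, q1, med, q3, maximum, axis_max=400, width=40):
    """Draw a one-line horizontal box plot on an axis from 0 to axis_max."""
    if maximum > axis_max:
        raise ValueError("pick an axis_max a little above the maximum")

    def col(value):
        # Scale a data value to a character column, like eyeballing the axis.
        return round(value / axis_max * width)

    row = [" "] * (width + 1)
    for i in range(col(minimum), col(maximum) + 1):
        row[i] = "-"                      # lines connecting whiskers to the box
    for i in range(col(q1), col(q3) + 1):
        row[i] = "="                      # the box runs from Q1 to Q3
    row[col(minimum)] = "|"               # lower whisker at the minimum
    row[col(maximum)] = "|"               # upper whisker at the maximum
    row[col(q1)] = "["
    row[col(q3)] = "]"
    row[col(med)] = "M"                   # median line inside the box
    return "".join(row)


def whisker_skew(minimum, q1, q3, maximum):
    """Crude shape check: compare the lower and upper whisker lengths."""
    lower = q1 - minimum
    upper = maximum - q3
    if upper > lower:
        return "skewed right"
    if lower > upper:
        return "skewed left"
    return "roughly symmetric"


summary = {"minimum": 41, "q1": 90, "med": 126, "q3": 254, "maximum": 364}
print(ascii_boxplot(**summary))
print("IQR:", summary["q3"] - summary["q1"])
print(whisker_skew(41, 90, 254, 364))
```

For the hospital sample the IQR line prints 164, and the upper whisker (364 minus 254, or 110 beds) is longer than the lower one (90 minus 41, or 49 beds), so the check reports skewed right.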