Good day, everyone. This is your lecturer, Monica Wahee, and we're going to start now with section 1.1: What is statistics? So here are our learning objectives for this lecture. At the end of this lecture, the student should be able to state at least one definition of statistics (yes, there's more than one), and give one example of a population parameter and one example of a sample statistic. Also, the student should be able to classify a variable as quantitative or qualitative, and as nominal, ordinal, interval or ratio.

So what we're going to cover in this lecture is, first, some definitions of statistics. Like I said, there's more than one, but they all sort of relate to the basic concept of why you're doing statistics and not math. So what's the difference, right? Then we're going to go over population parameter and sample statistic, and you'll know what those mean at the end of the lecture. And finally, we're going to go over classifying levels of measurement.

So let's start with the definition of statistics. We're going to go over what it is, and I'm also going to define for you the concept of individuals versus variables. You may know definitions for those words already, but I'm going to give them to you in statistics-ese. And then I'm going to give you examples of statistics, individuals and variables in healthcare.

So here are the definitions. What is statistics? Statistics is the study of how to collect, organize, analyze and interpret numerical information and data. Well, that sounds pretty esoteric, right? But think about it: even something simple, like looking on Yelp, involves this. You look on Yelp at the restaurant you want to go to, and some people say five stars or four stars, but there are a few two stars and one stars. Well, do you go? There's a whole bunch of different answers. So how do you decide? You kind of have to analyze it.
You kind of have to interpret it. So it's not that easy. Statistics is both the science of uncertainty and the technology of extracting information from data. In other words, if you've got a bunch of data about a restaurant, you still don't know for sure how it's going to be if you actually go there, right? So it's a science of uncertainty. If you look on Yelp and you see almost everybody giving it four or five stars, maybe it's going to be good for you, right? But you don't know; maybe there's new management. That's the uncertainty.

So statistics is used to help us make decisions, not just whether to go to the restaurant or not, but more important ones, such as in healthcare and public health. Well, I guess if it's an expensive restaurant, maybe it's important. But anyway, in healthcare and public health, you really need statistics because they really guide you. For example, let's think of the Centers for Disease Control and Prevention in the United States. So what do they do? They spend the whole year studying the different flu viruses that go around, because there's more than one. They organize, analyze and interpret numerical information and data about the different influenza viruses that are going around. They extract that information. And you know what decisions they make? They decide which viruses to include in the next year's vaccine. Are they always right? Sure enough, they're not. Have you ever had a year where you're like, oh my gosh, everybody I know got vaccinated and they're still getting sick? Well, give them a break. It's the science of uncertainty; it just didn't work out that time. However, this is probably better than just randomly guessing, right? So that's statistics for you.

Now, I promised I'd tell you the statistics-ese version of individuals and variables.
Now, if you're outside statistics, you know that individuals are people, right? And you know that a variable is a factor that can vary, like when you say, "The only variable is I don't know what time something's going to happen." But when you enter the land of statistics, there are specific meanings to these two words. Individuals are the people or objects included in a study. So if you're going to do an animal study with some mice in it, those would be the individuals. If you do a randomized clinical trial and you include people who have Alzheimer's in it, then patients are your individuals. But we do a lot of different things in healthcare. We sometimes study hospitals, like the rate of nosocomial infections, in which case, if you're looking at a whole bunch of different hospitals, those would be the individuals. Sometimes we look at states, such as rates of infant mortality in different states. In that case, states would be the individuals.

So, as you can see at the bottom of the slide, a variable is a characteristic of the individual to be measured or observed. I give some examples on the slide. Like I was saying, if you wanted to study hospitals, I gave you the example of the rate of nosocomial infections as a variable. You could also have other variables about that individual, the hospital, like the rate of in-hospital mortality. So one of the things we do in statistics is sit down and decide: who are going to be the individuals that we measure, and what variables are we going to measure about them?

So I just threw up here a few examples of different kinds of individuals that we use a lot in healthcare and public health, and an example of just one variable about each of them, though there would theoretically be many variables about them. And I just want you to notice that a lot of times the individuals are geographic locations.
Other times they might be institutions, like I said, such as hospitals or clinics or programs. There are other things they can be, but these are the big ones. So, just to review what I went over, statistics is used in healthcare and other disciplines to aid in decision making, like the example I gave of the CDC and their vaccine for influenza. And so it's really important to understand statistics, because you need to understand these processes in healthcare: not only what do we do, but how do we figure out what to do? That matters because we use statistics a lot in healthcare.

Now we're going to move on to talk about what a population parameter is and what a sample statistic is. We're going to go over first the definition of a population and the definition of a sample, so you're sure about what those mean. Then we're going to talk about the data about a population and the data about a sample and how those are different. And then we're going to get into parameters and statistics, and I'll give you a few examples.

So let's start with: what is a population? Again, this is another case where you have a normal old word, but it has a special meaning in statistics. It's a group of people or objects with a common theme, and when every member of that group is considered, it's a population. So here's just one example. The theme would be nurses who work at Massachusetts General Hospital. The population, if that was your theme, would be the list from human resources of every nurse currently employed at MGH. Now, it really does depend on how you define that theme. I could have said nurses who belong to the American Nursing Association, right? And then we'd be looking at a different list. I could say nurses who live within the city limits of New Orleans, right?
Then we'd be looking at a different population. So it really has to do with the details of how you describe the theme around that population. But the point is, once you describe that theme, the population is every single individual in there.

So then what is a sample? Well, it's a small portion of that population. It can be a representative sample, but it can also be a biased sample, and we're going to get into that. So let's go back to MGH, and let's say we were going to survey a sample of the population of nurses at MGH. Let's say we only surveyed nurses in the intensive care unit. That would be a sample, but not a representative one. It would probably be more representative if we asked at least one nurse from each department. So I just want to get in your head that the whole concept of a sample is that it's a small portion of the population, and it's not a portion of some other population; it's just that one. But the problem is you can get a biased one or a representative one, so you have to think about that.

When you think about it, if you've got a whole population, then you would get variables about each individual in that population, and those variables would be your data. But if you chose a sample, just a portion, it would be a lot less work, right? You'd still have to get variables about those individuals, but there are way fewer individuals, so it would probably be easier.

So in population data, data from every single individual in the population are available, and that's called a census. I knew a person who decided to do a survey of every single professor at a college. She didn't take just some professors from each department. She sent the survey to every single professor. So she did not use a sample; she used a census. But in sample data, the data are only available from some of the individuals in the population.
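As an aside, the census, the biased sample and the more representative sample described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the staff list, the department names and the choice of three nurses per department are all made up for the sketch and do not come from any real hospital.

```python
import random

# Hypothetical population: every nurse on a made-up staff list.
# Department names and head counts are invented for illustration.
population = [
    {"nurse_id": i, "department": dept}
    for i, dept in enumerate(
        ["ICU"] * 20 + ["Emergency"] * 30 + ["Pediatrics"] * 25 + ["Oncology"] * 25
    )
]

# Census: data from every single individual in the population.
census = list(population)

# Biased sample: only nurses from the intensive care unit.
icu_only_sample = [n for n in population if n["department"] == "ICU"]

# More representative sample: a few randomly chosen nurses from each department.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
departments = sorted({n["department"] for n in population})
per_department_sample = []
for dept in departments:
    members = [n for n in population if n["department"] == dept]
    per_department_sample.extend(rng.sample(members, 3))


def departments_covered(group):
    """Return the set of departments that appear in a group of nurses."""
    return {n["department"] for n in group}


print("census size:", len(census))
print("ICU-only sample covers:", departments_covered(icu_only_sample))
print("per-department sample covers:", departments_covered(per_department_sample))
```

Both samples are portions of the same population, but only the second one reaches every department.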
So if we go back to the researcher I described, if she had only taken some of the email list of the professors at that college, then she would have been surveying a sample. And that's actually very commonly done in research studies, especially of patients. Why would you need to go get, for example, every kidney dialysis patient and study every single one? You only need a sample. And why is that? Because we have statistics.

So I'm going to give you a few examples of real population data in healthcare. You're probably familiar with Medicare. Medicare is the public insurance program in the United States for elders. Even my grandma was on Medicare when she was alive, and she was not a U.S. citizen. She was from India. So we really do a good job of covering our elders in the U.S. with Medicare. In fact, I even read a statistic that said almost 100% of people age 65 and over are in Medicare. If you download data from Medicare, they make it confidential; they just replace all the personal identifiers. But there's this thing called the Medicare claims data set, which covers every single transaction that happens. If you're in Medicare and you go get some treatment, that's in there. So it has all the insurance claims filed by the Medicare population. Because it has everybody and everything, that is population data. Also in the United States, every 10 years, the government hires a bunch of people to go out and survey a bunch of people, and they also send out a bunch of surveys. The idea is to try to get every single person in the United States to fill out that survey. And that's called the United States Census.

So now I'm going to give you sort of a mirror image with sample data. Remember how I was just talking to you about Medicare? People who are enrolled in Medicare are called Medicare beneficiaries, and Medicare cares what they think. So they do a survey of a sample of individuals on Medicare.
And they do this kind of often. I think they do it once a year. And it's a phone survey. They only do a sample because they're going to use statistics to try to extrapolate that knowledge back to the population of Medicare beneficiaries. Also, in case you noticed, the United States Census only takes place every 10 years. Do you think changes happen in between? Yep, lots of changes. Just think about Hurricane Katrina. That's very sad. It changed the population distribution in Louisiana very dramatically, and in other states around there too. So how did they keep up? Well, they used the American Community Survey. The government does this, through the United States Census Bureau. That, again, is done by phone, it's conducted yearly, and it's a sample. So the U.S. doesn't know exactly how many people are in Louisiana or anywhere else, but they can use statistics to extrapolate that from the sample of the American Community Survey.

I want to just do a shout-out to statistical notation. From now on, when you see a capital N, like N = 25, you can assume that the 25 is the size of a population. That's just kind of a secret code we use in statistics. However, if you saw a lowercase n, like n = 25, then you could assume that this was a sample of the population. Again, it's kind of like a secret code, and you have to pay attention: when I'm talking and I say "N" and you can't see whether it's uppercase or lowercase, you don't know if I'm talking about a population or a sample.

Now I'm going to get into the concept of parameter versus statistic. I want you to notice that the word parameter starts with P, "puh", like population. So a parameter is a measure that describes the entire population. For instance, anything that would come out of that whole Medicare claims data set or that whole United States Census would be a parameter.
On the other hand, statistic starts with S, like sample, and a statistic is a measure that describes only a sample of a population. Here again we have a situation where the word statistic is used daily on the news. Sometimes I hear on the news something like, "Oh, look at the rate of HIV in Africa. It's going up. That's a terrible statistic." I agree, it's terrible, but they mean parameter, because they're talking about all of Africa, every single person in Africa. If the rate of HIV is going up in Africa, they mean a parameter. They don't mean a statistic.

So here's an example of a parameter and a statistic that are based on the same population. The mean age of every American on Medicare is a parameter. That's every single person. However, remember the Medicare beneficiary survey? That's just a sample. So if we took the mean age of those people, we would just have a statistic. And again, you have to pay attention, because if you listen to the news, you'll hear them use the word statistic to mean both parameter and statistic. But when you're practicing in the field of statistics, it's very important to point out whether the number you're talking about comes from a population or from a sample. So you should really use the terms: this is a parameter if it's from a population, or this is a statistic if it's from a sample.

And so again, don't get confused. If you're listening to someone talk in a lecture or in a video, you might want to look for clues about whether a number is a population parameter or a sample statistic. One clue for a parameter is hearing that the data set they used encompasses an entire population, and usually that's the kind of thing done by governments. Remember when I was talking about the rate of HIV in Africa? That would probably be done by governments or the United Nations or the World Health Organization.
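To make the parameter-versus-statistic distinction concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the ages are randomly generated, not real Medicare data, and the population size of 10,000 and sample size of 200 are arbitrary choices. It follows the notation from the lecture, with capital N for the population and lowercase n for the sample.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: made-up ages for every individual (not real data).
population_ages = [rng.randint(65, 100) for _ in range(10_000)]
N = len(population_ages)  # capital N: size of the population

# Parameter: a measure computed from the entire population.
population_mean_age = mean(population_ages)

# Sample: a randomly chosen portion of that same population.
sample_ages = rng.sample(population_ages, 200)
n = len(sample_ages)  # lowercase n: size of the sample

# Statistic: the same measure, computed from the sample only.
sample_mean_age = mean(sample_ages)

print(f"N = {N}, parameter (population mean age) = {population_mean_age:.2f}")
print(f"n = {n}, statistic (sample mean age) = {sample_mean_age:.2f}")
```

The two means will usually be close but not identical, which is the uncertainty that comes with working from a sample.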
So when you're talking about numbers that might have come out of an entire population, usually collected by the government, that's probably a population parameter. A clue that someone's talking about a sample statistic is if you hear them talking about a study that recruited volunteers. If it's volunteers, they didn't get everybody in the population, so it's going to be a sample. The same goes for surveys, for instance public opinion surveys about who people are going to vote for. They're never going to ask every single person in a state, "Who are you going to vote for?" They'll just ask a sample. So if you hear about a survey, they might even tell you that n equals maybe a few thousand people, because that's all they surveyed. And that's a clue that we're talking about a sample statistic rather than a population parameter.

Now I'm going to talk about the difference between descriptive statistics and inferential statistics. But first I'm going to remind you what the word infer means. To infer means to kind of get a hint from something indirectly. It's the complement to imply. So if my friend implied that I should not call after 9 pm and I figured that out, I would say I inferred that I should not call my friend after 9 pm. Okay, so inferential is what I'm going to talk about next, but first I'm going to talk about descriptive.

Descriptive is pretty easy, because you can do it to samples and you can do it to populations, well, to variables from samples and populations, right? Descriptive statistics involve methods of organizing, picturing and summarizing information from samples and populations. It's basically just making pictures of it, right? Like, look at that bar chart. That's just a simple picture, and it can be made with just about any data. You get data from surveying people at work. You get data from surveying your friends about what they're going to bring to the potluck. Any of that can be used.
You can go download the census data, and you can make descriptive statistics out of that. But there's something very special about inferential statistics. It involves methods of using information from a sample to draw conclusions regarding the population. Therefore, inferential statistics can only be done on a sample. And that's why it's called inferential, right? Because the sample is going to give a hint about what the population is like. It's not going to say it directly, which is annoying, but that's that uncertainty thing I was telling you about. So the sample is going to imply something, or rather, we're going to infer something from the sample about the population. So that's what inferential statistics is: you take a sample and you infer something about the population. Whereas descriptive statistics is more loosey-goosey; you can just do that to samples and populations, kind of like making pictures out of them.

So in statistics, it's really important to properly identify measures as either population parameters or sample statistics, because, as you can see, you can only do inferential statistics on samples. You have to really know what you're doing and what you're talking about when you're doing statistics, because different types of data are used for parameters versus statistics.

Alrighty, now we're going to get into classifying variables into different levels of measurement. So remember our variables, right? We have individuals, and then we have variables about them. Those variables can only fall into two groups, quantitative versus qualitative. And then, depending on which group they fall into, you can further classify them as interval versus ratio, or nominal versus ordinal. And I'm going to give you some examples of how to classify a few healthcare variables. Alrighty, so I like to draw this picture. It's a four-level data classification.
I'll draw it slowly here for you. So we start with human research data. That's what I like to start with. All right, so we're going to split that into two. Remember I said that we're going to start by talking about quantitative. Another word that's often used for that is continuous, but we're going to use the word quantitative. So what does that mean? It's a numerical measurement of something. The slide gives the example of temperature, so something with a number in it. I always think, if I can make a mean out of it, it must be a quantitative variable, right?

And so here are some examples of quantitative variables. Take time of admit. Imagine that you work a shift in the ER from maybe 8 pm to midnight. So you have these four hours, and you could say what the average time of admit was for those who got admitted to the hospital. Somebody got admitted at 8 o'clock, and then somebody at 8:15, and so on; you could put that together and say what the average time was. Also, if you were doing a study and you were surveying patients with a particular condition like Alzheimer's disease, you could ask them their year of diagnosis, and then you could make an average of that. And so you know that that is quantitative. Systolic blood pressure is also numerical, and so is platelet count. These are variables we run into all the time in healthcare, so we're used to that. This is quantitative.

Now we'll get back to our picture. So that's one side. What if it's not quantitative? What else could it be? Well, the only other category it could be is categorical, or qualitative. I use the term qualitative, but some people use the term categorical. That's kind of what it is: it's a quality of something or a characteristic of something, like sex or race.
So here are some qualitative variables in healthcare. You can have type of health insurance, like whether you're on Medicare or Medicaid or different types of private insurance. Those are all just categories, right? You can't make a mean out of that. Also country of origin. If you're in a group of students and there are international students in there, well, what countries are they from? You can't make a mean out of that either.

You also have situations where there are numbers involved, like the stage of cancer. That's depressing: stage one cancer, stage two cancer, stage three. Well, you would never make a mean out of the stage of cancer. You wouldn't say, "The mean stage is 1.4," or something like that. It's just a category. And of course, stage four is a lot worse than stage one. They're not equal categories, but they're categories. Same with trauma center level, like a level four trauma center: you wouldn't make a mean out of the number after the term "trauma center," the level it is. But you could say, in the state, maybe so many percent of our trauma centers are level four trauma centers. So it's really just a categorical variable, even though there's a number involved.

All right, so let's get back to our diagram. We figured out how to take any variable and first split it into one of two categories: quantitative, if it's numerical, or qualitative, if it's a characteristic. Now we're going to concentrate on quantitative, because we're going to separate those variables into two categories. The first one we're going to look at is interval, and the second is ratio. So if you decide a variable is quantitative, then it could be interval or ratio, but not if it's qualitative. If it's qualitative, it doesn't get to do that. So let's look at interval versus ratio.
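Before the interval-versus-ratio split, the "can I make a mean out of it?" test described above can be sketched in Python. This is a minimal illustration with invented values: the blood pressure readings and trauma center levels below are hypothetical, not real data. The quantitative variable is summarized with a mean, while the qualitative one is summarized with the percent in each category.

```python
from collections import Counter
from statistics import mean

# Hypothetical data, invented for illustration only.
systolic_bp = [118, 132, 125, 140, 121]  # quantitative: numerical measurements
trauma_levels = [                         # qualitative: labels that contain a number
    "Level 1", "Level 2", "Level 4", "Level 4",
    "Level 3", "Level 4", "Level 2", "Level 4",
]

# Quantitative variable: a mean is a sensible summary.
mean_bp = mean(systolic_bp)

# Qualitative variable: count the categories and report percentages instead.
counts = Counter(trauma_levels)
percent_by_level = {
    level: 100 * count / len(trauma_levels) for level, count in counts.items()
}

print(f"mean systolic blood pressure: {mean_bp}")
for level in sorted(percent_by_level):
    print(f"{level}: {percent_by_level[level]:.1f}% of trauma centers")
```

Storing the levels as text labels rather than integers is a deliberate choice in this sketch, since it makes it harder to average them by accident.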
So on the left side of the slide, we have interval, which is where it's quantitative and the differences between data values are meaningful. And ratio has the same thing: the differences between the data values are meaningful. What do we mean by that? Well, remember how I was saying before that level one trauma center and level two trauma center are really categories and not quantitative variables? That's because the differences between them are not actually equal. Think of job classifications that go 1, 2, 3, 4, like nurse one, nurse two, nurse three, nurse four. I worked at a job where we had office specialist one, office specialist two and office specialist three. And you know what? Going from office specialist two to office specialist three was really hard. You really had to do a lot there. But to go from one to two wasn't that hard. So that was a categorical variable, right? Because the differences between the values were meaningless. The difference between OS one and OS two wasn't the same as the difference between OS two and OS three. Whereas when you're dealing with a quantitative variable, regardless of whether it's interval or ratio, you're talking about something like years or systolic blood pressure. One year for you is one year for me. So that's fine, right?

But here's where the difference comes in between interval and ratio. All quantitative variables have meaningful differences between their data values. The hair-splitting thing here is that in interval, there is no true zero, and in ratio, there is a true zero. And this is how I try to think about it. An interval means kind of like a space between two things. If you think of the word intermission, it's kind of like an interval: an interval of time during a show where you get to get up and go to the bathroom and get some coffee. So that's interval.
And so if you have something that's a space in between, that's not going to have a zero; it doesn't really start anywhere or end anywhere, it's in between. Whereas for ratio, how I remember it is, I don't know if you remember from high school, but you can't have a zero on the bottom of a ratio or a fraction. So I use that as a mnemonic that ratio is the one where zero matters: ratio has a true zero. But how does this work out in practice? I'll show you.

Let's go back to those examples I showed you of quantitative variables, because those are the only ones where we have to make this decision about whether they are interval or ratio. So these are the examples. Now, I'm going to remind you that ratio has a true zero. Remember that little mnemonic I gave, don't divide by zero. So ratio variables have a true zero. Well, let's think about it. It's not very pleasant to have a systolic blood pressure of zero, because you'd be dead. Same with the platelet count. But it is possible, right? But now, when we go on to interval, we can't have a zero time, like a zero time of admit, and with year of diagnosis there's no year zero. So, as you probably just guessed, ratio is where it's at in healthcare. There aren't a whole lot of times when we have interval data. But we do, anytime you have a time. So if you want to split your quantitative variables into either interval or ratio, you've got to keep in mind the difference between the true zero and the no true zero.

Okay, here's our handy-dandy diagram. We've just gone through one side of the tree, classifying quantitative data into interval versus ratio. Now let's pay attention to the other side of the tree, qualitative. So how do we split those? Well, we can split those into nominal versus ordinal. All right. So nominal applies to categories, labels or names that cannot be ordered from smallest to largest.
Okay, I kind of think of when they have an advertisement and they say, "For a nominal fee, you can do this." It means it's small, like there's almost no difference. And so that's why I say there's no difference: the categories can't go from smallest to largest, so they must be equal. That's how I remember it in my mind. But then ordinal applies to data that can be arranged in ordered categories. But remember that thing I was saying about quantitative: this is not quantitative, right? Because the differences between the data values either cannot be determined or are meaningless, like I was talking about with cancer. If you go from stage three to stage four, that's materially different from going from stage one to stage two. So you really can't determine those differences. So that's ordinal: it's arranged in categories that can be ordered from smallest to largest.

So remember our old friends that I threw up there before, these examples of qualitative variables in healthcare. Let's reflect on this: nominal cannot be ordered, right? So that would be more like type of health insurance and country of origin, because they could all be equal. Whereas ordinal, like stage of cancer and trauma center level, is going to have a natural order, even though the differences between the levels are meaningless, which is what makes it so different from a quantitative variable. That's why it stays on the qualitative side of the tree; it just gets labeled ordinal. So what you want to do, if you think you have a qualitative variable on your hands, is look for a natural order. If there is one, it's ordinal. And if not, it's nominal.

So all data can be classified as quantitative or qualitative. If you have a variable, that's the first split you can make. But once you do that, you can further classify it as interval, ratio, nominal or ordinal.
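The four-level classification tree walked through above can be written down as a small Python function. This is a sketch, not a definitive tool: the function name `classify_variable` and its three yes/no arguments are hypothetical names chosen for this illustration, and the answers for each variable are the ones given in the lecture.

```python
def classify_variable(is_numerical, has_true_zero=False, has_natural_order=False):
    """Classify a variable using the lecture's tree.

    First split: quantitative (numerical measurement) versus qualitative.
    Quantitative variables split on whether they have a true zero.
    Qualitative variables split on whether the categories have a natural order.
    """
    if is_numerical:
        level = "ratio" if has_true_zero else "interval"
        return ("quantitative", level)
    level = "ordinal" if has_natural_order else "nominal"
    return ("qualitative", level)


# The healthcare examples from the lecture, with the lecture's own answers.
examples = {
    "systolic blood pressure": classify_variable(True, has_true_zero=True),
    "platelet count": classify_variable(True, has_true_zero=True),
    "time of admit": classify_variable(True, has_true_zero=False),
    "year of diagnosis": classify_variable(True, has_true_zero=False),
    "type of health insurance": classify_variable(False, has_natural_order=False),
    "country of origin": classify_variable(False, has_natural_order=False),
    "stage of cancer": classify_variable(False, has_natural_order=True),
    "trauma center level": classify_variable(False, has_natural_order=True),
}

for name, (kind, level) in examples.items():
    print(f"{name}: {kind}, {level}")
```

Note that the person calling the function still has to make the judgment calls, such as deciding that stage of cancer is a category even though it contains a number.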
And it's really important to know how to classify data in healthcare, as you'll find out later, because depending on how you classify it, you might be able to do different things with it in statistics.

Alrighty, so what we went over was the definition of statistics, and we talked a little bit about why you use it and how you use it, especially in healthcare. We went over what it means to talk about a population parameter and a sample statistic, and we went over some examples of them. And then we talked about classifying variables into the different levels of measurement, and even talked about a few examples there. So I hope you enjoyed my lecture.