Hello everyone. Today we're going to talk about sampling and data: basically, introduce statistics, what it is, why we use it, how we use it, how we get information we can use for statistics, and what it can tell us. So first off, when do we use statistics? We use it really all the time. We get statistics all the time from things like newspapers, TV, and the internet. Think about the weather. It was recently quite hot in Korea, and we might hear on the news that this is the hottest year to date. Basically, they're using data points from prior years to make some claim about the current situation in terms of weather. On the internet you might hear statistics about some drug producing bad side effects, or some drug that cures every ailment in the world, and they give some statistics. So for example, 95% of people were magically cured by this pill. So we get statistics all the time, and what I'm also trying to illustrate here is that the statistics we get from newspapers, TV, and the internet might not be correct, first off, or complete, or completely understood. Let me rephrase this. We use statistics all the time, but there are a lot of problems with the quality of the statistics that we normally see in newspapers, on TV, and on the internet. Part of this is a problem of the people consuming these sources not knowing how to properly interpret data, but part of it is also a problem of the newspapers, TV, internet, and other sources attempting to display the data in a way that says something it doesn't actually say. So when do we use statistics? We use statistics whenever we're trying to make decisions.
We get them from a lot of different sources, and we get statistics about, for example, crime, sports, education, politics, real estate, and lots of other things. There can be statistics about pretty much everything you do. With most choices you make, you are thinking about the statistics at least in the back of your mind. So we can use statistics from different sources to ask, for example, is the crime rate in Seoul greater than the crime rate in Chuncheon? Is it safer to be in Chuncheon? We can use statistics to answer that question. We can use statistics to figure out which major at university is more likely to lead to a job in, I don't know, fighting crime, or a job overseas, or a job as a doctor, something like that. We can use statistics, or data points, to predict the likelihood of some outcome in the future. And there are statistics, again, about everything. We really rely on statistics to make most of our decisions. So really what we're doing with statistics is attempting to make decisions, and hopefully correct decisions: statistics helps you make a decision about the correctness of some statement. So, just for practice, how much do you sleep? I find that university students don't sleep nearly enough. You probably should sleep much more. But think about how much you sleep. We have this little chart here that shows how much people sleep. It runs between five hours and nine hours, and each circle here represents one person. We can see that the most common amount that people slept is about six and a half hours. So we kind of have this grouping between six and seven hours.
That's where really most people were falling with five being kind of an extreme and nine being an extreme and actually very few people sleeping eight hours, which is the recommended amount of sleep, by the way. So think about how much you sleep and think about the samples that were collected here. Each of the circles are what we call samples where, for example, one student was asked, how much do you sleep per night? And they responded and then they got a circle. If they said, for example, six and a half hours, then one of the circles represents one sample. So does your dot plot look the same or different from this example? So if you were thinking about your sleep, your sleep, for example, on a weekly average, let's assume that before you started this semester, whenever you were at home, you might have slept, for example, nine, 10 or more hours per night maybe. And then now that you are in classes, you probably sleep a little bit less. So think about your average weekly sleep. What do you think it would be? If you sampled the number of hours per night, would your chart look similar to this or not? So for example, in this case, maybe the nine hours, if this is all the samples from the same person, then maybe nine hours was on a weekend. Eight hours was on days they didn't have class. And then most of the time, if they had class or if they were studying or whatever, then they would have six, six and a half hours, maybe seven hours of sleep. So think about your sleep patterns. Would you make a plot or think about the sleep patterns of you and your friends, where you sample not only yourself on a weekly basis, but maybe you sample for one night all of your friends on that night. How much did they sleep? Would you get about the same plot or not? If you did the example in an English class with the same number of students, do you think the results would be the same? So in this case, you ask each student, how many hours did you sleep last night, right? 
Because we want to measure the same night that they slept. So if you did the same example in an English class with the same number of students, do you think the results would be the same? Why or why not? And in this case, we're looking at groups. So this is a statistics class. Let's assume that we sampled every student in the statistics class and we got this plot. Then we went to an English class, still on Hallam University campus, still Hallam University students, but we went to an English language class. Do you think that their plot would look the same or not? And what are some reasons that it might look different or what are some reasons that it might look the same? Well, some reasons that it might look the same are that they're all still Hallam University students, they probably have, for example, the same after school activities or partying or whatever it is that Hallam students do after classes. And on average, I think most university students sleep about the same amount of time per night with some extremes on both sides. So it would be a very interesting question or potentially interesting. Is there a difference in sleep patterns between statistics class students and English class students? And if there is a difference, what is that difference? What causes that difference? If there's a difference in sleep, then we can start to use statistics to ask other questions like if the English class students sleep more, do they actually do better in class? Do they do worse in class? Right? So there's a lot of different questions that can come up from this. But think about here, we are trying to measure two different groups to see how similar or how different those groups actually are. I'm just off the top of my head guessing most likely the plots would be very similar for these classes just because all university students tend to have relatively similar sleep patterns. 
Now if we asked, for example, the students and the professors about their sleep, how much sleep they got, I think it would be very different because those are two relatively different groups of people that have potentially different sleep patterns. Where do your data appear to cluster? And how might you interpret this clustering? So a cluster is where we have groups of data together. So on this plot, where's the data clustering? Basically, I see a cluster between, you can say between six and seven hours. You might include five and a half hours, but I would probably say clustering really starts at six and seven hours. And what does this mean? This means that the average sleep time is somewhere between there, most likely. So the majority of people or this group of people have something in common that, for example, the person at nine hours does not have in common. So the nine hour person is quite different from where this clustering is happening. And really the five hour person is quite different as well. So again, we can start to ask other questions. We have this clustering with kind of potentially the average or let's say what the normal part of the group is sleeping at, the number of hours that group is sleeping at. And then we have these kind of extremes. So then we can start to ask questions like, well, what's causing these extremes? Are these people overworked? Are they stressed? So they can only sleep a few hours a night? Or maybe they're stressed so they sleep more? What exactly is causing these kind of extremes outside of the clustering? So clustering can tell us a lot of things. In this case, yeah, I mean, it's just a way to figure out what is kind of your normal group within your data set. So some definitions. First off, what is statistics? Statistics deals with the collection, analysis, interpretation and presentation of data. Remember collection, analysis, interpretation and presentation of data. We have to collect raw data. 
So for example, like the plot we just saw, we sampled, we asked each person, how many hours did you sleep last night? So we are using that question to collect data. Once we have that data, we need to put it in a form that we can analyze or that we can say something about. So analyze, I mean, what does this data tell us already? Kind of explicitly or implicitly, what does this data tell us? And then once we've analyzed it, we can interpret what that actually means. So in the last case, we analyzed where the data points were clustering at. So once we understood where the cluster was, we analyzed the data, we know where the clusters are, then we can interpret what that clustering actually means. And then really we're interested in what the interpretation is because that tells us what we think is causing that, what attributes are causing this clustering or this interesting data point to happen. And then finally, presentation of data, we have to present our interpretation or present our results that way we can actually say what this data tells us or what this data should tell us. So statistics deals with the collection, analysis, interpretation, and presentation of data. If you don't have good data collection, you will have bad analysis and bad interpretation. If you have good data collection but bad analysis, you also have bad interpretation. If you have good collection and good analysis and bad interpretation, then your presentation might look good but might be false actually. So all of those things are actually very, very important. And really most people fail in statistics at the collection phase. There's a lot of techniques for analysis and analysis can be complicated, but really most people fail because they improperly collect data. And once you've collected data improperly or you have the wrong data, then anything that you base, any information you base off of that data is likely to be lower quality or just incorrect. 
Descriptive statistics is organizing and summarizing the data itself. Descriptive statistics is probably the easiest form of statistics that you can do. It doesn't really require a lot of analysis. You just have to organize the data and say something about the data itself, and we'll talk quite a bit about that. Inferential statistics is where it gets more complicated but also much, much more interesting than descriptive statistics. It's the formal methods for drawing conclusions from good data. Like I said, you have to have good, high quality data to be able to draw correct conclusions or interpret the data correctly. So with inferential statistics we're actually drawing conclusions about the data from some type of analysis. Statistical inference uses probability to determine how confident we can be that our conclusions are correct. Let me repeat that. Statistical inference uses probability to determine how confident we can be that our conclusions are correct. Now think about this. We can use probabilistic methods to determine how confident we can be that some statement we make is true. And that's the very powerful thing about statistics. If we have enough data, and we usually need quite a bit, and that data is very high quality, then we can make statements about the data and measure how confident we can be about them. So we can essentially measure how likely a statement is to be wrong or right. And we'll talk a lot more about inferential statistics and how we calculate how confident we can be about our conclusions. So next, probability. I think everyone knows this: it is a mathematical tool used to study randomness. A population is a collection of persons, things, or objects under study. We'll deal with populations quite a bit, so just remember that a population is a collection of persons, things, or objects under study. A sample is a portion or subset of the larger population.
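As a rough sketch of the population and sample definitions just given, the Python below draws a sample from a made-up population. The population of 1,000 ID numbers, the sample size of 100, and the seed are all assumptions for illustration, not data from this lecture.

```python
import random

# Hypothetical population: ID numbers for 1,000 people under study.
population = list(range(1, 1001))

# Seeded generator so the sketch is repeatable (the seed is arbitrary).
rng = random.Random(42)

# A sample is a subset of the population: here, 100 distinct members.
sample = rng.sample(population, k=100)

print(len(population))                 # size of the population
print(len(sample))                     # size of the sample
print(set(sample) <= set(population))  # every sampled ID is in the population
```

The point is only the relationship: everything in the sample comes from the population, and the sample is much smaller than the population.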
So for example, we might have a population, which is all Koreans, right? But if I'm going to do a study, I am not going to ask every single Korean in the world about my question. I'm not going to be able to find every single Korean. So I need to take a sample, which is a portion or subset of the larger population, which means if I ask a large number of Koreans, but not every single one, then that group, that subset of Koreans that I actually ask, serve as my sample for this study. Now, if we could ask the population, then our data would be very high quality, very accurate. But it's completely, in the case of asking every single Korean in the world, it's impractical, right? So in almost every study, you'll find that people are sampling because it's impractical to ask the entire population. There's a lot of problems and a lot of considerations that come with sampling, and we'll talk about them. But for now, understand that the population is the object or the person thing or object under study, and a sample is a portion or subset of that larger population that we actually ask or actually study. So sampling is selecting a portion or subset of the larger population and study that portion to gain information about the population. We're trying to understand here the population, right? But surveying or studying the entire population is most likely impractical, so we have to select a subset of the entire population to do our sampling, right? So sampling is selecting a portion of the larger population to study that portion to gain information about the overall population. So think of it like, let's say, every cancer patient. We want to cure cancer, and there's a new drug that cures cancer. We obviously can't test the drug on every single cancer patient, but we can do a study of, let's say, 100 cancer patients with a specific type of cancer. 
So our population is every person with that particular type of cancer, and our sample is the 100 people that we can actually test the drug on, right? So just remember there is a difference: the population is the greater thing, a sample is a subset of that, and sampling is actually asking, working with, or studying the subset to gain information about the overall population. Now a statistic is a number that represents a property of the sample, and we'll talk more about what a statistic is, but for now: a statistic is a number that represents a property of the sample, and a parameter is a number that is a property of the population, right? So here we're talking about, again, samples and populations. A statistic is a number that represents a property of the sample, which is the small group of people that we sampled, and a parameter is a number that's a property of the population, not just the sample, the entire population. When we sample, we're finding information or attributes about the sample. So our statistic describes the sample, but that statistic might not necessarily describe the population, depending on how we did our study. A parameter is a number that is a property of the population itself. Now a representative sample is a sample containing the characteristics of the population, and this is the critical piece, right? So we have a population, let's say, all Koreans in the world. We have a sample, which is 100 Koreans that I randomly found. If I walked around Hallam University campus and just took the first 100 Koreans I came across, is that sample representative of the entire Korean population? No, obviously not. And the reason is that the first 100 Koreans I find on Hallam University campus, first off, all go to the same university. They're most likely college age students and potentially some professors. They may or may not have worked before. It really doesn't describe the entire population of Korean people at all.
What it describes well is the population of Hallam University students. I won't even say university students in Korea; it just describes Hallam University students. So we really have to think about what sample group actually represents the population, what sample group has the characteristics of the population, and how we can make sure that we're finding samples that properly represent what we're trying to study. And that turns out to be very difficult sometimes, especially if the population is very diverse. So a variable, denoted by capital letters such as X and Y, is a characteristic of interest for each person or thing in a population. Think of a variable as some characteristic that we're interested in. How many Hallam University students have blonde hair, for example? Blonde hair might be defined as variable B or something like that. So we use variables to denote some characteristic that can change from one person or thing to the next, something that we want to look for in a population. Numerical variables are values with equal units, such as weight in pounds and time in hours. I think that's relatively straightforward. Numerical variables, just think of them as numbers, something we can actually measure. So weight in pounds, time in hours, time in minutes, age, the amount of time it takes to run around the track, the number of patients that go into Hallam University Hospital, something like that. Categorical variables are attributes that place the person or thing into some sort of category. So numerical variables are things that we can actually measure, and categorical variables are things that let us categorize. So for example, eye color, hair color, who has a Samsung laptop? It really could be anything, but we're trying to group whatever we're looking at based on some attribute. And we can't necessarily measure that attribute. You can't measure hair color.
I mean, you could potentially measure how yellow it is, what wavelengths are coming off of the hair, but really we're talking about categories. So for example, I don't know, Korean, Thai, American, Irish: those are different categories of nationality. And we want to count the people inside each one. So we can say, for example, four Irish, 10 Korean, five American, whatever, right? Here we count the number of people or objects in some category. So one is an actual measurement. The other is categorization, putting something into a category. And both are very, very powerful for doing statistics. And data is the actual value of the variable. So we have some variable, which again represents some characteristic of interest. Once we actually ask a person a question or collect a data point, we find out the actual value of the variable that we're interested in. Mean basically just means average. If you have, for example, three values x, y, z, you would add them up and divide by three to calculate the mean. We use the mean for a lot of different things, but it basically tells us the typical value, the average over all of the data that we've collected. Proportion is a part of the whole, often given as a percent. So for a population, let's say we collected samples from 25% of the entire population; then that is the proportion of the population that we collected samples from. We use proportion quite a bit whenever we're describing the data that we've collected. So for example, with a class size of 40, we have 22 men and 18 women. The proportion of men is 22/40, or 55%. The proportion of women is 18/40, or 45%. Okay, so those were some definitions, and I'm sorry we had to go through that, but usually definitions in English are a little bit different than in Korean, or at least I want to get you used to the terms in English.
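The mean and proportion calculations just described can be sketched in a few lines of Python. The class counts (40, 22, 18) come from the example above; the three values used for the mean are made up purely for illustration.

```python
# Three hypothetical values, standing in for x, y, z in the mean example.
values = [6.0, 6.5, 7.0]

# Mean: add the values up, then divide by how many there are.
mean = sum(values) / len(values)

# Proportion: part of the whole, using the class-size example.
class_size = 40
men = 22
women = 18

proportion_men = men / class_size      # 22/40
proportion_women = women / class_size  # 18/40

print(f"mean = {mean}")
print(f"men: {proportion_men:.0%}, women: {proportion_women:.0%}")
```

Note that the two proportions add up to 1 (100%) because every student in the class falls into exactly one of the two groups.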
So I thought I would go through them. They are also in the book, so make sure you read chapter one as soon as possible. So now some practice. We want to know the average amount of money first year college students spend at ABC College on school supplies that do not include books. We randomly survey 100 first year students at the college. Three of those students spent $150, $200, and $225, respectively. So what we want to know is, first off, what is our population? I'll let you think about that. What is our sample? What is our parameter here? What is our statistic? What is our variable? Whenever we're doing a study, we want to at least identify these things: the population, sample, parameter, statistic, variable, and of course the data. What data do we have once we've actually collected information? So here the population is all first year students attending ABC College this term. Now notice I'm very, very specific about first year students attending ABC College this term. If I said all first year students, that would potentially mean all first year students in the country or the world or wherever. If I said students attending ABC College this term, that would mean freshmen, sophomores, juniors, seniors, basically every student attending the college. I have to be very specific about who I am measuring. What is the population that I'm measuring? And in this case it is all first year students attending ABC College this term.
The sample could be all students enrolled in one section of a beginning statistics course at ABC College, although that sample may not represent the entire population. Let me go back. Yeah, the example says we randomly surveyed 100 first year students at the college. So I was thinking of something else when I wrote this. The sample here is the 100 randomly selected students, as long as they are first year students attending ABC College. The parameter is the average amount of money spent, excluding books, by first year college students at ABC College this term, and this parameter is what we are trying to determine. This is what we're trying to find out by doing this survey or this analysis. The statistic is the average amount of money spent, excluding books, by the first year college students in the sample. Remember, a statistic deals with the sample, not the population. The variable could be the amount of money spent by one first year student. So let X be the amount of money spent, excluding books, by one first year student attending the college. So here we have a variable, and each student will potentially have spent a different amount of money. For each student we sample, X is the amount of money that person spent, so X will change from student to student.
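To make the idea of a statistic concrete, here is a small Python sketch that averages the three dollar amounts mentioned in the exercise. Only those three values are given; the other 97 responses are not, so the result is an illustration of the calculation and not the actual sample mean for the 100 students. The variable names are my own.

```python
from statistics import mean

# The three dollar amounts mentioned in the exercise (excluding books).
# The other 97 responses are not given, so this is only illustrative.
x_values = [150, 200, 225]

# A statistic is computed from the sample; here, the average of these three.
sample_mean = mean(x_values)

print(f"average of the three reported amounts: ${sample_mean:.2f}")
```

The same calculation over all 100 responses would give the statistic we would use to estimate the parameter, the average for all first year students at ABC College this term.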
So our variable here could be the amount of money spent by one first year student, and the data are the dollar amounts spent by the first year students. Examples of data are $150, $200, and $225. So the data that we've collected are all of the dollar amounts, and each dollar amount is a value of the variable. From these we can calculate some statistic about the overall sample, these 100 randomly selected students, to estimate a parameter of the actual population. So more definitions. Qualitative data is the result of categorizing or describing attributes of a population. Again, with qualitative data we're dealing with categories, not necessarily measurements. Quantitative data is the result of counting or measuring attributes of a population. With quantitative data we're dealing with numbers, with measurements. Quantitative discrete data is data that's the result of counting, for example the number of books in a backpack, and quantitative continuous data is data that's the result of measuring. Both of them of course give us some numerical value, so they are quantitative, not qualitative, but we can differentiate between discrete and continuous data. So some ways of showing qualitative data, qualitative not quantitative, as in categories, are for example pie charts and bar graphs. We can say such a percent of the population has blonde hair and such a percent has black hair, and we can have a pie chart that shows those percentages or a bar graph that compares the two. So in this case we have pie charts of, let's say, full time and part time workers. Part time, we can see, is the majority, at least on the left hand side; actually for both of them. And this just shows us the percentage difference between full time and part time. This is qualitative data. These are categories, two categories here: part time and full time.
Student status, again full time and part time. Qualitative data again, and we have a count of the number of students that are full time and part time at particular campuses. Percentages: you can use bar charts to show percentages, although they can be a little bit confusing. Think about who you're actually marketing this to. Sometimes it's just easier. For example, we have this Table 1.3: at the top we have the characteristic or category and the percent, and notice that if we add all of that up it goes over 150 percent. Especially when we don't have something that adds up to 100 percent, it can be a little bit confusing, so make sure that whenever you're describing this data you explain what it is showing. Here we can omit data, for example an "other" category. We'll talk about when to omit data later, but just know that you can. Next, the "other" category included: in this case the "other" category was included, versus the prior slide where it wasn't. If we have "other/unknown", maybe it doesn't tell us anything, it doesn't add any additional information to our study, and sometimes it does. So we have to decide: is this relevant to our study or not? Should we show it to our users?
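The effect of leaving out an "other" category can be sketched in Python. All of the counts below are made up for illustration (they are not the figures from the slides or from Table 1.3), and the `percentages` helper is a hypothetical name.

```python
from collections import Counter

# Hypothetical survey answers; every value here is made up for illustration.
responses = ["full-time"] * 12 + ["part-time"] * 20 + ["other/unknown"] * 8

def percentages(labels):
    """Return each category's share of the given labels, in percent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}

# Percentages with every category included: these add up to 100.
with_other = percentages(responses)

# Omitting "other/unknown" from the display: the shown values no longer
# add up to 100, which is worth explaining to whoever reads the chart.
displayed = {c: p for c, p in with_other.items() if c != "other/unknown"}

print(with_other)
print(displayed, "sum =", sum(displayed.values()))
```

With these made-up counts the displayed categories add up to 80 percent, so a reader needs to be told that a category was left out.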
Okay, so next, sampling. A sample should have the same characteristics as the population it's representing. We've already talked about this. The population itself is a group of things with some attributes. Suppose you sample those things and pick a subset that is very strange. Let's say you have a bunch of USB sticks and nearly all of them are over four gigabytes, right? But when you sample, you pick one USB stick and it's one gigabyte. Well, the one that you've sampled does not represent the entire population, because it's actually the only small one and everything else is much bigger. So the sample should have the same characteristics as the population it's representing, and we have to figure out ways of sampling that give us that proper representation. This of course can be very hard. The best way to do this is to use random sampling. In random sampling, each member of a population initially has an equal chance of being selected for the sample. We'll talk about this a little bit more when we're talking about probability, but let's say that I have a hundred people that I can sample from. I want to do random sampling and select from them with everyone having an equal chance. That's really the best way to get a representative sample, as long as the subset you're drawing from does actually represent the population. A simple random sample: each sample of the same size has an equal chance of being selected. So again, what we're trying to do is make sure that the probability of being selected is the same for every possible sample that I could select, and we'll talk more about this as well. A stratified sample: we divide the population into groups called strata and then take a proportionate
number from each stratum. So rather than being completely random, we divide the population, and we have to figure out how to actually divide or stratify the population, and then take a proportionate number from each stratum. Depending on how we do it, this gives us a representative sample of each stratum, and we can then compare them with each other to make sure that we are getting representative samples. Cluster sample: we divide the population into clusters or groups and then randomly select some of the clusters. Again, just how we divide the population is potentially a question; it's probably done randomly. We divide the population into clusters or groups and then randomly select some of the clusters to be our representatives. Systematic sample: we randomly select a starting point and take every nth piece of data from a listing of the population. So this one is also not completely random, but we try to distribute our sampling throughout the entire listing of the population. Each of them of course has some strengths and some weaknesses. The easiest, I would say, is completely random sampling. It's the easiest to do, it's the easiest to manage, and it tends to give the best quality data, as long as you have a large population and a large sample set. So, non-random sampling: convenience sampling uses results that are readily available, and this is actually what most studies tend to do. A lot of data has already been collected from a lot of different places. We might have data from prior studies or the newspaper or the internet or wherever, and the data has already been collected, so people tend to use that. Because they can't really get more data of the same quality or the same type, maybe they can't collect the data themselves and can only get it from another source, they use non-random convenience sampling, which uses results that are readily available, or data that's readily
available.
True random sampling is done with replacement. I talked about random sampling, where I can sample any object or any person in the group I'm sampling from. What "with replacement" means is this: let's say I have a hundred people and I select one person to be a sample. I measure their variables and record the values. Then I have two options. I can keep that person out, so the group I'm sampling from now has 99 people in it, or I can put that person back, so the group has 100 people in it again. To make sure that we have exactly the same probability on our next draw, I need to put the person back. So true random sampling is done with replacement, and that gives me the same probability every time I sample.
Many studies are done without replacement. That makes the chance of pulling the same member again zero, and it slightly increases the chance that each remaining member is selected. It's not necessarily wrong, but we have to be aware of what sampling with or without replacement does to our studies.
Sampling errors are errors caused by the sampling process. For example, the sample is not large enough: if the amount of data that we've collected is too small, we will start to get lots of errors, because we just don't have enough data to draw conclusions or do any analysis properly. So sampling errors come from the sampling process itself: either we don't have enough data, or the way that we sampled was not representative of the actual population. So sampling
and how we collect the data are really important for making sure that we have high-quality data to work with, so that we don't get a bunch of sampling errors.
Non-sampling errors are caused by factors not related to sampling, for example defective counting devices: anything that was not directly caused by the way that we set up our samples and the way that we analyzed them. These could be external factors.
Sampling bias is created when a sample is collected from a population and some members of the population are not as likely to be chosen as others. This happens in studies very often. For example, if we're in Korea and we are testing a new drug, most likely that drug is going to be tested on Koreans. It's possible that some foreigners might be in the study, but most likely it's going to be Koreans, because the population of Korea is mostly Korean. So in this case the sampling bias is towards Korean physiology. The result could be that we make statements like "this drug cures this disease in 99 percent of patients," but maybe that has something to do with Korean genetics. It might cure this ailment in 99 percent of Koreans, but in, let's say, Japanese, Chinese, or Western patients the percentage is actually much lower because of some genetic reason. I'm not sure what that would be, but I think you understand my point: we have a sampling bias. If our sample does not represent the actual population, or some members of the population are not as likely to be chosen and sampled as others, then it doesn't reflect the overall population, and that can cause a lot of problems.
Variation is the differences in samples. Depending on what we're working on, we may have very little variation. For example, if I was going to measure the size of my desk, I would measure it over and over again, and the variation would
exist, but it would be very small. Every time I measure, I might be off by a very small amount, less than a centimeter or something like that. So I would have variation in my measurements, but it would most likely be very small. In populations, especially when dealing with people, we have huge variation over a lot of different variables, so it really depends on what you're measuring. Variation is normal, and variation can be very interesting in studies. Just realize that if we're doing precise measurements, maybe there's not a lot of variation, but if we're dealing with people, sociology, potentially medicine, things like that, then we're likely to have lots of variation in our samples.
So, critical evaluation. We want to critically evaluate the statistical studies we read about and analyze them before accepting the results. A lot of people don't do this at all. I can't stress this enough: you need to evaluate the studies that you read about. If you're reading the newspaper and they're showing you a graph and then telling you something in the article, does the graph actually match the article? Does the graph have enough information in it to lead you to real conclusions, or does it just look good without actually telling you anything? I've seen a lot of charts and graphs in newspapers, on the news, and on the internet that look very convincing, but when you actually try to analyze the graph, it doesn't really tell you anything. Because it looks good, people don't critically analyze these graphs. That's a big problem, because a lot of our society doesn't really analyze the data and draws its own conclusions based on sometimes incorrect data. It's very dangerous.
So, some problems. First, problems with samples: a sample must be representative of the population. We've already talked about representative samples. It
must be representative of the population if we want correct, accurate, good data.
Self-selected samples: responses only by people who choose to respond, such as call-in surveys, are often unreliable. A very interesting thing about most of the surveys that I see done by Korean TV and newspapers is that they're done during the daytime, when most people are at work, by calling people's homes. So only people who are at home during the day will be reached. What you get is a kind of self-selected sample in which mostly the older generation, who stay home, answer the phone and answer the questions, whereas the younger generations are at school or at work and are hardly represented in the data at all. Likewise, if the surveys are online, it's mostly going to be younger people answering them. It's not really going to be people 75 or older doing surveys online. So it's skewed the other way whenever we're dealing with self-selected online surveys or anything to do with technology.
Sample size issues: samples that are too small may be unreliable. It really depends on the type of study that we're doing, but as a rough rule of thumb, if you have fewer than 30 samples, even in a very specific study, you basically don't have enough data. If you're doing a very large study with a lot of different variables involved, then 30 is not enough either. Larger samples are better; if possible, always get as much data as you can.
Undue influence is collecting data or asking questions in a way that influences the response. I also see this a lot in different media: people asking questions in a way that either makes people assume the answer or influences
how they answer or what they think about things. So be very careful about how you're asking questions, because the way you ask can push people to answer in certain ways even if you don't realize you're doing it. You need to evaluate how you're asking your questions.
Non-response, or refusal of subjects to participate: the collected responses may no longer be representative of the population. A lot of groups do not talk about their non-response rate. If your non-response rate is very large, there could be a problem: the people who actually did respond may not be representative of the population. So consider how many of the people you asked didn't respond or just refused to respond, and think about why they might have done that. Then think about whether the people who responded are greatly different from the people who didn't, because it might say something about your population.
Causality: a relationship between two variables does not mean that one causes the other to occur. This is also a huge problem in a lot of different studies. What do we say? Correlation does not equal causation. We can have two variables that are highly correlated, or look like they have a strong relationship to each other. However, just because they look like they have a strong relationship does not mean that one actually causes the other. We'll talk more about that later.
Also, be very aware of self-funded or self-interest studies. These are usually done by people with some sort of agenda: they want to say something very specific, and it might not have anything to do with reality. If a study is self-funded or based on some sort of self-interest, then people
may make false claims and then present them as true. What we're doing with statistics is trying to find the actual truth, not the truth that people want. So self-funded or self-interest studies tend to be extremely biased; be aware of them.
And of course there is misleading use of data. People can change the way data looks to make it appear either better or worse than it actually is. Media outlets do this a lot, because they want to sell more papers or whatever it is that they do: they use data in incorrect ways or make data look a certain way so they can force their own conclusions on it.
And then confounding: when the effects of multiple factors on a response cannot be separated. There can be a lot of different factors that affect the thing we're trying to study. For example, let's say I want to know what factors make people eat ramyeon on a normal day. Maybe one factor is the weather: if it's rainy and cold, people want something warm in their belly. But it could also be that their mom made ramyeon for them when they were young, so they have fond memories of it. That's another factor. So the response is eating ramyeon, and we have multiple factors that affect what makes a person want to eat it. Whenever we have multiple factors affecting the response we're looking at, it can be difficult to pick out which one is actually having an effect, and they might all be having an effect at the same time.
Okay, levels of measurement. The level of measurement is the way that a data set is measured. The nominal scale covers qualitative categories: colors, names, labels, favorite
foods, yes-or-no responses, et cetera. The ordinal scale is like the nominal scale, but the data can be ordered: a list of the top five national parks in the United States, for example. The interval scale is similar to ordinal, but the differences between values are meaningful, although there is no true zero point. Temperature scales like Celsius and Fahrenheit are measured using the interval scale. And the ratio scale is like the interval scale, but it has a true zero point, so ratios can be calculated. For example, four multiple-choice statistics final exam scores are 80, 68, 20, and 92 out of a possible 100 points, and a score of 80 is four times a score of 20. So, yeah, ratio scale. All of this, again, is in the book. Just be aware of what these different scales are. What we're talking about here is just measurement: how can we measure different things, what types of measurements can we actually make, and what are those called whenever we're doing those measurements?
Frequency is the number of times a value of the data occurs: for example, the number of times that a female answered the question, or the number of times that someone called a foreign phone number, something like that. Relative frequency is the ratio of the number of times a value of the data occurs in the set of all outcomes to the total number of outcomes. To get this, we just divide each frequency by the total number of students, or whatever it is that we're measuring in the sample. So here we have a data value, its frequency, and its relative frequency. We'll actually talk about this a little bit more later.
Okay, so finally, getting to experimental design and ethics. The purpose of an experiment is to investigate the relationship between two variables. First we'll talk about descriptive statistics, which just tells us about the data, but whenever we're actually doing experiments, we're attempting to find the
relationships between two things: what causes what, what affects what. The explanatory variable is the variable that causes change in another, and the response variable is the variable that is affected by the explanatory variable. So the explanatory variable is the one we believe causes the change: if we modify it, let's say we give it different values, then we would expect the response variable to respond with different values.
Treatments are the different values of the explanatory variable. So we have multiple treatments: each one sets the explanatory variable to a particular value, and we observe the response variable.
Okay, control group: a treatment group that receives no treatment. We also need to figure out the probability of an event occurring in a group by itself, without modifying the explanatory variable. Is it possible that a group might regrow all of their hair naturally, even though in our study we measure that our treatment regrows hair? Well, what's the probability of hair regrowing naturally? Something like that. So our control group is a treatment group that receives no treatment at all. That means we can observe a sample of the population when nothing is happening to them, compare that group to the group that did have something happen to them, and see what the difference is.
A placebo, maybe you've heard of it, is a treatment that cannot influence the response variable. In medicine, a lot of times we use sugar pills as placebos. What this does is make people think that they are taking a medicine or a real pill, and because
people think that they're taking medicine or a real pill, they actually have changes in their body: very slight changes, but some changes. So we use placebos to see, if we give somebody a sugar pill versus a real pill, what's the difference? Does the real pill work the same as a sugar pill? If it does, then the real pill doesn't actually do anything, because the sugar pill doesn't do anything. So the control group is a group that doesn't receive any treatment. The placebo group is a group that thinks they're taking a pill but essentially are not. We can measure whether there are any differences between the control group and the placebo group. Then we have the actual experiment: the group that takes the real pill. We can see how different the real pill was from the placebo or from the control group. If there is a significant difference from control or placebo, that suggests our pill is actually doing something. If there's not a significant difference, that suggests our pill is not doing anything. So having a control group and a placebo is very important: in statistics you should always have a control group and a placebo, assuming that you're doing certain types of tests, which we'll talk about later.
Okay. The widespread misuse and misrepresentation of statistical information often gives the field a bad name. People don't really trust statistics, because some groups manipulate statistics to say whatever they want. The only cure for that is for the average person to know how to read statistics and be able to look at them and say, "Hey, this isn't right."
It's never acceptable to falsify data, although a lot of different groups, especially in Korea, are falsifying data. It's never acceptable; it's completely immoral to falsify data, because you don't know how that
data is going to be used. For example, if you collect data and change it from the value that it actually is, then a few years later it could be used to evaluate whether aid should be given to a certain country, and because of your data it looks like the country is doing very well, so it doesn't need any aid. Or should there be more health services in that country? Well, if you lied in your data, then people might say this country doesn't need any more health services, or health care, or women's rights programs, or whatever. So it's never acceptable to falsify data. It's completely immoral, and it can cause long-term damage to society, depending on what type of data you're looking at.
Review boards are normally set up to ensure that researchers are not abusing participants and do not alter data. Review boards are mostly in universities, although in my experience I've found that they are sometimes relatively lax. I don't work in medicine or anything like that, I work in social science, but the review boards don't appear to be as strict as I think they should be. So take it on yourself: do the right thing and do not falsify your data. Everyone makes mistakes in data collection sometimes; just don't do it on purpose.
So that is it for basic definitions and some information about statistics. We will get much more into descriptive statistics in the next lecture. Thank you very much.
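If it helps to see the sampling methods and relative frequency from this lecture in code, here is a minimal Python sketch. None of this comes from the book. The function names, the made-up population of 100 students, and the rounding rule in the stratified sample are all assumptions for illustration, and it only uses Python's standard random module and collections.Counter.

```python
import random
from collections import Counter


def simple_random_sample(population, k, rng):
    """Draw k distinct members; every subset of size k is equally likely."""
    return rng.sample(population, k)


def sample_with_replacement(population, k, rng):
    """Draw k members, putting each one back, so every draw has the same odds."""
    return [rng.choice(population) for _ in range(k)]


def stratified_sample(population, stratum_of, fraction, rng):
    """Split into strata, then draw the same fraction from each stratum."""
    strata = {}
    for member in population:
        strata.setdefault(stratum_of(member), []).append(member)
    sample = []
    for members in strata.values():
        # round() is an assumption here; a real study needs its own rounding rule.
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample


def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters and keep every member of the chosen ones."""
    chosen = rng.sample(clusters, n_clusters)
    return [member for cluster in chosen for member in cluster]


def systematic_sample(listing, step, rng):
    """Pick a random starting point, then take every step-th item of the listing."""
    start = rng.randrange(step)
    return listing[start::step]


def relative_frequencies(values):
    """Divide each value's frequency by the total number of observations."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    # Hypothetical population: 100 students, 60 first-years and 40 second-years.
    students = [(i, "first" if i < 60 else "second") for i in range(100)]
    print(simple_random_sample(students, 5, rng))
    print(sample_with_replacement(students, 5, rng))
    print(len(stratified_sample(students, lambda s: s[1], 0.1, rng)))
    classes = [students[i:i + 10] for i in range(0, 100, 10)]
    print(len(cluster_sample(classes, 2, rng)))
    print(systematic_sample(students, 20, rng))
    print(relative_frequencies(["yes", "yes", "no", "yes"]))
```

Notice that the only difference between the first two functions is whether a selected member goes back into the group before the next draw, which is exactly the with-replacement versus without-replacement distinction from earlier.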