In a normal research study, we cannot study full populations because of practical issues. Therefore, we have to rely on a sample. When we take that sample, there are multiple things that we need to consider and a number of things that can go wrong, producing results that are either biased or inefficient. So let's take a look at some issues related to sampling.

First, we have a population. The population is the thing that we want to study and say something about. Let's say we are studying that population using a survey that we mail to companies. To send out the invitations to participate, we need an address for every company, so we need some kind of operational definition of our population. We call this operational population a sampling frame. The sampling frame is an actual list of companies, people, or whatever things we are studying, and the population is the conceptual definition of the thing that we are studying. If we are studying individuals in Finland, for example, the sampling frame could come from the population register, and it could contain millions of people.

Then from the sampling frame, we actually take the sample. Typically, we choose people randomly. A random sample is the simplest way of taking a sample, and it is often the most desirable way as well. Then we send out our survey. Some people or companies choose to participate, and others choose not to. So we get an actual data set that we can work with. A number of things can go wrong here, and we have to take those into consideration.

Let's take an example of what this framework means. Let's say we are studying the population of young Finnish high-technology companies. That is a conceptual definition.
Then we need an operational definition of what it means to be a young company and what it means to be a technology firm. No one is maintaining a list of technology companies, so we have to operationalize that concept in a way that we can actually get data for. The sampling frame would be, for example, business IDs, that is, registered corporations. A registered corporation is not the same thing as a company: one company, one organization, can have multiple business IDs. But we need some kind of operational definition that we can actually get data for, and we can get data for business IDs, the legal entities behind these companies. So let's define young technology companies as companies that are 0 to 3 years old and are in certain industry codes, for example 62 or 72. Those correspond to information technology industries. That is our operational definition, and it allows us to get a list of actual companies.

Then we get a sample. Let's say that we get a thousand firms randomly selected from a list of maybe 10,000 or 5,000 companies, whatever the size of the sampling frame is. The reason for taking a sample instead of contacting everyone is cost. Whenever we email or mail, acquiring the addresses costs some money or effort, and if we mail physical letters, there are printing costs as well. Then we get actual data: for example, 10% of the informants that were invited to participate decide to respond to the survey.

So what can go wrong with this kind of design? There are multiple things. The relevant question with the sampling frame is whether our operational definition of the population matches the conceptual one. Does the frame match the population? The second question is how large the sample is. If the sample is randomly chosen, then the only thing that we can decide is how many observations we get. So is 1,000 enough? And when we plan for sample size, we have to take the expected response rate into consideration.
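The steps described so far (registry, operational definition, sampling frame, random sample, expected responses) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the registry is simulated, and the field names `business_id`, `age_years` and `industry_code` as well as the helper functions are hypothetical, not taken from any real register.

```python
import random

# Hypothetical registry of legal entities; the field names are assumptions.
rng = random.Random(42)
registry = [
    {
        "business_id": i,
        "age_years": rng.randint(0, 20),
        "industry_code": rng.choice(["41", "47", "62", "72"]),
    }
    for i in range(100_000)
]


def build_sampling_frame(entities, max_age_years=3, industry_codes=("62", "72")):
    """Apply the operational definition: young firms in the chosen industry codes."""
    return [
        e
        for e in entities
        if e["age_years"] <= max_age_years and e["industry_code"] in industry_codes
    ]


def expected_responses(invitations, response_rate):
    """Expected number of responses if invitees respond at the given rate."""
    return invitations * response_rate


frame = build_sampling_frame(registry)
sample = rng.sample(frame, 1000)  # simple random sample, without replacement

print("frame size:", len(frame))
print("invited:", len(sample))
print("expected responses at 10%:", expected_responses(len(sample), 0.10))
```

With 1,000 invitations and a 10% response rate, the expected data set is only about 100 companies, which is why the response rate has to be part of the sample size planning.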
So if we expect a 10% response rate and we need 500 full responses for our analysis, then we should send out the invitation to 5,000 companies. We would then have 5,000 randomly chosen firms instead of 1,000.

Then the most problematic part is that the people or companies who decide to respond may not be randomly chosen. If, out of the 1,000 companies that were invited to participate, a random 10% respond, that only means that we have inefficiency. Increasing the sample size would make our estimates more precise, but that's it. A more problematic condition occurs if these 10% are selected systematically. That leads to biased results. For example, if our survey was about innovativeness, and those companies that are more innovative are more likely to participate, then any regression analysis involving innovation as a dependent variable would produce biased results.

Let's take a look at why that happens. This is a classic example from Berk's 1983 paper. He is demonstrating that there is a relationship between education and income, such that when education increases, income goes up as well. It's a linear relationship here. What will happen if people who get a low income don't provide data, or if people who don't have much education simply decide not to work? Let's set a barrier here, so that no one below this point actually provides us data. If we eliminate that data, what will happen to our regression estimates?

First of all, the regression results will be biased, because now we are fitting the regression only to the data that remain. We are cutting off the observations that produce negative residuals. They are negative residuals because they are mostly below the regression line. The negative residuals here are excluded, and the positive residuals here are included. That pulls the regression line up.
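The truncation argument above can be checked with a small simulation. This is a minimal sketch, not a reproduction of Berk's data: the intercept, slope, noise level and income cutoff are all made-up values chosen for illustration, and `ols_fit` is a hypothetical helper that computes an ordinary least squares line by hand.

```python
import random


def ols_fit(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


rng = random.Random(1)

# Assumed "true" linear relationship between education and income.
TRUE_INTERCEPT, TRUE_SLOPE, NOISE_SD = 10.0, 2.0, 8.0
education = [rng.uniform(8, 20) for _ in range(20_000)]
income = [TRUE_INTERCEPT + TRUE_SLOPE * e + rng.gauss(0, NOISE_SD) for e in education]

full_intercept, full_slope = ols_fit(education, income)

# Selection on the dependent variable: incomes below the cutoff are never observed.
INCOME_CUTOFF = 35.0
kept = [(e, y) for e, y in zip(education, income) if y >= INCOME_CUTOFF]
trunc_intercept, trunc_slope = ols_fit([e for e, _ in kept], [y for _, y in kept])

print("full sample:      intercept %.2f, slope %.2f" % (full_intercept, full_slope))
print("truncated sample: intercept %.2f, slope %.2f" % (trunc_intercept, trunc_slope))
```

In this setup the full-sample fit recovers roughly the assumed slope, while the truncated fit has a higher intercept and a flatter slope, because the low-income observations that would have pulled the line down at low education levels are missing.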
So the regression results will be biased. We have no idea what the effect is like in the group of low-income or low-education people, and we also get biased results for those people for whom we do have data. If our sample is selected systematically based on the variable that we study, then our results will be biased, and the magnitude of the bias can be great in some instances.

This is not just an academic concern. I will next show a couple of examples. There is a widely known business book called Good to Great by Jim Collins. It has sold millions of copies and provided inspiration for lots of managers. It also received great attention in Finland when it was first translated into Finnish. So many people think this is a valuable book.

How was the book written? Well, there is a slight problem. It's presented as an academic study, and it kind of is, but there are methodological problems with it. In the book, they basically chose a large number of good companies and, with a research team, followed the performance of those companies over 40 years based on some accounting measures. Then they found 11 companies that were initially good companies and then became great companies according to the definitions that the authors used. Jim Collins is the first author, but he had a team of researchers helping him write the book. So they chose 11 companies that performed extremely well, asked afterwards why these companies performed better than others, and then wrote a book about it.

There are two problems with that. First of all, if you choose companies that happened to be great in the past, then you are sampling on the dependent variable.
If a company happens to be good for a chance reason, it will get selected, or at least some of these companies could get selected for chance reasons. When some other researchers looked at these companies over the next 15-year period, only one out of the 11 was still great. So we can attribute the choice of these 11 companies to chance.

The second problem is that when a company is performing well, people start to attribute that performance to something that the company did. That's called the halo effect. When you identify companies that are doing well and then ask people to evaluate why these companies are doing well, people answer that they are doing well because of something that they did in the past. It's also possible that these companies just happened to be lucky, and the fact that only one out of 11 stayed great over the following 15-year period underlines the point that luck is the likely explanation. So these companies happened to be great for reasons unknown, and then people attribute the greatness to something that the companies did. This design does not provide evidence of causality.

Let's take another example. This is from Morgan and Winship's book on causal inference. They have a hypothetical college where entry depends on the SAT, which is basically the standardized American college admission test, and on a motivation score that is measured somehow. The motivation score and the SAT score are weakly and positively correlated with each other, and college entry depends on both of them. Here is the data: this axis is the SAT score and this axis is the motivation score. These observations here were not accepted to the college, and the circled ones were admitted. So the sum of the SAT score and the motivation score determines who gets to go to this hypothetical college. There is a weak positive relationship here. We can't really see it in the plot, but the correlation is around 0.1.
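The hypothetical college can be simulated to see what selecting on the sum of the two scores does to their correlation. The sketch below is an illustration under my own assumptions, not the data from the book: both scores are standardized, the population correlation is set to 0.1, and the admission threshold `ADMISSION_THRESHOLD` is an arbitrary value. The helper `pearson` is written by hand to keep the example self-contained.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(3)
RHO = 0.1                  # assumed weak positive correlation in the population
ADMISSION_THRESHOLD = 1.5  # assumed cutoff on the sum of the two scores

applicants = []
for _ in range(50_000):
    sat = rng.gauss(0, 1)
    # Construct a motivation score that correlates RHO with the SAT score.
    motivation = RHO * sat + math.sqrt(1 - RHO**2) * rng.gauss(0, 1)
    applicants.append((sat, motivation))

admitted = [(s, m) for s, m in applicants if s + m > ADMISSION_THRESHOLD]
rejected = [(s, m) for s, m in applicants if s + m <= ADMISSION_THRESHOLD]


def corr(group):
    return pearson([s for s, _ in group], [m for _, m in group])


corr_all, corr_admitted, corr_rejected = corr(applicants), corr(admitted), corr(rejected)
print("all applicants: %.2f" % corr_all)
print("admitted only:  %.2f" % corr_admitted)
print("rejected only:  %.2f" % corr_rejected)
```

Under these assumptions the correlation is weakly positive among all applicants but clearly negative among the admitted students, and it is also negative among the rejected ones, even though nothing about the underlying relationship between the two scores has changed.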
It is simply not visible to the naked eye. What happens if we measure the correlation only from those people who got admitted to the college? If we only observe the students who got in, there is a strong negative correlation. We get a strong negative correlation because we only study those people who got admitted. If you were the principal of this college, a smart principal would ask whether the result also replicates among those students who didn't get accepted, and would find that yes, it does. You get the same negative result.

This negative result has very little to do with the actual relationship between motivation and SAT score. Instead, it's a function of how we selected the sample. If we choose the sample so that the sum of the motivation score and the SAT score must be more than a threshold, or less than a threshold, then we get this kind of negative correlation just because of how the sample was selected. This is called the selection effect, and the outcome is selection bias. So whenever you take a sample, unless you are careful that your sample is actually a random sample of the population under study, you risk having selection bias in your analysis, and the bias can be great.

Let's take another really, really practical example. I went to the building fair in Vantaa a couple of years ago, and there was a construction company presenting an idea called a container home. It's a small home, the size of a shipping container, and these can be built as condominiums. The idea is that you can increase the density of housing by having these very small apartments, and the company wanted to get feedback on the idea. How was the feedback collected? They had a polling station where you could indicate whether you agree or disagree that the container home is a good idea. The way it was actually set up is that you walked along a road, and the container home was on the side of the road.
You could choose to just walk by, or you could choose to go in. If you went in, you walked through the apartment to the balcony, and that's where the polling station was. So what is the problem? Why could that produce a selection effect? Of course, people who are not interested at all, who think this is a stupid idea, will just walk past the container home, and they will never see the polling station, which is behind it. You actually have to show enough interest to go in and walk all the way through to the back, and only after you have seen the home do you give an opinion. The counterargument is that you only want opinions from people who have actually seen what it looks like inside. But that is less important than the fact that people who think it's a stupid idea in the first place will just walk by without providing any data.

So this is an introduction to issues that are related to sampling, and there are multiple different techniques that you can apply. These selection effects can be modeled, and you can also do sampling in many different ways to increase efficiency and to reduce the risk of selection bias. If you are a Stata user, Stata has a separate manual for survey data that discusses different sampling designs, and here are some references that you may be interested in.

The typical sample in a statistics book is a random sample, and that is also what I will be covering in this course, because assuming that the sample is random simplifies things a lot. The second kind of sample that is very common is a cluster sample. A cluster sample refers to a sample where the observations are no longer equally likely to be selected. A random sample is defined as a sample where each observation in a population is equally likely to be selected. A cluster sample, on the other hand, refers to a scenario where you, for example, have to interview people at their homes.
If you do that and you take a random sample of all Finnish households, then you will have to travel all over Finland to get your data. So in practice we choose a couple of cities, and from those cities a couple of streets, and we then sample people from those streets or just interview everyone on those streets. So we take samples from clusters. If your neighbors are interviewed, then it's more likely that you are interviewed as well. The probability of being selected is clustered, so that if you live close to people who are more likely to be selected, you are more likely to be selected as well. Cluster sampling causes some problems that we'll talk about later.

One design that can help with these issues is the stratified random sample. Stratified random samples concern situations where you have, for example, an uneven distribution of people, or the cluster sample issue. Let's say we have a school with 300 students, out of which 30 are minorities. In that kind of scenario, a random sample of 50 students is likely to contain a very small number of minority students, about five on average. So it makes sense to take one sample from the minority students and a separate sample from the other students, so that you get a sample that is better for your study. Stratification refers to first dividing the sampling frame into different strata, or different sets, and then taking a random sample from each set. Stratification improves the distribution of your variables. It produces random samples that can be better in some instances, and it's a very commonly used sampling design.

So these are the three most commonly used sampling designs. In a random sample, everybody is equally likely to be selected. A cluster sample means that you choose people from certain areas, which you choose in advance, so that people in other areas have a zero chance of being selected.
So that's cluster sampling. Stratified random sampling means that you divide your sampling frame into different strata based on some criteria, for example race, education level, and so on, and then you take a random sample from each of those strata separately. That provides you with some statistical benefits.

Then we have a fourth commonly used type of sample, called the convenience sample, and a convenience sample is none of the above. A convenience sample is something that we just happen to get. In most cases, if you do a survey study and send out invitations, the people or organizations that you choose to invite may be a random sample, but in the end those that you get data from are not a random sample of those who got the invitation. Rather, it's a convenience sample: just the companies that we happened to get. Convenience samples are debated to some extent. Some people argue that they should be avoided, and some people argue that convenience samples are useful because, for example, they allow us to do designs that wouldn't be possible with random samples. But you have to understand these different concepts to understand the issues related to sampling that I'll cover in later videos.
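The school example from the stratified sampling discussion (300 students, 30 of them minority students, a sample of 50) can be sketched as follows. This is a rough illustration: the student records are simulated, and the allocation of 20 minority and 30 other students in the stratified design is my own assumption, not a recommendation from the lecture.

```python
import random

rng = random.Random(7)

# Simulated school: 30 minority students and 270 other students.
students = [{"id": i, "group": "minority" if i < 30 else "other"} for i in range(300)]


def count_minority(sample):
    return sum(1 for s in sample if s["group"] == "minority")


# Simple random sampling: the number of minority students varies from draw to draw.
srs_counts = [count_minority(rng.sample(students, 50)) for _ in range(2000)]
mean_srs = sum(srs_counts) / len(srs_counts)


def stratified_sample(frame, allocation, rng):
    """Draw a separate simple random sample from each stratum."""
    sample = []
    for group, n in allocation.items():
        stratum = [s for s in frame if s["group"] == group]
        sample.extend(rng.sample(stratum, n))
    return sample


# Assumed allocation: oversample the small stratum.
strat = stratified_sample(students, {"minority": 20, "other": 30}, rng)

print("SRS: mean %.1f minority students (min %d, max %d)"
      % (mean_srs, min(srs_counts), max(srs_counts)))
print("stratified: %d minority students in every draw" % count_minority(strat))
```

A simple random sample of 50 contains about five minority students on average, and sometimes far fewer, whereas the stratified design fixes that number in advance. Note that when a stratum is oversampled like this, estimates for the whole school need to be weighted by the strata sizes.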