Assalamu alaikum. Welcome to lecture number two of the course on statistics and probability. In the last lecture, I conveyed to you the organization and structure of this course. In addition, I discussed with you the nature of the discipline of statistics and the importance of statistics in different fields, and I also picked up some technical concepts such as data, variables, measurement scales and errors of measurement. Today, we will be discussing the various steps involved in a statistical inquiry. In particular, we will be discussing the methods of data collection, both primary data and secondary data, and we will also discuss at some length the concept of sampling. Students, what are the steps involved in any research that we want to base on real-life data? Actually, it is quite a scientific method, and you have to have a methodical approach toward this problem. So if I try to list the steps, we can put it like this. For any statistical inquiry, first of all, you should be very clear about the topic and the significance of the study. You should be absolutely clear about the objective of your study, exactly what it is that you are trying to find out. Then, of course, comes the methodology for data collection: the source of your data, the sampling methodology, and the instrument for collecting your data. The tool by which you collect your data is of the utmost importance. Once you have collected the data, you proceed to the analysis of this data. You draw results and conclusions and, finally, recommendations based on your study. As far as the methodology for data collection is concerned, I just said to you that there are three things about which you should be absolutely clear. First of all, the source of your data, that is, the statistical population from which you will collect the data.
And then the sampling methodology, because more often than not you do not have enough resources to collect data from the entire population, and you have to resort to sampling. And after that, equally important, the instrument, the method by which you would collect your data. So, I will begin the discussion from the third point, the instrument, and then I will go to the first two points, the population and the procedure for sampling. Students, in any statistical inquiry, the collection of data is one of the most important steps. You see, the point is that however sophisticated the methods you later apply in your analysis, if your data itself is not right, how can you expect your results to be reliable and valid? In this connection, the first thing to understand is that we can differentiate between two types of data that we could collect: primary data and secondary data. Data that have been originally collected and have not undergone any sort of statistical treatment are called primary data. In other words, the data that you collect yourself, fresh data, is primary data. On the other hand, data that have undergone some sort of statistical treatment are called secondary data. That is, data that have been treated by statistical methods at least once, data that have been collected, classified, tabulated and presented in some form for a certain purpose, are called secondary data. As far as collection of primary data is concerned, students, there are a number of ways of doing that. We have direct personal investigation, indirect investigation, collection through questionnaires, collection through enumerators, and there is also a method in which data are collected through local sources. First of all, let us talk about direct personal investigation.
Students, this is the method in which the researcher or investigator collects the data personally and does not rely on anyone else. And because his own personal interest in the research problem is so great, the data collected in this way is generally quite accurate. But of course, as you can understand, this method, in which the researcher is doing everything himself, can prove quite costly and time consuming, especially when the area to be covered is vast. However, it is quite a useful method for lab experiments or for localized inquiries. But there is one more problem here. Although the researcher may want and believe himself to be very impartial and objective, it is sometimes possible that his personal inclination, his personal bent of mind, his personal way of thinking, which we call personal bias, enters into the data, and so the data may not be entirely accurate. The next method that I would like to discuss is the method of indirect investigation. You see, it sometimes happens that the people about whom you want to obtain information are hesitant to provide that information to you. For example, you all know that people do not readily want to share their income with you. Or, in the case of women, it is commonly said that women do not want to tell you their age. So this is just a small example to illustrate the point that sometimes it may not be possible for you to obtain accurate information directly from the persons concerned. In such a situation, you would interview some third party who has knowledge about those people and about that phenomenon, and you would collect data from these third parties in an indirect way.
This is called indirect data collection. But here we have to be very careful, because sometimes it is possible that the third party may deliberately give you wrong information about this person. The next method is the questionnaire method. In this method, you prepare a questionnaire and administer it to your respondents, and they are required to answer all the questions and to fill out the pro forma. In Western countries, and in all those countries where the level of literacy and education is much higher than in our developing countries, the most common way is by mail. They would send the questionnaire by mail and, as you know, this is now the age of email and all such things. But in developing countries, as you know, there are so many people who are not able to read and write. Generally you would have a trained enumerator who would go to the respondent and interview the person, and the enumerator would fill in the form according to the answers given by the respondent. So in this manner, the next method that I listed, the method of trained enumerators, links up with the questionnaire method, because when a trained enumerator goes out, he too has a questionnaire with him which he administers to the respondent. Besides all these methods, there is one more method that I mentioned, the method in which the data are collected locally from local sources. A simple example of this is crop estimation. When you have to make an estimate about a crop, you ask for information from the local sources, such as the village numberdar (headman) and the local people concerned. So that was the discussion of primary data, the data that you will collect yourself. The other kind of data is secondary data and, as I mentioned earlier, this is data that has already been collected by some organization.
We can categorize the sources that we have into different categories. We have official sources, semi-official sources, publications of trade associations, chambers of commerce and such organizations, and we also have research organizations. The category of official sources includes governmental departments such as the statistics division and the provincial and federal bureaus of statistics. Semi-official sources are semi-government institutions, for example the railway board or the central cotton committee; institutions of this kind fall into that category. Students, that was the discussion of methods of data collection. Let us now come to the point I made at the beginning: the source of our data, that is, where will we collect the data from? The population, the statistical population, and then also the method of sampling. The first thing to look at here is why we have to resort to sampling, as I said earlier. Of course, if we could conduct a complete census, a complete count, that would be a perfect situation, an ideal situation, because you would exhaust your entire population and collect data from each and every element of the population. But the problem is that this perfect and ideal situation is not available in real life. Very rarely is it possible for you to have access to the entire population. More often than not, the study has to be conducted on a sample basis. You should understand that this is the very goal of the subject of statistics. The goal of the science of statistics is to draw conclusions about large populations on the basis of data collected on a sample basis. Let us now formally define this word "population". As you can see on the screen, a population is the collection of every member of a group possessing the same basic and defined characteristic, but varying in amount or quality from one member to another. We have two types of population, the finite population and the infinite population.
Let me explain this point with the help of some examples. For a finite population, you could think of the IQs of all the children in a school, the heights of those children, their weights, their blood pressures, their body temperatures. The point is that since the number of children in that school is a finite number, data collected on any variable related to them will constitute a finite population. The other, of course, is the infinite population. For example, consider barometric pressure: there are an infinitely large number of points on the surface of the earth, and hence we can have an infinite number of readings of barometric pressure. The thing to understand in this connection is that many populations are so large that, even though a very rigorous definition would make them finite populations, for all practical purposes they are equivalent to an infinite population. Students, both the examples that I have presented to you were of what can be called existent populations. But then we could also talk about the hypothetical population, the population of all conceivable ways in which a certain event can happen. For example, all possible outcomes from the throw of a die. However long we throw the die and record the results, we could always continue to do so for a still longer period. Students, there is one other differentiation that we need to make, and that is between the sampled population and the target population. The sampled population is that from which the sample is chosen, whereas the target population is that about which the information is sought. For example, suppose that we desire to know the opinions of the college students in the Punjab regarding the present examination system. Our population will in that case consist of all the students in all the colleges of the Punjab.
But suppose that, on account of shortage of resources or time, we are unable to conduct a survey of this kind in all the colleges, and we select only five colleges scattered throughout the province. Then our target population consists of the students of all the colleges, whereas the sampled population consists of the students of those five colleges that we have chosen. Students, the next question is how do we draw a sample from our population? The very first thing to note is that we basically have two methods of sampling: non-random sampling and random sampling. The whole theory of statistics, the theory of statistical inference, is based on the assumption that our sample is a random sample. In order to draw a random sample from a population, students, the first thing we need is the complete list of our population, which is technically called the frame. For example, the complete list of all the BCS students of the Virtual University of Pakistan as on the 15th of February 2003. Students, as far as the sampling frame is concerned, it should be kept in mind that, as far as possible, our frame should be free from the various types of defects that can occur. It should not contain inaccurate elements, it should not be incomplete, it should be free from duplication and it should not be out of date. I have just given you the example of the list of all the students of the Virtual University as on the 15th of February 2003. So, regarding the defects that I have mentioned, you can understand that if the list contains some student who had previously enrolled but has now left, or if the list contains the name of the same student twice, then these are the kinds of errors that we must try to avoid. As far as possible, the frame should be complete, accurate and up to date. Now, let us talk about actual sampling. From this frame we need to draw a sample. The first thing to keep in mind is that a sample is only a part of the population, and therefore it can only represent the population to a certain extent.
It cannot represent it fully, and the goal of sampling is that the sample should be drawn in such a way that it is a good representative of the population in spite of the fact that it is usually much smaller than the population. Students, sampling has many advantages from the practical point of view. You are able to save a lot of time and money simply because the bulk of work is so much smaller than in a complete census. You are able to collect more detailed information about a number of variables. You have the possibility of follow-up. What do we call follow-up? Students, when you collect data, it is not necessarily the case that all the information you collect is accurate. It is possible that, in the questionnaire your respondent sent back to you, he answered nine out of ten questions but did not answer the tenth. Also, you might have the answers, but it might be very clear to you that there are some errors in them. In such a situation, you would follow up on your query. You contact your respondent again and you try to have him rectify the errors. Obviously, in a complete census there is very little possibility of follow-up because it is such a huge study. You cannot afford to follow up with your respondents, but if it is only a sample, many times it is possible for you to do so. Students, let me now discuss with you two technical concepts: the concept of sampling error and the concept of non-sampling error. Sampling error is defined as the difference between the true value of a population quantity, such as the true population mean, and the corresponding value computed from the sample, for example the sample mean. That is, as you can now see on the screen, in the case of the mean, sampling error is defined as x-bar minus mu, where mu represents the true population mean and x-bar represents the sample mean. This difference, called sampling error, is due to sampling, because a sample is only a part of the population.
Hence, the value that you compute from the sample cannot be expected to be exactly the same as the one you would have from the entire population, and this error would be there even if your sample has been drawn in a very correct manner. Besides sampling error, there is this other kind of error called non-sampling error, which is not because of sampling and which would be there even if you were conducting a complete census. Defects in the sampling frame, faulty reporting of facts due to personal preferences, negligence or indifference of the enumerator, and non-response to mail questionnaires are examples of non-sampling error. Students, note one more point regarding any study, whether it is based on a census or on sampling: how long should the process of data collection be continued? Obviously, in no such study can the process of data collection be prolonged for an indefinite period. In fact, the longer your data collection takes, the more possibility there is of variations in response because of the time lag. Hence, the procedure is that a definite cut-off date is generally established for any data-based study. Students, I told you a while ago that we can have non-random sampling or random sampling. Of course, my stress in a short while will be very much on random sampling, but before this I would like to discuss with you a little what non-random sampling means. Non-random sampling is that in which we select the elements using our personal judgment. It has different types, and one of the most popular is called quota sampling. Quota sampling is often used in commercial surveys such as consumer market research. For example, suppose that a particular company wants to know the opinion of the people about a new product that it has launched in the market.
They may tell one of their enumerators that he should interview ten married women between 30 and 40 years of age living in a particular area whose husbands are professional workers, and five unmarried professional women of the same age living in the same area. In other words, this enumerator is restricted by quota controls. He has been given the numbers and categories of people to interview, but which particular individuals he picks within those categories is totally up to his own discretion. Therefore, students, you can see that in this kind of sampling there is no need for a list of the entire population out of which you would draw a sample using the lottery method. As such, it is obvious that this is a very convenient form of sampling. There is a lot of cost reduction and also a lot of saving in terms of time. But this is not the one that we should go for, because however much the enumerator may think that the selection he is making is random, it is actually not really random. Students, you will be interested to know that psychological studies have been conducted which have established that the human mind is a poor random selector. What I mean is this: suppose I ask you to keep calling out numbers from 1 to 20 at random. You might say 1, 3, 2, 7, 12, 19, and you think that this is totally random. But when you analyze this data, you will see that a great many of the numbers you called out were odd numbers and very few were even numbers. The same is the situation in the case of non-random sampling. The enumerator may be thinking that he has been able to interview all kinds of people regarding this particular product, but it is possible that those people are not really a very good representative of the population. Of course, there will be situations when it will be better to resort to non-random sampling.
For example, suppose that the rector of your university wants to send one of the students of this university to some foreign country for a particular conference. Obviously, in this case he should use his best discretion to select a student who will be able to represent the university in a proper way. And obviously, this student will be one of the most intelligent and capable students. But from the statistical point of view, this student will not be a proper representative of the entire population of this university. In the total population of 2000 to 2500 students, all kinds of students are present: some who are very intelligent, some who are not that intelligent. So, this is the key concept as far as sampling from the statistical point of view is concerned. A sample, students, is supposed to be a miniature replica of the population. All right, let us now focus our attention on random sampling. Students, random sampling is that in which you select your sample by the lottery method. This is the simplest way of saying it. In this category, we have a number of types of sampling, such as simple random sampling, stratified random sampling, systematic sampling, cluster sampling, multistage sampling and so on. In this course, I will be focusing on the simplest kind of random sampling, and that is simple random sampling. Simple random sampling is that in which the chance of any one element of the parent population being included in the sample is the same as for any other element. Now let us see how we actually draw this sample. You know that the traditional method of the lottery is to write the names of the students on chits of paper, fold them and put them in a hat. You would then pull out one of those chits of paper. Now, as you know very well, better than me, computers are able to conduct the lottery method for you very conveniently by generating random numbers. Here, I will discuss the use of the random number table.
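The computer version of the lottery method mentioned above can be sketched in a few lines of Python. This is only an illustration under assumptions: the frame is a made-up list of 1000 student IDs (`student_0001`, and so on), the sample size of 10 and the seed are arbitrary choices, and Python's standard `random.sample` stands in for pulling chits out of a hat.

```python
import random

# Hypothetical frame: a complete, duplicate-free list of the population units.
frame = [f"student_{i:04d}" for i in range(1, 1001)]

# A fixed seed is used only so that the same draw can be reproduced.
rng = random.Random(2003)

# Draw 10 units without replacement; every unit has the same chance
# of being included, which is the defining property of simple random sampling.
sample = rng.sample(frame, k=10)

for unit in sample:
    print(unit)
```

Because the draw is made without replacement, no student can be selected twice, just as a chit pulled out of the hat is not put back.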
A random number table, students, is a page full of the digits from 0 to 9, printed on that page in a totally random manner. Actually, these tables are constructed according to certain mathematical principles such that the chance of any one digit appearing at any position on that page is equal to the chance of any other digit appearing there. So, as you can now see on the screen, these digits do not have any systematic pattern or order. I will explain how to draw a sample from this kind of table with the help of an example. Suppose that we have the frequency distribution of the ages of a population of 1000 college students in a particular country. The ages, as you can see, range from 13 to 19, and the numbers of students in the various age groups are 6, 61, 270 and so on. Suppose that we have to draw a sample of size 10 from this population. How do we proceed? The first step is to allot a sampling number to each item in the population, and to do that we should begin by constructing a column of cumulative frequencies. So, as you now see on the screen, the cumulative frequencies for this example are 6, 67, 337 and so on. Now that we have all these cumulative frequencies, we are in a position to allocate the sampling numbers to all these population units. As the cumulative frequency of the first class is 6, what we should do is allocate sampling numbers 000 to 005 to the 6 students who belong to this very first class. Now the cumulative frequency of the second class is 67, whereas the cumulative frequency of the first class was 6. This means that we can allocate sampling numbers 006 to 066 to the 61 students who belong to the second class. Similarly, the cumulative frequency of the third class is 337 and that of the second class was 67, and we can therefore allocate to the third class sampling numbers 067 to 336.
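The allocation of sampling numbers from cumulative frequencies can be sketched as follows. Only the first three frequencies (6, 61, 270) are quoted in the lecture; the remaining four (491, 153, 15, 4) are an assumption made for this illustration, chosen so that the total is 1000 students.

```python
from itertools import accumulate

ages = [13, 14, 15, 16, 17, 18, 19]
# First three frequencies are from the lecture; the last four are ASSUMED
# for illustration so that the population totals 1000.
freqs = [6, 61, 270, 491, 153, 15, 4]

# Cumulative frequencies: 6, 67, 337, ...
cum_freqs = list(accumulate(freqs))

# Each class gets the sampling numbers from the previous cumulative
# frequency up to (its own cumulative frequency - 1).
ranges = []
start = 0
for age, cum in zip(ages, cum_freqs):
    ranges.append((age, start, cum - 1))
    start = cum

for age, low, high in ranges:
    print(f"age {age}: sampling numbers {low:03d} to {high:03d}")
```

Starting the numbering at 000 rather than 001 is what keeps every sampling number to three digits.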
Proceeding in this manner through all the classes, you obtain the sampling numbers from 000 to 999, as you now see on the screen. The interpretation of this column of sampling numbers is that the first student, whose age is 13, has been allotted the sampling number 000, the sixth student has been allotted the number 005, the seventh student, whose age is 14, has been allocated the sampling number 006, and so on, such that the last student, the thousandth student, whose age is 19 years, has been allocated the sampling number 999. I am sure that you have a question in your mind right now: why have we shifted the numbers backward by 1, that is, why has the first student been given 000 rather than 001? The reason is that in all we have 1000 students. If we numbered them with their actual serial numbers, then instead of three-digit sampling numbers we would have to use four-digit sampling numbers. The first student would have been allocated the number 0001 and the thousandth student the number 1000. The whole process would then have to be done with four-digit numbers, and if I can avoid one of those four digits simply by shifting the sampling numbers backward by 1, why shouldn't I do that? Now the next step is to actually select the sample. The sampling numbers have been allocated. Now I have to use the random number table, the one that you just saw, in order to select a sample from this population. You may find this very strange, but the process is that you simply close your eyes and put your finger somewhere on that page, and you take the three-digit number at exactly the place where your finger has landed. Suppose that your finger lands on the number 041. This means that the first student to be selected in your sample is the 42nd student. Remember, the 42nd student was given the sampling number 041. So the sample size is 10 and your first element has been selected. What do you do after that?
You do not have to place your finger again, because the whole table consists of totally random digits. You can simply go down the column and read off the next number, and then the next, and then the next, until you have your sample. So, as you now see on the screen, suppose that in this example the numbers that you obtain are 041, 103, 374 and so on. Accordingly, the ages of the students that have been selected are 14, 15, 16 and so on. In this example, we are dealing with the ages of the students. Suppose that our first objective is to find the mean age of these students. If we compute the mean of the ages in the sample that we have selected, it will be denoted by x-bar, and if we compute the mean age of the entire population, it is denoted by mu. The population mean age comes out to be 15.785 years, whereas the sample mean age comes out to be 15.6 years. Students, we will talk about the details of how such values are computed in a few lectures. So you see that in this example our sampling error, meaning the difference between the sample mean and the population mean, comes out to be minus 0.185, which is quite small. And this sampling error came out to be so small because it was a proper random sample. Whenever you do random sampling, the probability will be high that your sample is a miniature replica of the population. Students, as stated earlier, there are various other types of random sampling, such as stratified sampling, systematic sampling, cluster sampling and so on. As I mentioned, we will not have the opportunity to discuss all these different designs in detail in this particular course, but I would like to present to you a brief description of the concept of stratified sampling.
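The simple random sampling example above (mapping random numbers to students, then computing the sampling error x-bar minus mu) can be sketched in Python. Several things here are assumptions rather than part of the lecture: the last four class frequencies (491, 153, 15, 4) are not quoted in the text and were chosen so that the total is 1000 and the population mean equals the stated 15.785; Python's `random` module replaces the printed random number table; and the seed is arbitrary, so the drawn sample will differ from the one on the lecture slide.

```python
import random
from bisect import bisect_right
from itertools import accumulate

ages = [13, 14, 15, 16, 17, 18, 19]
# First three frequencies are from the lecture; the last four are ASSUMED.
freqs = [6, 61, 270, 491, 153, 15, 4]
cum_freqs = list(accumulate(freqs))  # 6, 67, 337, ...


def age_of(sampling_number):
    """Return the age class of the student holding a sampling number (0 to 999)."""
    return ages[bisect_right(cum_freqs, sampling_number)]


# The three random numbers quoted in the lecture: 041, 103, 374.
quoted_ages = [age_of(k) for k in (41, 103, 374)]
print("ages for 041, 103, 374:", quoted_ages)

# Population mean (mu) from the frequency distribution.
mu = sum(a * f for a, f in zip(ages, freqs)) / sum(freqs)

# Draw 10 distinct sampling numbers; a repeated number read from a
# printed table would simply be skipped, which is assumed here.
rng = random.Random(1)
drawn = rng.sample(range(1000), k=10)
sample_ages = [age_of(k) for k in drawn]

x_bar = sum(sample_ages) / len(sample_ages)
sampling_error = x_bar - mu
print("mu =", mu, " x_bar =", x_bar, " sampling error =", sampling_error)

# The lecture's own figures: x_bar = 15.6 and mu = 15.785.
lecture_error = 15.6 - 15.785
print("lecture sampling error:", round(lecture_error, 3))
```

Changing the seed gives a different sample and therefore a different sampling error, which is exactly the point: the error arises only because a sample is a part of the population.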
The first and foremost point to be noted regarding stratified sampling is that the procedure of simple random sampling, the one that I just described, is appropriate in situations where the units contained in our population are similar to each other with respect to the object of our study. In other words, simple random sampling is appropriate when our population is homogeneous. In contrast, there are numerous situations where the units of the population under study are not all similar to each other. In other words, the population is not homogeneous but may be regarded as a heterogeneous population. Stratified random sampling is a type of probability sampling which is suitable in this kind of situation, and here the population is divided into relatively homogeneous groups, technically called strata. Students, let me explain this point with the help of an example. Suppose that a study is to be conducted regarding the advertising expenditures of the 352 largest companies in a particular country. Suppose that the objective of the study is to determine whether firms with high returns on equity, that is, high profitability, spend more on advertising than firms with a low return or a deficit. Now, since there is a considerable amount of variation between the 352 firms with respect to profitability, we cannot say in this situation that the population is homogeneous, and we would rather divide the firms into various groups in accordance with their profitability level, the grouping being done in such a way that the companies that fall within a group are relatively homogeneous. Suppose that it was decided that the 352 firms should be divided into five strata, as shown in the table that you now see on the screen. Stratum one consists of those companies whose return on equity is 30 percent or higher, stratum two of the companies having a return between 20 percent and 30 percent, and stratum three of the companies having a return between 10 percent and 20 percent.
Stratum four consists of the companies having a return between 0 percent and 10 percent, and stratum five of the companies that have suffered a deficit. Suppose, as you can see on the slide, that out of the 352 firms 8 fell in the first group, 35 in the second, 189 in the third, 115 in the fourth and only 5 in the fifth and last stratum. The concept of stratified sampling is that once the population has been divided into various strata, a simple random sample should be drawn from each stratum. In this regard, the first question that arises is how large a sample should be drawn from the i-th stratum. The answer to this question is that stratified sampling can be done either by the method of proportional allocation or by the method of non-proportional allocation. Now, the allocation is said to be proportional when the total sample size n is distributed among the different strata in proportion to the number of units in those strata. In other words, the allocation is said to be proportional if n_i = N_i × (n / N), where n_i is the sample size for the i-th stratum, N_i is the population size of the i-th stratum, n is the overall size of the sample and N is the total size of the population. In this particular example, N_1 is 8, N_2 is 35 and so on, and of course N, the overall population size, is 352. With an overall sample size of n = 50, the ratio n / N is 50/352, or approximately 0.142, so applying the formula that I just presented, n_i = 0.142 × N_i. Applying this formula to each of the five strata that we have in this example, we obtain n_1 = 0.142 × 8, which is approximately equal to 1. Similarly, n_2 comes out to be 5, n_3 is 27, n_4 is 16 and n_5 is 1.
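The proportional allocation just described, n_i = N_i × (n / N), can be checked with a short Python sketch. The stratum sizes and the total sample size of 50 are taken from the lecture; rounding each n_i to the nearest whole number is an assumption about how the lecture's figures were obtained.

```python
# Stratum population sizes N_1 ... N_5 from the lecture.
stratum_sizes = [8, 35, 189, 115, 5]

N = sum(stratum_sizes)  # total population size: 352
n = 50                  # overall sample size

# Proportional allocation: n_i = N_i * n / N, rounded to a whole firm.
allocation = [round(N_i * n / N) for N_i in stratum_sizes]

for i, (N_i, n_i) in enumerate(zip(stratum_sizes, allocation), start=1):
    print(f"stratum {i}: N_i = {N_i:3d}, n_i = {n_i}")
print("total sample size:", sum(allocation))
```

In this example the rounded values happen to add up to exactly 50. In general, rounding can make the total differ slightly from n, in which case one or two strata have to be adjusted by hand.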
These allocations lead to the table that you now have on the screen. As you can see, this table shows that if a total of 50 firms are to be selected for intensive study, then one firm with a level of profitability of 30 percent or more will be included in our sample, five firms in the 20 to 30 percent stratum will be selected in our sample at random, and so on. So, students, this is the procedure of stratified sampling by proportional allocation. That is, the same proportion that the overall sample size bears to the overall population size is maintained within every stratum. Students, regardless of whether the stratified sampling has been done by proportional or non-proportional allocation, the important question is: what is the advantage of this particular type of sampling? The answer is that stratified sampling ensures that every group present within the heterogeneous population is represented in the sample. In the example that we just discussed, it is worth noting that only about two percent of the firms fell in stratum one and only about one percent in stratum five. If we had done simple random sampling, drawing a single simple random sample from the entire population, it was quite possible that no company from these two strata would have been included in our sample. But a stratified random sample ensures that at least one firm from stratum one and one from stratum five are represented in the sample. In other words, in such a situation stratified sampling has the advantage of being able to reflect the characteristics of the population more accurately than simple random sampling. So, this was a brief introduction to the concept of stratified sampling. But students, in this particular course, all the techniques of inferential statistics that we will be discussing in the forthcoming lectures will be with reference to simple random sampling, the one that is applicable in the case of a homogeneous population.
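The remark that a single simple random sample could easily miss the small strata can be made concrete with a short calculation. This is a sketch that goes slightly beyond the lecture: it assumes one simple random sample of 50 firms drawn without replacement from all 352, and uses the standard counting argument that the chance of missing a stratum of size k is C(352 − k, 50) / C(352, 50).

```python
from math import comb

N = 352  # total number of firms
n = 50   # size of a single simple random sample


def prob_stratum_missed(stratum_size):
    """Probability that a simple random sample of n firms contains
    no firm at all from a stratum of the given size."""
    return comb(N - stratum_size, n) / comb(N, n)


p_miss_stratum_1 = prob_stratum_missed(8)  # stratum one has 8 firms
p_miss_stratum_5 = prob_stratum_missed(5)  # stratum five has 5 firms

print("P(no firm from stratum 1):", round(p_miss_stratum_1, 3))
print("P(no firm from stratum 5):", round(p_miss_stratum_5, 3))
```

Under these assumptions the two probabilities come out at roughly 0.29 and 0.46, so missing one of these strata entirely is far from a rare event, whereas stratified sampling rules it out by construction.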
Students, this brings us to the end of today's lecture, and I would like to encourage you very much to study the concept of sampling in some detail. Best of luck and Allah Hafiz.