to take a guess? Have you seen the crossword? Are you going through it? Okay, filling it in? If anyone has thought of an answer, let us take a guess. I have made a small change in three across, to "in a population/sample". See the clue for three across: collection of data from every element in a population. That we never do, so I have just changed it to population/sample. Do not let it mislead you into thinking that we are going to collect from every object in a population; we are not going to do that.

Did you get the answer to any clue? Data. So data is four down. Then somebody said collections. So which one is that? What is the next one? What else? Anything else? So both are correct. What next? Yeah, somebody said something here. Yeah, effects. Which one? Five across. Okay, that seems to be okay. Anything else? Yeah, sample, very good. Okay, three. Who said that? Okay, so that is three across, right? Then only two are left. Sorry, themselves. Yeah, that is correct. Now the first one is easy. Yes, who said that? Raise your hand. Somebody said continuous. Yeah, okay, good.

So let us start with today's lecture. We are going to cover some aspects of descriptive statistics in today's class. We will discuss some more organizational matters. We will talk about describing data sets and summarizing them. In particular, we will look at mean, mode and median. And then we will do some Scilab calculations. This is what I plan to do today.

How many of you have logged into Moodle? Okay, thanks. Very nice. Please write on the wiki. The wiki pages are also up. Has anybody written anything? Who wants to be the first? I have already created the wiki. Has anyone seen that? Yes? Please start writing. And the survey will be given shortly. Actually, we came up with lots of questions. So there is a debate as to whether we give it in one go or do it in groups, you know, in two sets. What should we do? Give them all together.
Because there are 20 to 30 questions. It won't take much time. It is just multiple choice. And it is going to be done in Google Forms. Right? You have already done anonymous surveys. It is going to be run by your student representatives. Manas is here. Where is Manas? Yeah, Manas is sitting there. He is going to organize this. He has done this in the past. Are you familiar with this? Have you taken part in a Google Forms survey? Yes? Manas, can you respond? Can you just explain what that is? So it is just like any other form. You need to fill it up. But complete anonymity will be assured. You do not need to fill in any personal details. And where will this be hosted? A link will be given to them. Is that clear? Okay. So this will be given shortly. I still want an answer from you. Should we do this in one go or should we split it into two surveys? One go. Anyone for two surveys? Okay. Thank you.

We will spend the next tutorial over here in this classroom. All the 10 batches will come here. Who is in the first batch? Can you raise your hands? First tutorial batch. Okay. How many are in the second tutorial batch? What about the rest? You do not know the batch? Not given yet? Okay. So there are going to be two slots: one at 9:30 in the morning on Tuesday, the other at 11:30. This information will be given to you. Normally there will be 20 tutorial batches, 10 batches in the 9:30 slot and another 10 in the 11:30 slot. We have 10 classrooms in the main building where this will happen. There will be two TAs for each tutorial class. All the tutorials will take place in the main building. That information will be made available to you. That will be the normal procedure, but for the first tutorial we will do it here. I am going to do that twice because not all of you can come at the same time. If I am not mistaken, you have, you will have to do some other.
Not everybody will be free at the same time for the tutorial. Right. So as a result it has to be done twice. I will do it on Tuesday. What I have described here will be done from the next tutorial hour onwards. Okay.

So we will use data. Of course, in the later part of the course you would have generated your own data, and that will be used. But suppose we want to use some data in today's class. I cooked up some data sets. The first is GPA. The next one is your mess bill in the last 6 months. If you notice, this is an even number: there are 6 data points. In the previous one I have 7. I wanted to have one odd number and one even number, because depending on that some of the calculations will change, as we will see shortly.

In data handling, the first thing is collection of data, then viewing, that is, by graphical means. Of course, we should understand that some of the techniques were developed a long time ago, when graphical viewing was not an easy thing. As a result some simplifications were made, and some techniques may not even be popular now. But you may just want to be aware of them, because nowadays we have sophisticated viewing techniques. We can see the data beautifully on a computer. We will see some of that today. Filtering, handling outliers, and so on: these are some of the things that one may have to do while handling the data. Before we start computing, before we start making decisions, you want to do some preliminary things. These are some of those things.

Frequency tables, are you familiar with them? Yes. Can somebody tell what it is? What is a frequency table? Anyone? So it is a data point and the number of times it occurs. Right? Supposing you put down all the marks, then you say that of the people who scored 90, there are 5. Supposing there are, for example, 700 students; you write the marks of each of the students separately. Roll number 1, 88; roll number 2, you know, 92; whatever.
You do the whole thing and then you say that the number 88 has occurred 10 times, 92 has occurred 50 times. So then what you have is a frequency table: value versus the number of times it occurs.

Relative frequency table: you can normalize this with respect to, for example, the overall number. So it becomes normalized, and you give everything in fractions. This is especially useful when you are looking at data where the total makes sense. For example, you want to write down the income of each person in a group. Then you ask what the total income is, and you divide by the total income. Then everything adds up to 1, and it will make sense. That becomes a relative frequency table. You can represent these in pie charts. You can do this through histograms. We will do some calculations here using Scilab.

So this is what I have said. Frequency table: number versus occurrences; if large, group them. What do I mean by "if large, group them"? Supposing there are, let us say, a thousand data points in a sample. In the old days, when you had to do these things manually, you would say that it is very difficult to keep track of all of that. So you would say, let me group them. Let me convert them into intervals. Suppose the entire range is 0 to 100. Then you might say, let me collect all the people who are in 0 to 9, and then 10 to 19, and so on. You group all of them and put them together. That is what I mean by group them. If you have a smaller number of groups, which means you put lots of things in one group, it becomes easier to represent, but you are going to lose some information. If you do it on the computer nowadays, it does not matter. You do not even have to group them. And if it is normalized, as I mentioned, it is a relative frequency table. I would suggest that you read about this from the textbook. It is described there.
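The three tables just described (frequency, relative frequency, grouped) can be sketched in a few lines. This is only an illustration in Python rather than the Scilab used in class, and the ten marks below are made-up numbers, not data from the lecture.

```python
from collections import Counter

# Hypothetical marks for ten students (illustrative data, not from the lecture).
marks = [88, 92, 88, 75, 92, 88, 60, 75, 92, 88]

# Frequency table: each value versus the number of times it occurs.
freq = Counter(marks)

# Relative frequency table: divide each count by the total number of
# observations, so that the fractions add up to 1.
n = len(marks)
rel_freq = {value: count / n for value, count in freq.items()}


def bin_label(mark, width=10):
    """Return the interval label (e.g. '80-89') that a mark falls into."""
    low = (mark // width) * width
    return f"{low}-{low + width - 1}"


# Grouped frequency table: count marks per interval of width 10.
grouped = Counter(bin_label(m) for m in marks)

for value in sorted(freq):
    print(value, freq[value], rel_freq[value])
print(dict(grouped))
```

Choosing a wider interval gives fewer groups and a shorter table, at the cost of losing the individual values, which is the trade-off mentioned above.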
There is not much to discuss about these tables, but I would want you to go through them so that you are aware of them. Okay.

We will now discuss the measures of central tendency: mean, median and mode. This is where we are going to spend the rest of the class. You can think of them as something like a single number to represent the whole data. Right. It gives some kind of an average. If you have thousands of data points, how do you represent them? One way is through mean, median and mode: one of these, or maybe all of them. Of course, we will also talk about variance. We will talk about standard deviation and things like that. But you can think of this as the first measure, some way of representing the whole data set by a single number.

So what is mean? What is meant by mean in statistics? Anyone? Raise your hand. I cannot see. I cannot hear. What is the mean? Okay. Normally when we say mean in statistics, we mean the arithmetic mean. Is that okay? All right. And typically we study the mean of a sample. We never really calculate the population mean. Population means you take everything in the universe that you are studying. You collect a small sample, you do the entire analysis on the sample, and then hopefully you can predict the behavior of the whole population based on the analysis that you do on the sample. Right. That is the usefulness of the sample: you do not have to deal with the whole population.

Suppose there are n observations, x1 through xn. What is the mean? Yeah. Sigma xi by n. Okay. We denote it with a bar on top of the variable: x bar equals 1 over n times the sum of all the values, where n is the number of observations. So this is straightforward. We can actually do some Scilab calculations. So I have just opened this window here.
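As a rough sketch of the formula just stated, x bar = (1/n) times the sum of the x_i, here is the same calculation in Python instead of Scilab. The seven GPA values are the lecture's small data set as best it can be read from the recording; the last value, 6.8, is inferred from the later histogram and mode discussion, so treat the exact list as an assumption.

```python
def sample_mean(xs):
    """Arithmetic mean of a sample: (1/n) * (x_1 + ... + x_n)."""
    if not xs:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(xs) / len(xs)


# GPA data set with 7 observations (the value 6.8 is an inferred reading).
gpa = [7.8, 8.3, 6.2, 5.8, 9.1, 7.2, 6.8]

gpa_mean = sample_mean(gpa)
print(round(gpa_mean, 2))
```

The function is a direct transcription of the definition: add the observations and divide by how many there are.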
So GPA is, let me just put 7.8, 8.3, 6.2, 5.8, 9.1, 7.2, 6.8. These are the 7 GPA values I showed earlier. So, the mean: would you want to calculate it? Just calculate the mean. I will give you one minute. Actually these two are one and the same, so you take either one. It does not matter. But the first one is what I have given. Okay. So you have done that. What is it? About 7.3. So let us see what Scilab does. You see that the command mean is built in.

Then you can also do, let me see, histplot. So here is the histogram plot. Is that okay? Yeah, it has grouped them. For example, 5.8 and 6.2: this is 5.8, this is 6.2, this is 6.8, and so on. In the last ones, you have only one. If you want to know more about any function in Scilab, you can say help. When I say help histplot, it comes up and gives you the description, gives you an example, and so on. So that was just to give a brief idea of some calculation. We will come back to Scilab when we do another set of calculations.

Mean is straightforward. Just take the numbers, add them, divide by the total number, and you get the arithmetic mean. You get the mean of the sample that you have.

We will present a case study on the US economy. This is the truth. This happened. This was reported by ABC News, in the column Who's Counting. The title of the news story is "It's Mean to Ignore the Median". We are actually now going to talk about the median. So here is the story. Professor John Paulos writes an article for ABC News once a month. He is a mathematics professor at Temple University. This is the article that he has written. It came out on August 6th, 2006. This story talks about a discussion that took place in the US Congress. What happened there? It says that the Republicans preferred the mean while the Democrats preferred the median. Okay.
To discuss the state of the economy. How is that? The Republicans claimed that the economy grew at a healthy rate of 4.2%. They were talking about the economy in the year 2004. When did this story appear? 2006. Because that is when they got the complete picture of what happened in 2004. So the Republicans claimed that the economy grew at the rate of 4.2%. The Democrats, on the other hand, claimed that the real median family income fell and poverty increased.

So what do you think? Does anybody want to make a comment? Does anyone want to say how this could have happened? Anyone? I see somebody raising their hand. Yeah. You want to say? The higher income groups, their income must have increased, and the rest must have remained the same or their income may have decreased. So the mean can increase, but the median indicates that poverty has increased. Very good. What's your name? Alankar. Alankar. Yeah. That is warranted. Thank you. Thank you. So that is precisely what happened. Of course, you could also have taken a guess from the fact that the Republicans are supposed to support the higher income groups, whereas the Democrats are supposed to be representatives of lower income people. So in that sense also you could have guessed it.

So here are some statistics. During 2004 the real income of the richest 1%, those making $315,000 or more annually, grew by 17%, and the income of all others did not grow at all. These data are given by two famous economists. If you see the previous slide, there is a debate: who is correct? The clarification came from these two economists, and they say that the income of the richest 1% of the people went up by 17% and the income of all others did not grow. So the true picture is that the huge increases in income went to those with already huge incomes.
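To see how both claims can hold at once, here is a toy calculation in Python. The population is entirely hypothetical: 100 households, of which 99 earn 30,000 and one earns 315,000. Only the 17% rise for the top 1% and the zero growth for everyone else follow the figures quoted above, so the resulting growth rate is not meant to reproduce the 4.2% in the story.

```python
from statistics import mean, median

# Hypothetical population of 100 households (illustrative numbers only):
# 99 earn 30,000 and the richest 1 earns 315,000.
before = [30_000] * 99 + [315_000]

# The richest 1% gains 17% (integer arithmetic avoids rounding noise);
# everyone else's income does not change.
after = [30_000] * 99 + [315_000 * 117 // 100]

# Percentage growth of the mean income and of the median income.
mean_growth_pct = 100 * (mean(after) - mean(before)) / mean(before)
median_growth_pct = 100 * (median(after) - median(before)) / median(before)

print(f"mean income grew by   {mean_growth_pct:.2f}%")
print(f"median income grew by {median_growth_pct:.2f}%")
```

In this sketch the mean rises purely because of one household, while the median, which sits inside the large group at 30,000, does not move.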
Suppose you just plot income versus growth. If you look at the highest income, can you see all of it now? At the highest income the growth was very high, but at smaller incomes you might have it like this, like this, maybe some negative also. You have some negatives, zeros, very small numbers. It is like the CEO of a company doubling his income while everybody else got no pay hike: overall there was a profit, the CEO got 1% of the profit, so his salary doubled, something like that. Everybody else got no pay hike.

Coming back here, the minimum wage in the year 2004 was the lowest in real terms since the 1950s. Once again, this was told by these economists. Which means that if you look at the smallest income and its growth, you would see that it did not grow at all. In fact, it went to its lowest if you look at the absolute values.

So what is median? Here, if you see, the Republicans preferred the mean. They said that the mean income went up by 4.2%, whereas the Democrats claimed that the median family income fell and poverty increased. So the question is, what is median? How do we calculate it? I have one more case study also where they have used the median. So the median is actually quite an important measure, although we do not seem to use it so much; we always seem to think that we could just take a mean.

So how do we calculate the median? What is median? Does somebody want to tell? For example, here they talk about median income. Or, for that matter, say you want to find the median GPA. How will you do this? Anybody, any guess? So here is one answer: 5.8. 7.2. Somebody says 5.8. How many people want 5.8? Raise your hand. So, people who claim 7.2, what is wrong with 5.8? It is not the median. So what do you have to do first?
So median means, it is okay, they have taken the middle point, but before that what do you have to do? You have to arrange them. You have to sort them first, arrange them in increasing order, and then take the midpoint. Order the numbers in increasing order and take the middle value. For example, this is what I have done here with the same values. Now it goes in increasing order, from 5.8 onwards. Take the middle value. You take the middle value when you have an odd number of points. I have 7, so I have an odd number. Let us first see whether Scilab gives the same thing. So we have this GPA, and then median GPA is 7.2. It essentially sorts them and picks the middle number. So that is the median. Is that okay?

What about the mess bill? Let us sort them first. The first thing you have to do is to arrange them in increasing order. So what is the median now? Somebody said something. What value? 1670. With an even number of points you will have 2 numbers in the middle. Take the average of those. Take the arithmetic mean of those 2 numbers. Here 1640 and 1700 are the numbers. You take the mean of those. So let us do that here. So bill equals 1530, 1640, 1728, 1750, 1700 and then 1600. Median bill is 1670.

Now let us go back to your mess bill. Let us see the pros and cons of this. We have already seen some of the effects of some large numbers coming in. Under what conditions should you use the mean? Who will prefer the median, and so on? We will now repeat that for this mess bill. You have this mess bill, but suppose in this month 3 friends piled on you and ate on your account. Okay? So now your mess bill in this month, instead of being 1750, has gone to 5000. And you have to tell your daddy what is the average that you are spending. So you will say that on an average I paid a total of so much money divided by 6. What is going to happen to the mean?
Compare the mean with the previous mean, that is, before the bill went from 1750 to 5000. Has the mean changed? Has it gone up or gone down? Same? It has gone up. Okay? What about the median? The median is the same, constant. Excellent. I have left it out because I thought that you could do this in Scilab. You can do that. You will find that the mean would have gone up but the median would be the same.

So as a result the mean is sensitive to perturbations. Is that okay? Hence it is a noise-sensitive measure. I call it noise because, supposing your father asks you how come you are spending so much money, then you have to say, look, this is an unusual situation. It does not happen every time. I actually spend only so much. So you can think of that one data point as an outlier, or something that has been caused by some unusual circumstances. It does not happen all the time. Whereas the median is a noise-insensitive measure. It could change a little bit, but generally it will be about the same. In this case, for example, the median does not even change. Here, for demonstration purposes, I have changed the largest number, an extreme point. If as a result something moves a little bit, it is possible for the median also to change slightly, but not a whole lot.

So the median has a way of concentrating on numbers where there is a big grouping. Go back to the data on US income, where we talked about $315,000. Suppose that most people are getting about $30,000. Then if you mark the incomes on the x-axis, you will see that a lot of people come around $30,000. If you then find the median, you count the numbers, go to the middle one, and you will land smack inside that big group. In that sense the median gives the behavior of what most people do in a population, or what the largest concentration does in a population.
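The sort-then-take-the-middle rule and the mess bill experiment can be sketched as follows, in Python rather than Scilab. The bill values are the six read out in class; the GPA list is the lecture's seven values as best they can be read from the recording (the 6.8 is an inferred reading), and the helper names are only illustrative.

```python
def sample_mean(xs):
    """Arithmetic mean: sum of the observations divided by their count."""
    return sum(xs) / len(xs)


def sample_median(xs):
    """Sort, then take the middle value (odd n) or the mean of the two middle values (even n)."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


gpa = [7.8, 8.3, 6.2, 5.8, 9.1, 7.2, 6.8]      # 7 points: odd case
bill = [1530, 1640, 1728, 1750, 1700, 1600]    # 6 points: even case

# The month in which three friends ate on your account: 1750 becomes 5000.
bill_outlier = [5000 if b == 1750 else b for b in bill]

print(sample_median(gpa))                                      # 7.2
print(sample_mean(bill), sample_median(bill))                  # 1658.0 1670.0
print(sample_mean(bill_outlier), sample_median(bill_outlier))  # mean rises, median stays 1670.0
```

Only one of the six values changed, yet the mean moves from 1658 to roughly 2200, while the two middle values of the sorted list, 1640 and 1700, are untouched, so the median stays where it was.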
Is that clear? Now, while talking about mean and median, I have a figure here. I have taken this from election.princeton.edu. Look at the first line: median EV. EV you can think of as electoral votes. Of course, Obama may say it is election victory. But whatever that is, EV is estimated with a 95% confidence interval, and the statistic used is the median. By the way, this was done by Professor Wang at Princeton University, and he gives Obama's electoral, say, victory or votes. This is the majority mark. He marks different events: HRC withdraws, celebrity ad, RNC, debate 1, debate 2, debate 3, and so on. And finally the numbers that he gets are extraordinarily accurate, and of course the 95% confidence interval is also given, and it is outstanding. He used the median because he did not want the noise to affect the data. So for the predictions and so on he wanted to use the median.

There is in fact another case study that I came across. Some of you may want to spend time on it, because I did not fully understand this case study; that is the reason why I did not put it in the slides. As I was searching for different things, I found this case study on Wikipedia. They analyzed something like the Wikipedia pages of the 100 senators in the U.S., and they counted how often the page of a particular senator got damaged by someone entering wrong information. You may wonder: this is being handled by millions of people, they must be changing things, so how correct is it? If you do a Google search, you can locate this study on Wikipedia and the 100 senators. So this is a big scandal. For example, one of the news items says that, for a particular senator, this person turned 85, he collapsed, he was taken on a stretcher to the hospital, and he died in the hospital. And when people verified it, they found that he was hale and hearty. So, you know, people write all kinds of stories.
So when they write a wrong story, immediately these people go back and correct it. And they say that it takes a median of, I think, 6 minutes to correct it, whereas if you look at the mean it is one day. I would want you to look at that and see if you can throw some more light on it. It is amazing that with so many people writing so many things, things are actually correct. It turns out that in celebrity pages the pages get damaged quite frequently, and these people constantly monitor them. Once again, this story talks about a distinction between mean and median. So it will be nice if some of you can actually read about it and explain it.

So far we have talked about only mean and median. There is one other thing, called the mode. The mode is the observation or value that occurs with the greatest frequency. Actually, in both of these data sets there is no mode, because if you see here, no number is even repeated. Everything occurs only once. But suppose, let me see, nothing is very close. Suppose this person who has got 6.8 finds that his GPA was not calculated properly. He goes back and gets it changed to 7.2. Then what will happen is that the mode is 7.2, because that is the value that occurs the greatest number of times. In this case it occurs twice, and that is the largest. Is that okay? So in a unimodal distribution you will have only one mode, the one value that has the largest occurrence. If you have two values that both occur the largest number of times, then you have a bimodal distribution. So it is possible to describe the distribution in terms of the mode.

So how is the mode useful? Where would you think of a situation where you can do some analysis using the mode? Let me give an example. Suppose an interesting, let us say, cricket match is happening on TV. And of course other channels are also showing some interesting things at that time, so that at least some people will watch their channels.
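Returning for a moment to the GPA example of the mode given a little earlier, the counting can be sketched in Python as below. The function name `modes` is only illustrative. It follows the lecture's convention that a data set in which nothing repeats has no mode, and it returns every value tied for the highest count, so a bimodal data set gives two values. The 6.8 in the GPA list is the value the lecture changes to 7.2.

```python
from collections import Counter


def modes(xs):
    """Return the value(s) occurring most often; empty list if nothing repeats."""
    if not xs:
        return []
    counts = Counter(xs)
    top = max(counts.values())
    if top == 1:
        # Every value occurs exactly once: no mode, as in the lecture's data sets.
        return []
    return sorted(v for v, c in counts.items() if c == top)


gpa = [7.8, 8.3, 6.2, 5.8, 9.1, 7.2, 6.8]

# The student with 6.8 gets the GPA corrected to 7.2.
gpa_corrected = [7.2 if g == 6.8 else g for g in gpa]

print(modes(gpa))              # [] : no value is repeated
print(modes(gpa_corrected))    # [7.2] : occurs twice
print(modes([1, 1, 2, 2, 3]))  # [1, 2] : two values tie, a bimodal data set
```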
So if you plot the number of people watching different programs, you will see the mode to be the cricket program, because most people are likely to be watching this interesting cricket match. How is this useful? That was the question. If somebody turns on the TV and watches a program, what are they likely to watch? It is likely to be the cricket match. So the mode actually tells you what is the most likely thing to happen, since it is the thing that happens with the largest frequency. The same applies, for example, when you go to a company and take up a job, and you ask what salary you are likely to get. Once again you can draw some conclusions: what is likely to be your average income, what is likely to be your expenditure. Many of these things can be done using the mode. That is why the mode is a useful statistic. So I have come to the end of the class. Thank you for your patience thank