Good morning all. So we will get started with our presentation. Outline for today's class: we'll talk about quartiles, Chebyshev's inequality, and then the concept of what is meant by approximately normal. And then we'll talk about paired data sets, where we discuss the relation between two sets of data. For example, the salary goes up because a DA, the dearness allowance, is announced. Then is the price of, let's say, brinjal correlated with it, things like that. Next we will take up the sample percentile. So the sample 100p percentile: if you say 25th percentile, then p is 0.25; if you say 10th percentile, then p is 0.1. I'm sure some of these things are already known to you, but we will just go through this quickly. It's also in the textbook. It is defined in such a way that at least 100p percent of the data are less than or equal to that data value. So we are talking about when you have lots of points, which have been arranged in a particular sequence. For example, you know the median. The median is approximately the midpoint, and that is known as the second quartile. Quartiles go in steps of 25 percent: the 25th percentile is known as the first quartile, and the 50th percentile is the second quartile or median. We have already discussed the median in great detail: it corresponds to about the midpoint when the data points are organized in increasing order. So once you order the data, you go to 25 percent, meaning one fourth of the way, and that is the first quartile; the median is the second quartile; 75 percent is the third quartile. I am sure you are aware of this. Many of the testing agencies give their results in terms of percentiles. So this is the third quartile. Roughly speaking, if the position falls in between two data points, take the next higher one. I say roughly speaking because we didn't do exactly this for the median. What we did there was this: we had lots of values, with, say, 7 and 8 in the middle.
So in this case, supposing we had something like this, we took the mean of the two, and that was our median. So roughly speaking, that is why I said that if the position is in between, take one higher. This is the example that has been given in the book. Take one value higher, which means I will be taking the larger value; but if the position is exactly on one data point, take the midpoint of this and the next value. Mostly there is no confusion when you get the position exactly, but when the value has to be rounded, there are some issues. In fact the percentile is not defined uniquely. There are many definitions, and unfortunately the book discusses only one definition and goes through it quickly. So what I will do is, in the interest of time, I will just move on. It is discussed in great detail on the website cnx.org, so if you are interested, you can go through that. Some examples are given there, but I thought I would skip this because we are running out of time. If you want, I will give a link for this on the course website. In yesterday's class, we solved the crossword puzzle and we got something like box and whisker plot. Do you remember this box and whisker plot? So here it is. I have given some annotations also. So this is a data set called zit, whatever that is. I took it from Wikipedia and just added some notes. It is plotted starting from the smallest value. If I click this, I am not sure whether you can read the note there. You probably cannot. It says minimum. So this point refers to the minimum. That is, you have a data set, and it belongs to some particular range, minimum to maximum. So this is the minimum, and if you see here, this is the maximum value; they are marked minimum and maximum. And then this is the first quartile. This is the median. This line corresponds to the median.
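The "take one higher, or average with the next value" rule described above can be sketched in a few lines. This is only an illustration of that one definition (other definitions of the percentile exist, as noted in the lecture); the function name `sample_percentile` and the small data sets are made up for the example.

```python
import math

def sample_percentile(data, p):
    """Sample 100p percentile using the rule from the lecture:
    - if n*p is not an integer, take the value at the next higher position;
    - if n*p is an integer, average that value and the next one.
    Positions are 1-based in the ordered data."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    xs = sorted(data)
    n = len(xs)
    pos = n * p
    if abs(pos - round(pos)) < 1e-9:        # n*p lands exactly on a position
        k = int(round(pos))
        return (xs[k - 1] + xs[k]) / 2      # midpoint of this and the next value
    return xs[math.ceil(pos) - 1]           # in between: take one higher

# Hypothetical data, not from the book
even_data = [7, 8, 1, 3, 5, 2, 9, 4]        # n = 8
odd_data = [1, 2, 3, 4, 5, 6, 7]            # n = 7

for p in (0.25, 0.5, 0.75):
    print(p, sample_percentile(even_data, p), sample_percentile(odd_data, p))
```

With p = 0.25, 0.5 and 0.75 this gives the first quartile, the median and the third quartile, which together with the minimum and maximum are the five values drawn in a box and whisker plot.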
This corresponds to the first quartile. This is the minimum value. This is the maximum value. This is the third quartile. So the box and whisker plot is something we discussed yesterday. It is a way of representing an entire data set using a figure of this type: minimum, maximum, first quartile, median and third quartile. The lengths of these lines also denote the range, the distance between these data points. So it is a convenient way to represent the data. This is what we saw in the last class. In the book it says box plot, but it is also known as the box and whisker plot. Next we will go to the next topic, which is Chebyshev's inequality. It essentially tells you, given lots of data, how many data points will lie within some distance of the sample mean. So for example, suppose you have a collection of data points, let us say x1 through xn, and let us calculate the sample mean, which is x bar, and then calculate the standard deviation also. Let us call it s. We saw the formulas for all this. Now let us define the symbol S of k. What is S of k? It refers to all elements of this data set within k standard deviations from the mean. That means all xi whose distance from the mean x bar is less than k times s. Here k is any number greater than or equal to 1. For example, I can take k to be 1.5. Then S of 1.5 refers to all the elements that are grouped around the mean within 1.5 standard deviations. Then let us say that N of S k is the number of elements in that group. Can we give some estimate of this number? So here is the formula. The number of elements in this group, N of S k, divided by n, where this n is the total number of elements, will be greater than or equal to this expression, which in turn is greater than 1 minus 1 over k squared. In the book it is written as k sub 2, but there is no sub 2. It should be k squared.
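Written out, the definition and the inequality described above look as follows. The notation follows the lecture (S_k for the group, N(S_k) for its size); the intermediate expression is the one from the standard textbook statement of Chebyshev's inequality for data sets, which is assumed here to be "this expression" shown on the slide.

```latex
% Points within k sample standard deviations of the sample mean, k >= 1
S_k = \{\, i : 1 \le i \le n,\ |x_i - \bar{x}| < k s \,\}

% Chebyshev's inequality: fraction of the data that lies in S_k
\frac{N(S_k)}{n} \;\ge\; 1 - \frac{n-1}{n k^2} \;>\; 1 - \frac{1}{k^2}
```

For large n the two right-hand expressions are very close, so the simpler bound 1 - 1/k^2 is the one normally used.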
So what I have here is correct. So the fraction of elements in this set, the group that I talked about, will be greater than 1 minus 1 over k squared. This is Chebyshev's inequality. So what does it mean? Let us do some calculation. Suppose we choose k equals 3 by 2 in the above expression. Then we will get the same expression, except that in the place of k I am going to put 3 by 2. So let us put k equals 3 by 2 and calculate what N of S k will be. I want you to do this calculation now for k equals 1.5 and also for k equals 2. k equals 2 means 2 standard deviations from the mean. What will that be? See if you can do the calculation. Let us take the last expression. Do not worry about the middle one. If n is very large, these two are very close, and in any case this is going to be greater than this. So let us take the last expression and do it for k equals 1.5 and k equals 2. And assume that n is 100. Suppose there are 100 data points; what will this N of S k be? So we are taking k equals 1.5 and n equals 100, where n is the total number of points. You can take any number, but let us suppose that we have 100. So what is N of S k? Yeah. Can you speak louder? Okay. 55.55, or 56, depending on whether you round it. So it says that more than 55.56 points will be in this group, inside this range. What will happen if you take k equals 2? 75. So this says that if you take k equals 1.5, then more than 55% of the data will lie within 1.5 standard deviations of the mean. And if you take k equals 2, then more than 75% will lie within 2 standard deviations of the mean. So is it okay? Do you have any question here? Okay. So I want you to think about how you will write this in Scilab. This is something we discussed in the class.
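The arithmetic asked for above, for k = 1.5 and k = 2 with n = 100, can be checked with a short sketch. It is written in Python rather than Scilab purely for illustration, and the function name `chebyshev_lower_bound` is made up for this example.

```python
def chebyshev_lower_bound(k):
    """Lower bound 1 - 1/k^2 on the fraction of data points that lie
    within k sample standard deviations of the sample mean (k >= 1)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1 - 1 / k**2

n = 100  # total number of data points, as assumed in the lecture
for k in (1.5, 2):
    frac = chebyshev_lower_bound(k)
    print(f"k = {k}: more than {frac:.4f} of the data, "
          f"that is, more than {frac * n:.2f} of {n} points")
```

For k = 1.5 the bound is 1 - 1/2.25, about 0.5556, and for k = 2 it is exactly 0.75, matching the 55.56 and 75 worked out in class.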
Supposing I give some data set and say, calculate this in Scilab. You have lots of points, let's say 100 points. And somebody has already put these in a table, which means that you don't have to type them in. So the data is already there, right? Then I say, count this number N of S 2 and check it against Chebyshev's inequality, okay? You have 100 data points that are already entered. Let's say they are in a vector called data: data equals, within square brackets, 1, 2, 3, 4, all the way up to 100. You have all those data points. So how would you find this N of S 2? If you want to use Scilab, what is the first step? Yeah, the mean. So what is the command? Let's say m equals mean of, within brackets, data. Okay, that is the first step. Then what is the next step? See, here you need this s also. What is s? The standard deviation. How do you calculate the standard deviation in Scilab? I think it is st underscore deviation, right? Is that correct? Does anyone remember the command for standard deviation? Yeah, it is st underscore deviation. This is what we said in the last class. We had actually discussed the grades, right? We said that people between the mean and one standard deviation above will get AB, anybody more than one standard deviation above will get AA, and so on. All right, so we find that. Then what do we do? We have found the mean and we have found the standard deviation. What is the next thing? What command will we use? We want to locate the numbers that are within a distance of plus or minus, let's say, 2s from the mean. We want to locate all of those. So what command will we use in Scilab? Anyone recall? We use the command called find.
So we will say find, and within brackets we will say data greater than m minus s, and, where and is given by ampersand, data less than m plus s. Actually, here you have to put 2s in place of s, and you must write it as 2 star s. You should not write 2s; it'll say 2s is unknown. So write 2 star s. What find will do is collect the positions of all of those points and put them in a variable. If you don't specify a variable, it'll put them in ans. Then you just find the length of ans. It'll tell you the total number of elements in this answer vector, okay? So I would want you to try this problem. The grades example I will put up. If you just go through it, the syntax will become clear. So it is easy to do this calculation. Now, it turns out that this Chebyshev's inequality is very conservative. For example, for k equals 1.5 it says this fraction will be greater than 55%, okay? In the book they have given an example with the top ten selling cars for 1999. For this, they actually found 90% of the data to fall within plus or minus 1.5 s. In that sense, it is a conservative bound, okay? Conservative means it underpredicts. It's like saying that it'll be greater than two when actually you have ten. Two is then a very poor estimate. So you would look for a tighter estimate: instead of two, can you say something like it'll be greater than five? When you know that there are going to be ten points within that range, and somebody says it's going to be greater than two, that's not a very good estimate. So you would want a better estimate. And in order to understand this, I want to define a new variable called R k, which refers to all points outside this range.
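The steps worked out above (mean, standard deviation, locate the points within 2s of the mean, count them) can be mirrored in a short sketch. It is written in Python with the standard library rather than Scilab, so it is an analogue of the Scilab session and not the commands themselves; the function name `count_within` is made up, and `statistics.stdev` is the sample standard deviation with n - 1 in the denominator.

```python
import statistics

def count_within(data, k):
    """Count the data points strictly within k sample standard deviations
    of the sample mean, i.e. N(S_k) in the lecture's notation."""
    m = statistics.mean(data)            # step 1: the mean
    s = statistics.stdev(data)           # step 2: the standard deviation
    # step 3: keep the points with m - k*s < x < m + k*s
    inside = [x for x in data if m - k * s < x < m + k * s]
    return len(inside)                   # step 4: count them

data = list(range(1, 101))               # the vector 1, 2, ..., 100 from the lecture
n_s2 = count_within(data, 2)
bound = (1 - 1 / 2**2) * len(data)       # Chebyshev: more than 75 points for k = 2

print("N(S_2) =", n_s2, "and Chebyshev guarantees more than", bound)
```

For this particular vector every point lies within two standard deviations of the mean, so the count is well above the guaranteed 75, which already hints at how conservative the bound can be.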
Previously we said that S k is the group within that distance of the mean. In order to explain this new inequality, I'm going to define a new variable called R k, which refers to the things outside this, okay? By the way, this symbol R k is not defined in the book. It is a symbol that I have defined to make it clear, okay? So S k means the things that are within, let's say, k s distance from the mean, whereas R k means the things outside that set, that is, at a distance greater than or equal to k s, okay? As a result N of S k plus N of R k will be equal to n, the total number. So if N of S k over n is greater than 1 minus 1 by k squared, then N of R k over n will be less than or equal to 1 by k squared, okay? The fraction of things outside that range will be less than or equal to 1 by k squared. See if this is okay? By the way, we didn't prove this. I would want you to read it from the book. The proof is given in the book for Chebyshev's inequality, both the one we have seen until now and the one I'm going to talk about, okay? This statement, N of R k over n is less than or equal to 1 by k squared, is the Chebyshev's inequality that we have seen so far. Is this okay? Are you with me? Yes? Okay, now, there is another inequality which is a little tighter. It is called the one-sided Chebyshev's inequality. So this is the inequality we got earlier, namely, the fraction of things outside this grouping should be less than or equal to 1 by k squared. Whereas this one-sided Chebyshev's inequality says that it will in fact be less than or equal to 1 over 1 plus k squared, right? So what will happen if you put k equals 2? In 1 by k squared, what will this be? 0.25. So it says that the things outside that grouping will be less than or equal to 25%. Whereas if you substitute the same k in 1 over 1 plus k squared, this will be equal to 0.2. So it says that at most 20% will be outside this grouping. As a result, it gives a less conservative bound.
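The two bounds compared above can be put side by side. R_k is the lecturer's own symbol for the points outside S_k. One caveat, stated as it usually appears in textbooks rather than as it was said in class: the one-sided inequality is normally stated for the points that lie at least k s above the mean (one side only), which is the set written as T_k below; the symbol T_k is introduced here only for this note.

```latex
% Two-sided bound, from N(S_k) + N(R_k) = n
\frac{N(R_k)}{n} \;\le\; \frac{1}{k^2}

% One-sided Chebyshev inequality, with T_k = \{\, i : x_i - \bar{x} \ge k s \,\}
\frac{N(T_k)}{n} \;\le\; \frac{1}{1 + k^2}

% Numerical comparison for k = 2
\frac{1}{k^2} = \frac{1}{4} = 0.25,
\qquad
\frac{1}{1 + k^2} = \frac{1}{5} = 0.2
```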
By the way, this also is proved in the book. So this is what I have said. For k equals 2, if you do the calculation, you'll get 1 by 4 with the first inequality, whereas you get 1 by 5 with the one-sided one, okay? I would want you to read this. And because of this, I say that the one-sided Chebyshev's inequality is less conservative. Before I go to the next topic, I would want to talk about the quiz that is going to be held one week from now in the same classroom, okay? This is the first quiz, quiz one; I announced it on the first day. It'll be held one week from now in the same classroom. Sorry? Is the date correct? Thank you, thank you. Yeah, thanks. Is the date okay? So we'll do this one week from now. What I wanted to discuss is how this quiz will be conducted. This quiz will be prepared from contributions from all of you, okay? If it doesn't work out, I'm going to set the paper, right? So what we are going to do is set up a forum in Moodle, called a Q&A forum, in which you can upload your questions. Only after you upload one question can you see the questions other people have uploaded. Then from the questions that have been contributed, we will select some questions randomly and give them. The earlier you submit, the better the chances of those questions getting included in the quiz, okay? And if we don't get good questions, then we will add some questions. Supposing we find only a few interesting questions, then I will add my own questions, okay? But if there are enough good questions, it is possible that all the questions will come from the question bank that you will have created. The quiz is on the material covered in the first two chapters, as well as in the case studies that I presented. You can ask anything you think is reasonable. So every good question has a good chance of getting selected.
And the selection is random because we have a procedure: we will ask our TAs to shortlist questions and then randomly select out of that shortlist. And there are 40 TAs. It turns out that if you have more than 30 data points, there is a likelihood of things becoming normal, random selection will work, and things like that. So this is the procedure we are going to follow. If there are not enough questions, then we will set our own. Any questions? Is it okay? Should we try? No, we should not try. Why not? Give the mic there. They think it is a bad idea. It's because it's not a conventional way, okay? It's an innovative one, but I don't think it works out. Okay, is there anybody who will say this is something we could try? What's your name please? Dhanashree. Dhanashree, okay. So what can go wrong? What I'm saying is, let's try it. The deadline is Wednesday 5 p.m. Suppose we find that the questions are not there, say because you decide not to submit. I'm going to set the paper anyway. Essentially what you're telling me is that I will have to set the paper, which I'm glad to do. But what I'm saying is that we are willing to do this extra work of trying this out. Let's try it out. If questions come, it is okay. If the questions don't come, I'm going to set the paper anyway. I'm not saying that if there are no questions, we are going to cancel the quiz. I'm not saying that. Okay, we can discuss anything else, but not this. So coming back to your contribution. Do you think it is too much work for you? Why would you not want it? Do you think some people will have an unfair advantage? You think so? What do you think? Do you want to give your comments? I don't see any unfair advantage in conducting such a quiz. The point is it's a random thing and the weightage is not that high. So you're not bound to lose much even if someone knows a question beforehand. And if you post a question, you know all the questions beforehand.
So it's not a disadvantage, but rather an advantage. Okay, I will tell you where there will be an unfair advantage. Suppose only ten questions are submitted, and four out of those ten are going to be in the quiz. Then obviously there could be an unfair advantage for some people. Under that condition, what we can do is say that there are too few questions, so we cannot really use them, okay? I'll set a new paper. I think that's the only solution. We were actually thinking of saying that the people who set those questions get a separate paper, and we still use their questions for everybody else. But then there are issues: their friends would know the questions, and we will not know who their friends are. So there will be some problems like that. So let us think about some number. If there are fewer than, let's say, 200 questions, we can say we will scrap it. Is that okay? Yes, no? If there are fewer than 200 contributions, we'll say that, sorry, we'll not use them, okay? Next, I want to talk about the next concept, called the normal data set. Most large data sets are bell shaped. They peak at the sample median. So this is a hypothesis. Such data sets are known as normal data sets, and the histogram that you get out of them is known as a normal histogram, okay? So essentially it is a bell shaped curve. So this is the normal data set. Then you have this one, which is skewed, so it comes like this, with the tail at one end. This is skewed to the left. Similarly, you have skewed to the right, in which case the tail portion comes on the right-hand side. So you think in terms of the tail. But this is just nomenclature, and there is an empirical rule. This is known as the empirical rule for what is meant by approximately normal.
What is meant by approximately normal? Consider a collected data set. Let its elements be denoted as x1, x2, all the way up to xn. The mean is x bar, and the standard deviation is s. Then the empirical rule states that the data set can be called approximately normal if the following are satisfied: approximately 68% of the data lie within one standard deviation of the mean, 95% lie within two standard deviations, and 99.7% lie within three standard deviations. So, just to do some calculation, take example 2.5a in the book. It talks about the marks scored by students in a statistics exam. They calculate the mean, right, using, let's say, Scilab. They calculate the standard deviation and find the points within some range. They count the total number and find the percentage this is of the total, okay? As per the empirical rule, 68% will lie in this range and 95% in this range, whereas the actual numbers are 53.6 and 100. Of course, you can call it approximately normal, or you might say that it is not close to approximately normal. So whether you want to call it approximately normal or not will depend on how close your approximation is, okay? Is this clear? These are the estimates that you would expect as per the rule. But the actual values are 53.6 and 100, okay? 68% is expected in this range, but the actual number is only 53.6. 95% is expected, whereas it is 100%: everything is in this range, okay? We come to the last topic. This is the sample correlation. This is what I mentioned: supposing a DA is given to the government employees, what is the price of brinjal in different cities, okay? Or you can ask, what was the brinjal price increase at each of the last 10 DA increments? And see if there is a correlation between these two. Okay?
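The empirical-rule check described above (compute the mean and standard deviation, count the points within one, two and three standard deviations, convert to percentages) can be sketched as follows. The function name `empirical_rule_fractions` and the list of marks are made up for illustration; they are not the data from example 2.5a.

```python
import statistics

def empirical_rule_fractions(data):
    """Fraction of the data within 1, 2 and 3 sample standard deviations
    of the sample mean, to compare against 68%, 95% and 99.7%."""
    m = statistics.mean(data)
    s = statistics.stdev(data)
    n = len(data)
    return {k: sum(1 for x in data if abs(x - m) <= k * s) / n
            for k in (1, 2, 3)}

# Hypothetical exam marks, not taken from the book
marks = [35, 42, 48, 51, 55, 58, 60, 62, 63, 65, 67, 70, 72, 75, 81, 90]
expected = {1: 0.68, 2: 0.95, 3: 0.997}

for k, frac in empirical_rule_fractions(marks).items():
    print(f"within {k} s: observed {frac:.1%}, empirical rule says about {expected[k]:.1%}")
```

How close the observed percentages must be to 68, 95 and 99.7 before the data set is called approximately normal is a judgment call, as the lecture points out.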
Say the first rains have come: what is the increase in umbrella sales over, let's say, the last 10 years, okay? Another example is unemployment data and strikes. Examples given in the book are daytime temperature and defective parts, and parental income and students buying textbooks. In fact, there is another example given here: years spent in school and pulse rate, okay? They essentially took people who are in their 50s and said, let's find the pulse rate, okay? And then asked, how many years did you spend in school, okay? So is there a correlation between these two? If these are related, then how do we decipher that? Can we draw any conclusions out of this? This is an extremely important thing, and very often a lot of our predictions are based on such correlations. Look at economics, at predicting how the economy is going to behave and what the growth rate is going to be. The government depends on this, because we don't have an exact model. These things do not obey a conservation principle. I cannot write a mass balance equation. I cannot write a momentum balance equation. Those are exact. Whereas something like this helps us figure things out. For example, you want to predict how the stock market is going to behave. We don't have a definite model. So there, something like this is extremely important. So what is this based on? You have x values and you have y values. We talked about two different things. Supposing large values of x are associated with large values of y: a larger DA means a larger brinjal price increase. Supposing there is a correlation like this, and also small values of x are associated with small values of y. Then, if you find the difference between each value and its mean, xi minus x bar and yi minus y bar will have the same sign. As a result, the product will be greater than 0. That means we say that these are correlated.
That was for just any one point. If you sum up all of these products, then the sum is likely to be large. So here is the sample correlation coefficient, okay? We sum up all those products and divide by n minus 1 times sx times sy, where sx is the standard deviation of x and sy is the standard deviation of y. It is this. And this r, what is its value? Can you guess? It will be in what range? Yeah, it can actually be in the range minus 1 to plus 1, okay? For example, let's say this is y and this is x. If the points go up together like this, then r is positive. If they go like this, with y falling as x increases, then r is negative, okay? There are some properties of r. It lies between minus 1 and plus 1. You can prove this. If you have a linear relation of the type y equals a plus b x, with b positive, then r equals 1. If you put a minus b x, then r equals minus 1, okay? This is true also for scaled quantities. In the book, for example, he talks about this pulse rate and number of years in school. He calculates that and finds r equals 0.7, okay? So can we conclude something? You understand this example? Can we conclude that if somebody goes to school for lots of years, they will have a good pulse rate? The book argues that this is association, not causation. The reason it gives is that people who have gone to school longer are likely to be doing a certain kind of work, so maybe, as a result, the pulse rate is lower, okay? So it is difficult to conclude that if you do this, that will happen. You can at most explain this as an association. These things will possibly happen together, okay, not one as a cause for the other. So some examples are given. It is possible to solve all of them using Scilab very easily. In fact, I have solved one of the examples, the one where he gives r equals 0.7 for pulse rate, in Scilab. I'll post it in Moodle for you, okay? All right, so I want to now conclude with the last bit that I want to say.
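The sample correlation coefficient described above, the sum of the products (xi - x bar)(yi - y bar) divided by (n - 1) sx sy, can be sketched as follows. The function name `sample_correlation` and the small data sets are made up for illustration; they are not the pulse-rate data from the book.

```python
import statistics

def sample_correlation(x, y):
    """Sample correlation coefficient
    r = sum((xi - xbar) * (yi - ybar)) / ((n - 1) * sx * sy),
    where sx and sy are the sample standard deviations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    sx, sy = statistics.stdev(x), statistics.stdev(y)   # must both be nonzero
    total = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return total / ((n - 1) * sx * sy)

# Hypothetical paired data
x = [1, 2, 3, 4, 5]
y_up = [3 + 2 * xi for xi in x]        # y = a + b x with b > 0
y_down = [10 - 3 * xi for xi in x]     # y = a - b x
y_noisy = [2, 1, 4, 3, 6]              # rises with x, but not exactly linear

print(sample_correlation(x, y_up))     # very close to +1
print(sample_correlation(x, y_down))   # very close to -1
print(sample_correlation(x, y_noisy))  # strictly between 0 and 1
```

An exactly linear relation gives r equal to plus or minus 1 up to rounding, and anything less regular lands strictly inside the range; a large r by itself shows association, not that one quantity causes the other.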
Some of you have agreed to do role play and have come forward. I would want you to meet me at 3 o'clock this afternoon. Anybody else who's interested can also come. I have already announced it in Moodle; did I put it in Moodle? At 3 PM in my office. Now we are also contemplating giving some similar activities of this type, okay? So this is one. The other thing is, I'm not sure how many of you would want to do projects and so on. Would you want to do a project? No, no, yes, no, okay. So there could be different things. We have not finalized this. We are not sure whether we can do that. It'll depend on your participation also. You want to say something about different examples? Just say that only. So a possible thing that you could do in such an activity is maybe design a game for a casino. A game that does not exist as of now, not a copy-paste of an existing game. One in which the dealer has the maximum chance of winning, but the customer thinks that all the customers put together have a better chance of winning. A classical example of this is Blackjack. Other examples would be coming up with small cases where the mode is more important than the median and vice versa. So you can yourselves come up with different kinds of projects that you would like to do. You could propose a topic, saying I would like to work on this. It need not be a big project or something. It can be a two or three page report or a case study. That is also fine. Okay, the other thing that we discussed yesterday, and have been thinking about, is that there are actually a lot of case studies. It is possible for you to report on some. But at this point, as some of you fear, we are also not at all sure about the biggest problem that we can possibly have, which is how we evaluate the contribution. Is that a reason why some people said no? Is it the fear of evaluation that made you say no, or is it something else?
Why did some people say no here? We don't want to do those things. Is it marks or is it too much work? Which one? Too much work? Now, some of you immediately said no. You must have had some reason. Why? Experience in the last semester. In which course? CS 101. So what happened? Was it too much work or unfair grades? Both. So that is going to be the thing we have to discuss. If we restrict this to 5%, is it acceptable? Yes, no? We should probably include it in the poll. By the way, we have put together a survey, and it is going to be on Google. Right now it is hosted at IIT Bombay, but this will be hosted on Google Docs, right? We would want you to take it, and we will use this data further in the rest of the course. Okay, it will be an extremely good data set if we can create it. Okay, so I'm going to stop here. Next class is role play. The class after that is the quiz. Thank you.