Hi, I'm Zor. Welcome to Unizor Education. I would like to continue talking about statistics, and today's lecture is about histograms, one of the very important tools used in statistics. This lecture is part of the course of Advanced Mathematics, which I present on Unizor.com. I suggest you watch this lecture from the website, because it has very detailed notes, probably sometimes even better than whatever I'm saying right now, because when I was writing them I was thinking a little more thoroughly. In addition, there are exams on the site, many problems, etc.

Okay, so let's talk about histograms. Suppose you would like to have some idea about how a certain random variable is distributed, and you don't have any theoretical knowledge about how it should be distributed. As an example, patients come to a doctor's office with certain illnesses, and the first thing the staff might do is measure their temperature. You don't know how the temperature is distributed among people: how many have a high fever, how many a low fever. What you do have is observations. You observe the random variable and record the values it took over a certain number of experiments, and we are primarily talking about experiments that are independent of each other. So we have exactly the same random variable, which we experiment with independently again and again. Based on the results of these independent experiments, you would like to have some picture, preferably graphical, of how the probabilities are distributed among different values.

Let's start with a very simple case, where you have a random variable which you know something about.
Let's say this is a die, and you know that the random variable which describes how the die rolls can take only six values: 1, 2, 3, 4, 5 and 6. That's much more than people usually know about a random variable: you know beforehand which values it takes. Now, again, let's say you don't know any theory behind how the die rolls, and you don't know whether it's an ideal die or not. So all you do is experiment. You roll the die many, many times and you have some results. You have n experiments, each with its own result, and each result is either one or two or three or four or five or six, obviously.

What can be done in this case? One reasonable line of thought is this: I know that the random variable took these values in n experiments, so why don't we assign the probability 1/n to each of them? Since the experiments are independent and identically distributed, it makes sense to give each result the same weight 1/n. Now, there are only six possible values, which means that there are many repetitions if n is relatively large, like 100 or 1000. Fine: I will combine the probabilities of all the results equal to one. Let's say there are n1 of them (lowercase n with index 1), and I call n1/n the experimentally obtained probability of getting one. Similarly, I count how many results are equal to two and get n2/n as the probability of two, et cetera, up to six. So I have this as my sample probability distribution.

How can I represent this visually? Very easy. You draw a graph, and on the horizontal axis you have the numbers 1, 2, 3, 4, 5 and 6.
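The counting just described can be sketched in a few lines of Python. This is only an illustration: the rolls are simulated with a pseudo-random generator instead of a real die, and the function name `empirical_distribution` is made up for this example.

```python
import random
from collections import Counter

def empirical_distribution(rolls):
    """Return {face: n_face / n} for the six faces of a die."""
    n = len(rolls)
    counts = Counter(rolls)  # n_1, n_2, ..., n_6 (a missing face counts as 0)
    return {face: counts[face] / n for face in range(1, 7)}

random.seed(0)  # fixed seed only so the illustration is repeatable
# Simulated stand-in for n = 1000 real rolls (an assumption of this sketch).
rolls = [random.randint(1, 6) for _ in range(1000)]

probs = empirical_distribution(rolls)
for face, p in probs.items():
    # Crude text "rectangle": its length is proportional to n_face / n.
    print(face, f"{p:.3f}", "#" * round(p * 100))
```

The six values n1/n, ..., n6/n always add up to 1, because every roll is counted exactly once.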
And on each of these segments you put a rectangle with height proportional to the corresponding count: n1, then n2, n3, n4, n5 and n6, something like this. Generally speaking, obviously, they can be different. So this is your graphical representation of the distribution of probabilities in this case. It's relatively understandable and relatively simple. The problem is that this situation occurs very rarely. More frequently we don't know the values beforehand, and most likely the distribution you are dealing with is continuous, which means all values within a certain reasonable range are possible.

Let's go back to my original example about the temperature of people who come to the doctor for a visit. All you can do is measure the temperature. Each person has a certain temperature, and if your thermometer is reasonably precise and, let's say, in Celsius, a reading is something like 37.2. So you have a reasonable range of temperatures, let's say from 35 to 40 degrees Celsius. Normal temperature is about 37 in Celsius, which is 98.6 in Fahrenheit, so our temperatures should be somewhere around these numbers, not much higher and not much lower.

So you measure the temperature, and yes, you also have some repeated values. For instance, if you measure 100 people, maybe 4 or 5 or 7 of them will have the same temperature of 37.2. So, basically, you could apply the same principle. However, that's not really the right thing to do, because you feel that the 100 values you have received are not the only possible values. You know that everything in between is also possible; you just did not happen to observe it. Let's say you have 37.2 seven times and 37.3 not at all. Does it mean that the probability of 37.2 is proportional to 7 and the probability of 37.3 is 0? Of course not. So, what can we do in this case?
Again, the reasonable approach is the following. You know that during all these experiments the temperature stayed within this range. Let's say you have 100 patients, so you have 100 numbers, each of them within this range. What I suggest is to divide the interval from 35 to 40 into a certain number of smaller intervals. How many intervals is a different question, which we will address later. As a first attempt, let's divide it into 10 equal pieces, with boundaries at 35, 35.5, 36, 36.5, 37, 37.5, 38, 38.5, 39, 39.5 and 40. These 10 intervals are sometimes called bins.

Having these 10 bins, what we do next is take all 100 values of the temperature which we have received and keep a counter in each bin. If a particular temperature, let's say 37.2, falls between 37 and 37.5, then I increase the counter of this particular bin by one. Then another temperature comes and I increase the counter of another bin. Gradually I will have a graph which represents my distribution of probabilities, and I can say that the probability for a patient to have a temperature between 37 and 37.5 is proportional to the height of this rectangle. Again, we can draw it proportionally: you don't have to build it up to 100 units or anything like that; it all depends on the scaling you choose. As a result, you get a certain number of rectangles positioned next to one another, and the height of each rectangle shows how many values of our random variable, the temperature, actually fell within that particular range. This graphical representation is a histogram.
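Here is a minimal Python sketch of this counting procedure, with the range from 35 to 40 and 10 bins as in the lecture. The temperature readings are made up for illustration, the name `histogram_counts` is hypothetical, and the boundary convention is an assumption: a value exactly on a boundary goes into the bin to its right, except 40, which goes into the last bin.

```python
def histogram_counts(values, low, high, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    width = (high - low) / num_bins
    counts = [0] * num_bins  # one counter per bin, all starting at zero
    for v in values:
        if not (low <= v <= high):
            continue  # readings outside the chosen range are ignored here
        # Index of the bin containing v; the top boundary joins the last bin.
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    return counts

LOW, HIGH, NUM_BINS = 35.0, 40.0, 10
# Made-up readings in degrees Celsius, not real patient data.
temps = [36.4, 36.8, 37.0, 37.2, 37.2, 37.3, 37.6, 38.1, 39.0, 40.0]

counts = histogram_counts(temps, LOW, HIGH, NUM_BINS)
width = (HIGH - LOW) / NUM_BINS
for i, count in enumerate(counts):
    left = LOW + i * width
    # Text version of the rectangles: one '#' per value in the bin.
    print(f"{left:4.1f}-{left + width:4.1f} | {'#' * count}")
```

Dividing each counter by the total number of readings turns the heights into experimental probabilities; the shape of the picture stays the same.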
All right, so a histogram is a graphical representation of the experimental probabilities which you obtain by conducting independent experiments with the same random variable under identical conditions. The question about the number of bins is still open. Let's think about it this way. Say you have 100 values. If you put them into too many bins, let's say 50, then you will have lots of empty bins, and other bins will have a count of 3, 4 or 5, which doesn't look like the right representation of the probabilities. Whenever a bin gets too few hits, it is not really representative. You know the law of large numbers, right? The more data we have, the more precisely we can evaluate. So having very few values within certain intervals is not good; it doesn't make your understanding of how the probabilities are distributed any clearer.

On the other hand, suppose you have too small a number of bins; let's say for 100 experiments you have something like 4 bins. Four bins will not give you a nice graph; you will have something like this, and it doesn't signify much. So, again: when the number of bins is too large, we have too few hits per bin, and when the number of bins is too small, we don't get a good graphical feeling of what the distribution looks like.

So there are certain recommendations. One of them is that the number of bins should be something like the square root of n. It's very individual, and obviously you can agree or disagree, but if you have 100 experiments, like I did, having 10 bins seems to be reasonably appropriate, and from 10 bins you can really see the distribution more or less. However, as n grows, the number of bins given by this rule seems to become a little too large.
I mean, let's say you have 10,000 experiments; would you like to have 100 bins? It seems to be too much. Imagine 10,000 patients. The temperature is still within the same range from 35 to 40, but with 100 bins each bin is only 0.05 of a degree wide, so the first one goes from 35 to 35.05. You would need to measure with a precision of hundredths of a degree, which is not really practical. Then there is another formula, which gives you a smaller number: the logarithm of n to base 2, plus 1. It is more appropriate for normal distributions with a large number of data elements.

Now, I just mentioned that this formula is more applicable when the distribution is normal. Well, is the distribution normal? Obviously, you don't know. However, you might think about it this way. Imagine that a certain random variable takes a value which is the result of many different factors affecting it. Take human temperature: you have billions or trillions of cells in the human body. Each cell is like a little engine; it works and it emits a certain amount of heat. All these little engines work independently, or maybe not quite independently, nobody really knows, but anyway each cell contributes its own small piece of heat. The combined heat produced by all these little cells is the result which we see on the thermometer. So it's reasonable to assume that whenever lots and lots of factors contribute to something which you are measuring, then, according to the central limit theorem, the distribution of this conglomerate random variable is relatively close to a normal distribution. So in some cases you just assume, based on some general considerations, that your distribution should be close to normal.
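The claim that a sum of many small independent contributions is close to normally distributed (this is the central limit theorem) can be checked with a rough simulation. Everything in this sketch is an assumption made for illustration: 200 "cells" instead of billions, each contributing a uniformly distributed random amount, and the made-up name `total_heat`. It is not a model of a real body.

```python
import random
import statistics

random.seed(1)  # fixed seed only so the illustration is repeatable

def total_heat(num_cells=200):
    """Sum of many small independent contributions, each uniform on [0, 1)."""
    return sum(random.random() for _ in range(num_cells))

# 5000 independent "measurements" of the combined value.
samples = [total_heat() for _ in range(5000)]
mean = statistics.fmean(samples)
sd = statistics.pstdev(samples)

# For a normal distribution about 68% of values lie within one standard
# deviation of the mean and about 95% within two.
within_one_sd = sum(abs(x - mean) <= sd for x in samples) / len(samples)
within_two_sd = sum(abs(x - mean) <= 2 * sd for x in samples) / len(samples)
print(f"mean={mean:.2f} sd={sd:.2f}")
print(f"within 1 sd: {within_one_sd:.3f}, within 2 sd: {within_two_sd:.3f}")
```

If the two printed fractions come out near 0.68 and 0.95, the simulated sums behave much like a normal random variable, even though each individual contribution is uniform and not normal at all.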
In other cases this assumption is completely inapplicable, and there are cases when the distribution is definitely not normal. Then you probably should do something else. But, again, something like the square root of n, or the logarithm formula when the number of experiments is large, is a good recommendation for the number of bins.

Once you have established this number, the next question is how to divide the range which you have, in this case from 35 to 40, into that many intervals. Are the intervals supposed to be the same size? I have divided it this way: each interval is half a degree Celsius. In most cases that's probably a good idea. In some cases it might not be. Suppose that values near the minimum and near the maximum occur very rarely, and most of the values are here in the middle. If you make all intervals the same size, then at the edges you will have counts like 0, 0, 0, 1, 1, 0, 1, etc., and then a very steep rise in the counts for the central intervals, which doesn't really look good. In that case you might reduce the number of intervals at the edges by combining several of them into one. It will be a bigger interval, but since each of the smaller intervals had a very small number of values, the bigger one will have a more reasonable number, and your curve might look a little better, something like this.

But again, it's very individual; it depends on the experiment, on the random variable and on its distribution. Generally speaking, you will be fine if you take the number of bins to be something like the square root of n or the logarithm of n to base 2 plus 1, and if you make the intervals the same length, like I did here. In any case, your purpose is to get a histogram, which means to get some kind of graphical representation.
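The two rules of thumb for the number of bins, and the idea of combining sparse neighboring bins, can be sketched in Python as follows. Several things here are assumptions of the sketch and not part of the lecture: rounding the formulas up to a whole number, the threshold of 5 values per merged bin, the merging strategy (sweep from left to right), the function names, and the made-up counts.

```python
import math

def sqrt_rule(n):
    """About sqrt(n) bins (rounding up is an assumption of this sketch)."""
    return math.ceil(math.sqrt(n))

def log2_rule(n):
    """About log2(n) + 1 bins (rounding up is an assumption of this sketch)."""
    return math.ceil(math.log2(n)) + 1

def merge_sparse_bins(edges, counts, min_count):
    """Sweep left to right, joining neighboring bins until each merged bin
    holds at least min_count values; a sparse tail joins the last merged bin."""
    new_edges, new_counts, running = [edges[0]], [], 0
    for i, c in enumerate(counts):
        running += c
        if running >= min_count:
            new_edges.append(edges[i + 1])  # close the merged bin here
            new_counts.append(running)
            running = 0
    if new_edges[-1] != edges[-1]:  # some bins at the right end are left over
        if new_counts:
            new_counts[-1] += running
            new_edges[-1] = edges[-1]
        else:  # every bin was sparse: everything becomes one bin
            new_counts.append(running)
            new_edges.append(edges[-1])
    return new_edges, new_counts

for n in (100, 10_000):
    print(n, "values:", sqrt_rule(n), "bins vs", log2_rule(n), "bins")

edges = [35.0 + 0.5 * i for i in range(11)]  # 10 bins of half a degree
counts = [1, 0, 2, 14, 38, 27, 13, 3, 1, 1]  # made-up counts for 100 values
merged_edges, merged_counts = merge_sparse_bins(edges, counts, min_count=5)
print(merged_edges)
print(merged_counts)
```

For 100 values the two rules give similar answers, while for 10,000 values the logarithmic rule gives far fewer bins than the square root rule, which matches the concern about having 100 bins.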
By the way, for this particular case I have some results taken from one of the websites, and they look something like this. I'm talking about the heights of the rectangles: this will be 1, this will be 3, this will be 16, this will be 38, then 18, 15, 6, and then 1, 1 and 1. This, by the way, is the case I was talking about, when you have too few hits in these last intervals, so you might as well combine them. But in any case, you will see something like this, and it is very close to the bell shape which is typical for a normal distribution. And again, as I was saying, considering that the temperature is a result of a combination of millions or billions of different factors inside the body, it's not a surprise that this curve looks relatively close to a normal distribution.

In any case, this graphical representation, the histogram, gives you an idea of how your particular random variable behaves. And if you have an idea about the distribution, you can calculate certain things: probabilities, averages, variances, etc. Not that you can't do it without the graph, but the graph really helps you, and the histogram is a very, very popular way of representing information. So I do suggest you read all the notes to this lecture on Unizor.com. And basically that's it. Good luck.