Welcome to this lecture, in which we continue to study the way we model random phenomena in simulation analysis. In this lecture we're going to look at the input distribution and how we determine what data we will use as input to the simulation study.

What do we mean by simulation input data? Even when we did our very first simulation in class, we could not have done it without knowing something on which to build the model. Among the things we had to know were the inter-arrival times between successive customers and the service time for each customer. We used a very simple uniform distribution, one that, if you were doing this by hand, you could reproduce with a die, a bunch of chips, a deck of cards, or any other random device.

In general, how do you determine the input data to the model? First, if you have no other information and you're just testing your model, start with a constant: the same inter-arrival time, the same service time. At least you'll be able to build a model around it, and you can see what I mean when I say you need input data in order to build your model. Second, you can make an assumption about the input distribution and its parameters, because you've studied the field: you know what other people say, you know what theory says about the type of system you're studying, and you can talk to experts. There is something to base this on; you're not building a simulation model out of thin air. Third, you can use historical data. If the system you're simulating has a real-world counterpart, collect data from it. You then have a string of, say, inter-arrival times, and you can feed those in directly; you don't need a distribution when you have the actual data. Fourth, and this is really the best approach of all, take that string of data and use it to fit a theoretical distribution. Once you have the distribution, with parameters estimated from the data, you can sample from it, and you're no longer limited to the exact values you collected from the real system.

Why is this better? The data you collected from the real world is limiting; it's very constraining. You've got these numbers, you have to use them exactly as they are, and you can never produce a new one. When you have a theoretical distribution fitted to your data, you can set the data aside, and you have far more information about what goes on in your system, because these theoretical distributions have been very well studied.

Having said that, how do we go about taking real-world data and identifying a theoretical distribution from it? It's not the easiest enterprise, but it's doable, and it's something we should do and will do. The question is: could these data values from the real world have come from some specified theoretical probability distribution that we know of? Well, first, we have the data, either collected ourselves or obtained secondhand from some other source. Second, we organize the data into a frequency distribution. Assuming the data is continuous, you create intervals and group the values into them; if the data is discrete and works as is, use the individual values, with their frequencies, as your histogram.
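To make options three and four concrete, here is a minimal Python sketch with made-up inter-arrival times, contrasting trace-driven input (resampling only the recorded values) with sampling from a fitted distribution. The exponential fit here is just an illustrative assumption, not a claim about any particular data set:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical inter-arrival times (minutes) collected from a real system.
observed = np.array([2.1, 3.4, 1.7, 4.0, 2.9, 3.1, 1.2, 5.3, 2.6, 3.8])

# Option three: trace-driven input -- resample only the recorded values.
trace_sample = rng.choice(observed, size=5)

# Option four: fit a theoretical distribution (here an exponential, matched
# by its mean) and sample from it; new values are no longer limited to the
# exact numbers we happened to record.
fitted_sample = rng.exponential(scale=observed.mean(), size=5)

print("trace-driven:", np.round(trace_sample, 2))
print("fitted      :", np.round(fitted_sample, 2))
```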
We'll see very shortly how to take discrete data and put it into a distribution, and likewise continuous data. What do you do once you've created this histogram? You eyeball it. You look at it and ask whether it resembles the curve of a particular known distribution, one whose properties are theoretically established. That's step three. Step four, once you've done that, or at least hope you have, is to recognize that every distribution needs parameters, and that the shape of a distribution often changes with its parameters. So you use the data to estimate the parameters of the distribution; for example, one of the first things we'll compute is the expected value, the mean. Finally, step five, you use a statistical test to see whether the data fits the distribution you're considering or not. If it does not, you go back, not to step one, but to step two, and start again.

What is a histogram? A histogram is a frequency chart: it illustrates graphically how the data is distributed across the possible values. As the image on the slide makes clear, the chart can look very different depending on the size of your intervals. If your intervals are very large, the chart looks like the middle image, a sort of V, and you really don't get much information from it. If your intervals are very small, you get a lot of bars, as in the top image, and recall that's the same data; it's too ragged, with many gaps, and it doesn't help us form a picture, and here we mean a literal picture, of the distribution. Going somewhere in the middle, with intervals neither too large nor too small, gives you a better view of what the data looks like (we'll sketch this interval-size effect in code in a moment).

As I said, we're going to see examples of creating histograms from data that comes from a discrete distribution and from a continuous distribution. You see some examples there: anything having to do with time is continuous; anything called "number of", such as number of defects or number of jobs in queue, is discrete.

On the next few slides we'll see examples of data that was collected and graphed into frequency distributions, in other words, histograms. This one is weekly production. The weekly production values were organized into intervals that are really arbitrary, they could have been anything: 45 to 55, 55 to 65, 65 to 75, and so on, with the first and last categories catching everything below and everything above. We've got 120 weeks, 120 pieces of data, organized into this distribution. It has a decent shape, and it could probably be fit to any number of theoretical distributions, but that's not the point; the point here is just to show what continuous data looks like after we've constructed intervals and graphed it as a histogram, in this case with vertical bars. In fact, looking at the picture, this almost looks like a symmetric distribution, close to the shape of the normal distribution, and the data actually supports that.
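Here is the promised sketch of the interval-size effect: the same 120 synthetic data points, loosely resembling the weekly-production example, drawn at three bin counts, too coarse, too fine, and in between. The data is generated, not the slide's actual figures:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
data = rng.normal(loc=100, scale=15, size=120)  # synthetic "weekly production"

# The same data at three bin counts: too coarse, too fine, and in between.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, (3, 40, 10)):
    ax.hist(data, bins=bins, edgecolor="black")
    ax.set_title(f"{bins} intervals")
plt.tight_layout()
plt.show()
```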
What's the mode? It's the tallest bar, the interval with the largest frequency, and that's 96 to 105, with a frequency of 28. For the mean we would have to add up all 120 values and divide by 120; you can do that, but it's easier to first figure out where the median is. We start adding up the frequencies, one plus one plus three plus seven plus eleven and so on, until the running total passes the halfway point, so that we know the value between the 60th and 61st ordered observations falls in that interval. When you do that, the median also lands in the 96-to-105 interval, the same one that holds the mode. So the median interval equals the modal interval, and very likely the mean will end up in there too, but you'd have to verify that on your own. All of this goes into deciding, first, what theoretical distribution you could try to fit, and second, once you do, what your parameters are going to be.

Here's another example, another set of data. Much of the work done on the previous slides is missing and left for you; some of it is not missing, and we'll talk about it. This is time to complete a task. It is continuous data; note again that time data is always continuous. We have 100 observations, anywhere between 10 and 80 minutes, organized again into arbitrary intervals of 10 minutes each, and you can see the frequencies and the relative frequencies. The relative frequency is going to become very important later in this lecture. Looking at the frequencies out of 100, what's the mode? Same as last time, it's the interval with the highest frequency: 32 is the highest frequency, in the interval between 30 and 40 minutes. And look at that, the mean is 37.3 minutes; we might be on to something. What's the median? The median is the 50 percent point. If we string all the ordered observations together, which is effectively what this grouped frequency distribution does, it falls between the 50th and 51st observations, and if those are both in one interval it's much less of a problem. Indeed they are here, and again the median is in the 30-to-40 interval. That's interesting: the mean, the median, and the mode all agree, so right away we're looking at distributions that are more or less normal-ish, something resembling a bell-shaped curve.

Finally, here's an example of a discrete data set: the number of telephone inquiries, which is definitely discrete, organized into one-hour intervals. We've got 509 pieces of data, 509 one-hour intervals, and out of those 509 hours, 315 had zero telephone inquiries, 142 had one, and so on. The graph is here, and it should look very familiar. We could easily come up with the mean, the median, the mode, and so on, and we will, but later in this lecture, because we're going to keep using this data set and see what happens when we want to use it as input to a simulation.

Here we have another example of discrete data, collected into a frequency distribution but not graphed for you; that exercise, along with computing the mean, the median, and the mode, is left to you as well. Take a look at it first: it's definitely discrete, our variable x is the number of defects, and we have values from 1 to 10.
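Since the grouped statistics are left as an exercise, here is a minimal sketch of how they can be computed from a grouped frequency table. The interval edges and counts below are hypothetical stand-ins, not the slide's numbers:

```python
import numpy as np

# Hypothetical grouped frequency data: interval edges (minutes) and counts.
edges = np.array([10, 20, 30, 40, 50, 60, 70, 80])
freq  = np.array([ 5, 12, 32, 25, 14,  8,  4])   # 100 observations in total

midpoints = (edges[:-1] + edges[1:]) / 2
n = freq.sum()

grouped_mean = (midpoints * freq).sum() / n       # mean from interval midpoints
i_mode = freq.argmax()
mode_interval = (edges[i_mode], edges[i_mode + 1])

# Median: walk the cumulative frequencies until we pass the halfway point.
cum = freq.cumsum()
i_med = np.searchsorted(cum, n / 2)
median_interval = (edges[i_med], edges[i_med + 1])

print(f"n={n}, grouped mean={grouped_mean:.1f} min")
print(f"mode interval={mode_interval}, median interval={median_interval}")
```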
The frequency of each of those values is laid out, and we've collected 350 observations. In addition we have the relative frequency, which will eventually be a big help when we want to construct a probability distribution; we're not going to look at that now. I'll leave the histogram, and collecting the statistics from this distribution, to you.

You can see that the shape of your data distribution is essential in identifying the probability distribution you think it probably came from. So how do we assign an input probability distribution based on what we've seen in the data? Before we even try testing for a fit, we need something to test against; we need a hypothesized distribution that our data came from. We can use theory. For instance, when we look at something like the number of defects in a particular time period, that's a discrete count inside a continuous interval, which sounds very much like a Poisson process, whose inter-event times follow the exponential. So we use theory as much as we can, which means we have to understand our system before we even start modeling it. We also eyeball the shape of the curve we've come up with; we may rearrange the intervals to make the curve look a little different and look again, so there is art involved in this enterprise. And then, of course, we look at the parameters, sometimes to choose the distribution or family of distributions, and in any case because we'll need them anyway. Very often the parameters change the shape of the distribution: in the normal distribution, the mean and the standard deviation affect the shape; in the Poisson, the mean is equal to the variance, and that's going to affect the shape; and so on.

Over the next several slides we're simply going to look, with an artist's eye, at the shapes of some of the more common distributions our models will be asked to use. This first slide has the ones I think you know the most about. The uniform distribution is just a simple, equally-likely distribution. It's better than using a constant, though only a little better; if we don't know what the distribution looks like, we sometimes use it as a first attempt, and in fact that's exactly what we used as a first attempt in the very first class we had this semester on simulation modeling. The normal distribution is very well studied; we know all about it from our statistics classes, which is another reason it's often one of the first attempts when fitting the input data of a simulation to a distribution. The lognormal distribution, as we can see, is a family of distributions whose shape changes with the parameters. The theoretical description of the lognormal gives us an idea of when to use it in a simulation project: it is typically thought to model a process that is the product of component processes. So if you're working with compound interest and your data is rate of return, you might want to see whether you can fit a lognormal, depending on the shape and the parameters.
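To make the "product of component processes" idea concrete, here is a small sketch with made-up per-period growth factors. The product of many positive random factors tends toward a lognormal, because the central limit theorem applies to the logs:

```python
import numpy as np

rng = np.random.default_rng(3)

# A quantity that is the *product* of many positive random factors -- for
# example, hypothetical per-period compound growth factors.
factors = rng.uniform(0.95, 1.15, size=(100_000, 40))
totals = factors.prod(axis=1)

# If 'totals' is approximately lognormal, its logs should look normal.
logs = np.log(totals)
print(f"mean of log(total) = {logs.mean():.4f}")
print(f"std  of log(total) = {logs.std():.4f}")
```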
The binomial distribution, as we know, models the number of successes, however you define success, in a certain number of independent trials. The typical example is the number of heads you get when you toss a coin n times: an on/off, success/failure, hit/no-hit variable observed over a fixed number of independent trials, each with the same probability of success, which we call p. An example in quality control would be the number of defective items in a lot of size n; that's where the n comes from. What's the negative binomial? It's closely related: the negative binomial models the number of trials it takes to achieve a certain number of successes. It's a little different, the binomial turned on its head, so to speak, which is why it's called the negative binomial.

We already know quite a bit about the Poisson distribution and the exponential distribution, but let me make a point. Recall that the Poisson is a discrete distribution, the distribution of a discrete random variable, and the exponential is a continuous distribution. That's why the two pictures look different, and that's why we don't take a distribution like the Poisson and simply draw a curve through the tops of the spikes: it's a discrete distribution, and a discrete probability distribution will look different from a continuous one. The exponential is the distribution of a continuous random variable; the Poisson is the distribution of a discrete one; and the two are intimately connected, since the exponential describes the times between events whose counts are Poisson. We've worked with both extensively in the simulation models we've been building.

The interesting thing here is that when you look at the graph under "Poisson," you might ask: wait, why isn't that normal? It looks just like a normal distribution, or maybe with a little asymmetry, but it's very symmetric; the mean equals the median equals the mode, and it tails off, seemingly asymptotically, toward zero. Well, we know that often the normal is used as an approximation to the Poisson; you've seen that in your intro statistics course. The fact of the matter is that when its mean is large, the Poisson looks very much like the normal distribution you see here. When the mean is small it looks very different: the mass is clustered near zero, not in the middle, and in fact a Poisson with a small mean looks closer to the exponential curve on the right side of the slide. As the mean of the Poisson increases, the graph of the distribution gets closer and closer to what we think of as a normal distribution; we'll check this numerically in the sketch below. Keep in mind, though, that even if you use the normal as an approximation to the Poisson, or the reverse, the two behave very differently: in the Poisson the mean is exactly equal to the variance, which is obviously not true of the normal, so the behavior of the two distributions will be quite different even when the shapes look similar to the naked eye.
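Here is a quick numerical check of that claim, comparing the Poisson pmf at the integers with a normal density of matching mean and variance, for one small and one large mean. The particular values 0.5 and 30 are just illustrative choices:

```python
import numpy as np
from scipy import stats

# Compare the Poisson pmf to a normal pdf with the same mean and variance.
# For a Poisson random variable, mean = variance = lam.
for lam in (0.5, 30):
    x = np.arange(0, int(lam + 5 * np.sqrt(lam)) + 1)
    pois = stats.poisson.pmf(x, lam)
    norm = stats.norm.pdf(x, loc=lam, scale=np.sqrt(lam))
    max_gap = np.abs(pois - norm).max()
    print(f"lambda={lam:>4}: max |Poisson pmf - normal pdf| = {max_gap:.4f}")
# The gap is large for lambda = 0.5 and nearly zero for lambda = 30.
```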
Here you see several theoretical probability distributions that you may want to use if the system you're modeling has elements that fit their theoretical descriptions, what these distributions are used for or good for. The Weibull distribution is often used to model time to failure for component parts of a larger system, which seems quite useful for us. The beta distribution is used to model random variables that have fixed upper and lower limits, unlike the normal. The gamma distribution, as you can see from the picture, is used to model non-negative random variables, and a shift constant can move the distribution away from zero, much like what you might see in the other non-negative distributions such as the exponential. The Erlang distribution is used to model the sum of exponentials: if you've got a system with exponentially distributed times to failure for a number of components, and you want the estimated time to failure for the system as a whole, you're going to be summing exponentials, and the family of Erlang distributions is exactly what's useful for that type of model (we'll verify this sum-of-exponentials idea in the sketch at the end of this passage).

Finally, we've got some very simple probability distributions, almost as simple as the uniform, though the uniform is arguably easier to describe and use. You've seen triangular distributions, I would imagine, when you were building your simulation models in CLOUDES, because the triangular was one of its defaults. For a triangular model you need a minimum, a maximum, and a most likely value, and of course the most likely value gets the highest relative frequency; you're not so concerned with smoothing out the curve. It's a slight improvement over a uniform distribution. And the empirical distribution, which we've seen in a different lecture, uses the data as is to construct the distribution from which to sample.

So here we are. We have collected data; we have organized and graphed its frequencies into a histogram so we can get a picture of the data and compare it to the pictures of some known theoretical probability distributions; and then we pick one and say: I would like to determine whether my data can be said to have come from this distribution. Actually, that's not quite what we do. The null hypothesis H0 that we actually test is that our data does not differ significantly from the theoretical distribution. If we reject H0, if we say no, the data doesn't support this, then what are we accepting? We're accepting H1, that the data is different. Remember, we always set up the null hypothesis as a straw man and try to reject it, and in this case that makes a lot of sense, because the fact that your data fits one theoretical distribution doesn't mean it wouldn't also fit some other, similar theoretical distribution. All you can say with certainty is: I couldn't reject it; the data might fit this distribution. That is an honest expression of what you're doing when you test for a fit.

One statistical test that allows us to do this is the chi-square test. We use it to test hypotheses about frequencies, or we could turn the frequencies into proportions easily enough. We look at our observed frequencies, the ones we collected from the data, and then at the expected frequencies from the distribution we're studying.
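Before turning to the mechanics of the chi-square test, here is the promised check that a sum of independent exponentials behaves like an Erlang. The shape k = 4 and rate λ = 2 are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(11)

# The Erlang(k, lam) distribution is the sum of k independent exponentials
# with rate lam (it is the gamma distribution with an integer shape).
k, lam = 4, 2.0
sums = rng.exponential(scale=1 / lam, size=(100_000, k)).sum(axis=1)

# Compare simulated moments with the Erlang's theoretical k/lam and k/lam^2.
print(f"simulated   mean={sums.mean():.3f}  var={sums.var():.3f}")
print(f"theoretical mean={k / lam:.3f}  var={k / lam**2:.3f}")
```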
We can take our candidate distribution and say: if this is the actual distribution the data came from, then the data should have looked something like this. We compute expected frequencies, or expected proportions, and then we compare; if the observed frequencies are exactly on the nose, the value of the chi-square statistic will be close to zero, and we'll see that pretty soon.

Here you see the formula for the chi-square test statistic: chi-square = Σ (f_o − f_e)² / f_e, where f_o are the observed frequencies and f_e the expected frequencies. The difference between these frequencies is squared, so no term of the chi-square is negative, and each squared difference is divided by the expected frequency, so you might say it's taken relative to the expected frequency. The terms are summed over the k values or intervals, because each of our frequencies belongs to a particular interval or value.

You know a few things about this statistic. Number one, it can never be less than zero; the minimum is zero. When is it zero? It's zero when every single observed frequency is exactly the same as its corresponding expected frequency, when everything is right on the nose. That gives you an idea of what you're testing: you're testing these differences. The larger the chi-square statistic, the larger the differences between observed and expected, and the less likely it is that the observed and the expected both came from the same distribution.

The chi-square distribution is actually a family of distributions, each with a different shape and a different number of degrees of freedom. For our particular test, the degrees of freedom are computed as one less than the number of categories, intervals, or classes used to test the fit. Note that when we do a test like this we're using counts, discrete frequencies, while the chi-square distribution itself is continuous; it's a continuous approximation to a discrete quantity. This works most of the time, but we have to make sure the expected frequencies are large enough, so there's an assumption that the expected counts are not too small. We usually use the rule of thumb that each expected frequency should be five or greater. If, when you compute the expected frequencies for the distribution you're interested in, you find that some categories have expected counts below five, go back to your data and redo the intervals by combining adjacent cells; once you do, you'll have fewer categories, and you'll have to adjust the degrees of freedom as well. A small implementation of this statistic follows below.

We're going to do a quick little example here, before we get to a larger, more interesting simulation example: the distribution of a die. We want to test the hypothesis that a particular die is fair and no one's cheating, so we want to test whether the outcomes one, two, three, four, five, six really do follow a uniform distribution. We toss the die, let's say, 60 times, and the results will be on the next slide. We're testing the hypothesis that the observed values in those 60 tosses follow a uniform distribution. Right now, what you see is the beginning of the hypothesis test, where you write down your null and alternate hypotheses.
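Here is a direct, minimal implementation of the statistic just described, including a warning for the expected-frequency rule of thumb:

```python
import numpy as np

def chi_square_statistic(f_obs, f_exp):
    """Sum over the k categories of (f_o - f_e)^2 / f_e."""
    f_obs = np.asarray(f_obs, dtype=float)
    f_exp = np.asarray(f_exp, dtype=float)
    if (f_exp < 5).any():
        # Rule of thumb from the lecture: combine adjacent categories until
        # every expected frequency is at least 5 (and reduce the degrees of
        # freedom accordingly) before trusting this statistic.
        print("warning: some expected frequencies are below 5")
    return ((f_obs - f_exp) ** 2 / f_exp).sum()
```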
The null hypothesis could be stated as: there is no difference between the empirical and the theoretical distribution, with the alternate hypothesis being that there is a difference. In other words, the null hypothesis says there is a fit, and the alternate says there is no fit. Alternatively, and perhaps more informatively, you could state the null hypothesis as "the random variable follows a uniform distribution," with the alternate hypothesis being that it does not. Let's say our alpha level is 0.05, which means we're allowing a five percent chance of rejecting the null hypothesis even when it's true.

Here we continue the hypothesis test of whether this die is fair, whether it follows a uniform distribution. The data is laid out in the table on the left: the values of the random variable are one through six, and out of 60 tosses the observed frequencies were eight ones, 12 twos, 10 threes, 11 fours, 12 fives, and seven sixes. Naturally, the expected frequencies out of 60 should all be 10, and they are. The calculated chi-square value, computed in that same table following the formula we saw on the earlier slide, is 2.2; you can see all the calculations laid out there. The critical value of the chi-square distribution is in the graph on the right. We're using alpha 0.05, so 0.05 sits in the tail; we have five degrees of freedom, because there are six categories and one less than that is five; and from the chi-square table that gives a critical value of 11.07. What that means is: if the chi-square value computed from the data is greater than 11.07, we reject the null hypothesis from the previous slide, the hypothesis that there is a fit. If the calculated value is less than 11.07, we don't reject, and we say the data could have come from a uniform distribution and the die is probably fair. Since 2.2 is well below 11.07, the conclusion is: do not reject the null hypothesis. The sketch below reproduces this calculation.
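A quick reproduction of the die test using the observed counts from the slide:

```python
import numpy as np
from scipy import stats

observed = np.array([8, 12, 10, 11, 12, 7])   # 60 tosses of the die
expected = np.full(6, 60 / 6)                 # fair die: 10 per face

chi2_stat = ((observed - expected) ** 2 / expected).sum()
critical = stats.chi2.ppf(0.95, df=len(observed) - 1)  # alpha = 0.05, 5 df

print(f"chi-square = {chi2_stat:.2f}, critical value = {critical:.2f}")
# 2.20 < 11.07, so we do not reject H0: the die looks fair.
```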
Now let's get back to our example from earlier: the number of telephone inquiries per one-hour interval. We have a discrete data distribution; we've collected 509 observations, 509 one-hour intervals, and the number of inquiries ranged from zero to five. There were 315 intervals with zero inquiries, 142 with one, and so on. Relative frequency is the number of times an event occurs compared to the number of times it could have occurred, so it's basically a proportion: with 315 of the 509 intervals showing zero inquiries, the relative frequency is 315/509 = 0.619, and the relative frequency for one inquiry, which happened 142 times out of 509, is 142/509 = 0.279, and so on. Now we look at this and say: fine, I have my data, all nicely organized in a pretty table, but what I really want is to figure out the input distribution, the theoretical probability distribution this came from, so that I can set the data aside and use the distribution to generate input data for my simulation.

Here is the graphical frequency distribution, the histogram. As we said earlier, you chart your data, think about the graphical representations of the various distributions you've looked at, and say to yourself: hey, this kind of looks like an exponential distribution. So our hypothesis could be that the random variable of interest follows the exponential distribution, and we could test the fit using the chi-square statistic. We know how to get the expected values for an exponential random variable: the relative frequencies are given by the density f(x) = λe^(−λx), where x is the value whose expected relative frequency you're looking for.

But something's not sitting right here. Part of what's not sitting right is that we're looking at the formula for the distribution of a continuous random variable, a probability density function, while on the other side of the page we have a histogram created from data that came from a discrete random variable. In addition, this data seems to fit the theoretical description of a Poisson random variable perfectly: a discrete count inside a continuous interval. The number of telephone inquiries is discrete; the one-hour interval of time is continuous. That sounds perfect; we should have looked at the Poisson first, before even looking at how the chart fell out. And it does make sense: even the shape of the histogram, if you plot a Poisson with the appropriate expected value, will look a lot like this. So let's try the Poisson.

The probability of x events is given by the formula P(X = x) = λ^x e^(−λ) / x!. What's in the formula? We have λ raised to the power x, e raised to the power −λ, and x factorial. So aside from the constant e, we have the variable x and the parameter λ, and λ is the mean of the Poisson distribution. We estimate it with the mean of the observed data, which in this case works out to 0.5147 inquiries per hour.

We're almost home, but not quite. The null hypothesis is that this random variable follows the Poisson distribution; the alternate hypothesis is that it does not. From the table, the critical value is 11.07, just like before, assuming we're using alpha 0.05; the degrees of freedom are still five, the same as before, because we have six categories, the values zero through five that we observed in these various time periods. (Strictly speaking, many texts subtract one additional degree of freedom for each parameter estimated from the data, which would give four here; the slides use the simpler categories-minus-one convention.) So the critical value is 11.07, and if our calculated chi-square is greater than that, we reject the null hypothesis and say there is no fit.

Look at the table. We still have the columns of observed frequency and relative frequency, but now we get the relative frequencies from the Poisson formula and, using 509 as our n, convert them into expected frequencies, rather like the reverse of how we turned frequencies into relative frequencies. Just remember: relative frequencies are nothing more than proportions, which serve as probabilities, and frequencies are counts. The chi-square statistic, again, is the observed minus the expected, squared, over the expected, summed over every category, and we end up with something marginally rejectable: a chi-square value of 11.78. Since we set up the test to reject anything greater than 11.07, we must indeed reject the null hypothesis.

But all is not lost; let's continue. Take a closer look at the expected frequencies column of this table. For zero, one, and two inquiries, fine; for three inquiries we have an expected frequency of 6.92; but after that, the expected frequencies for four and five, 0.87 and 0.1, are just too low. We've violated the assumptions under which the chi-square statistic operates.
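Here is a sketch of the full Poisson calculation. The counts 315 and 142 come from the lecture; the remaining counts (40, 9, 2, 1) are an assumption chosen to be consistent with the stated mean of 0.5147. Note that with unrounded Poisson probabilities the statistic comes out near 12.8 rather than the slide's 11.78, a discrepancy attributable to rounding in the slide's relative frequencies; either way it exceeds 11.07 and the null hypothesis is rejected:

```python
import numpy as np
from scipy import stats

# Observed counts for x = 0..5 inquiries per hour. 315 and 142 are from the
# lecture; 40, 9, 2, 1 are assumed (they reproduce the stated mean 0.5147).
observed = np.array([315, 142, 40, 9, 2, 1])
x = np.arange(6)
n = observed.sum()                        # 509

lam = (x * observed).sum() / n            # 262 / 509 = 0.5147 inquiries/hour
expected = n * stats.poisson.pmf(x, lam)
print("expected:", np.round(expected, 2))  # note the tiny counts for x = 4, 5

chi2_stat = ((observed - expected) ** 2 / expected).sum()
critical = stats.chi2.ppf(0.95, df=5)
print(f"chi-square = {chi2_stat:.2f} vs critical = {critical:.2f} -> reject H0")
```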
On the next slide you'll see what we do to fix that. The null and alternative hypotheses are not repeated, because clearly they haven't changed. The observed and expected frequencies have changed, because we've collapsed the categories with three, four, and five inquiries into a single "three or more" category, and we end up with a chi-square statistic, calculated from the data, of 3.88. Looking at the chi-square table with alpha 0.05, the tail probability of 0.05, we now have three degrees of freedom, four categories minus one, and the critical value from the chi-square table is 7.815. The value calculated from the data is smaller than that, so we do not reject H0. (Under the stricter convention that also subtracts a degree of freedom for the estimated parameter, we'd have two degrees of freedom and a critical value of 5.991; the conclusion is the same.) And that's good news: it means our data is consistent with this particular Poisson distribution, and we can go ahead and use it in the simulation.
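And the same calculation with the last three categories collapsed, using the same assumed counts as before:

```python
import numpy as np
from scipy import stats

# Same data with x = 3, 4, 5 collapsed into one ">= 3" category so that
# every expected frequency is at least 5.
observed = np.array([315, 142, 40, 12])   # x = 0, 1, 2, >= 3
n = observed.sum()
lam = 262 / 509                           # ~0.5147, as estimated earlier

p = stats.poisson.pmf([0, 1, 2], lam)
p = np.append(p, 1.0 - p.sum())           # P(X >= 3) as the remainder
expected = n * p

chi2_stat = ((observed - expected) ** 2 / expected).sum()
critical = stats.chi2.ppf(0.95, df=len(observed) - 1)   # 3 df -> 7.815
print(f"chi-square = {chi2_stat:.2f} vs critical = {critical:.3f}")
# ~3.9 < 7.815 -> do not reject H0: the Poisson fit is plausible.
```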