That's very nice. So, this morning we will start by singing a joint morning song together... no, we will not; we will do something new today. Today we have produced, and I hope you have all received, a couple of sheets with colors like this. Should you be colorblind: this is the red, this is the green, and this is the yellow; of course otherwise you wouldn't be able to know. During this lecture, and also in some of the following lectures, I will try to give you some very simple questions, and in order to get your answers we have introduced these color sheets. There will be an answer associated with each color, and then you simply have to show your true colors. So that's the idea with this little game. In that way I can get an impression of whether you are able to answer these small exercises correctly, and then we can talk a little about the exercises if it seems there are any problems in solving them. So this is a new idea on how to get a little more interaction between you and me. We still have some late-coming buses; I really have to complain to the public transportation system here. Please take a seat. At the last lecture we were looking at the theory of probability, and in this lecture we will take the next step and go into what we call descriptive statistics. I will start with the usual overview of what we are doing within the set of lectures in this course, and then I will introduce what we call numerical summaries. We will talk a little about what we call central measures and dispersion measures, and then there are some other measures.
These complement the first two; there are also measures of correlation. After that we will have a look at graphical representations. There are many different types of graphical representations, but the ones I have chosen to present to you are probably the most commonly used, namely the one-dimensional scatter plot, histograms, quantile plots, a special plot called the Tukey box plot, and then Q-Q plots and Tukey mean-difference plots. So this is more or less the content of today's lecture. But let's start with a small exercise, so we can get a little training in using these color cards. This is a very simple question, and it relates to the theory of probability we looked at in the last lecture. The situation is that you are considering whether you want to engage in a bet, a bet involving throwing dice; more specifically, you have to throw one die, and you win the bet if you get a six. In order to decide whether you want to enter the bet, you would like to know how many throws you need in order to have an even chance of getting a six. In that way you can negotiate with your opponent how many throws you should be allowed so that the bet is fair. Do you understand the situation? How many throws do you need with one die in order to have an even chance of getting a six? Now, I've given you a couple of choices, and you may show your colors, your true colors this time. Okay? Now I will spend the next ten minutes counting... Okay, I can tell you I already have an impression. I would say 78% are red, more or less; half of the rest are yellow, and the other half is green. So I would say it's very dominantly red, a little yellow, and a little green. That's interesting.
I can tell you now that this answer was not correct, and I would like you to talk with the woman or the guy sitting next to you for a few minutes: try to discuss what could be wrong, and then re-evaluate your answer. Remember that you are allowed to use a piece of paper and also a pen in order to get your thoughts straight. Maybe we should see how you have re-evaluated the situation. Can I see the colors again? Don't give me the red one this time. Okay, and please don't be shy; everybody has to show a color. Now I see a strong tendency... hey, red is not allowed anymore! There is a strong tendency toward yellow; there are still some greens, but I would say it's dominated by yellow this time. Let's say 60% yellow and the rest green, and that looks a lot better. That looks a whole lot better, because the correct answer is that you need to throw four times with the die in order to have at least an even chance of getting a six. Intuitively, I guess many people would make exactly the mistake which you made the first time you answered, and of course most people did not attend my lecture last time; but you actually did, so you should be able to understand the way of looking at the problem which I've tried to illustrate here. What we have here is the probability that a six belongs to the set of outcomes of n throws with the die; this is what I call T_n. We can also write this as the probability of the union, that at least one outcome of n throws with the die is equal to six. Now, this probability we can simply rewrite: the probability of getting at least one six within n throws is equal to one minus the probability of getting no sixes within n throws. That's a very convenient way of looking at unions. This is a union.
It's very convenient to rewrite it in this way. Here it should be... the union of no throw being equal to six; so what I've written here is actually not completely correct. This is the intersection of the events that no throw is equal to six, and here it is correct: one minus the joint event of getting no six within n throws, which is 1 - (1 - 1/6)^n. One divided by six is the probability of getting a six, so one minus one-sixth is the probability of getting no six in one throw. Taking it to the power of n gives the probability of getting no six in n throws, and then we subtract that from one. So this is the basic equation we are interested in. Now we need to determine n so that this probability is at least 0.5, and that gives us that n must be equal to four before the probability is larger than 0.5. So next time you throw dice and you want to enter a bet, you know that in order to have an even chance of getting a six, you need at least four throws with the die. Okay. Now, as usual, I'm repeating this because I really want you to appreciate it: what we are aiming for is decision-making in engineering. The basis for decision-making is risk; not only risks when we are playing with dice, but risks associated with the engineering activities which you are all going to be involved with. In order to evaluate the risks, we need to be able to evaluate the probabilities of the events which are relevant, and the relevant events are those associated with consequences. So that's the basis for decision-making: we need to be able to evaluate probabilities of events.
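As a small check on the reasoning above, here is a sketch of the computation in Python (an illustration, not part of the lecture material); it searches for the smallest n with 1 - (5/6)^n at least 0.5:

```python
# Probability of getting at least one six in n throws of a fair die:
# P(at least one six) = 1 - (5/6)**n
def p_at_least_one_six(n):
    return 1 - (5 / 6) ** n

# Find the smallest number of throws giving at least an even chance.
n = 1
while p_at_least_one_six(n) < 0.5:
    n += 1

print(n)                                 # 4
print(round(p_at_least_one_six(3), 3))   # 0.421, so three throws are not enough
print(round(p_at_least_one_six(4), 3))   # 0.518
```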
We need probabilistic models. In order to establish probabilistic models, which you are going to learn all the details of in this course, we need to establish and estimate the models, and a significant basis for that is data. This is what we are looking at today. We are looking at data which may be available from observations from real nature, or from the laboratory, or wherever. In the very first step, what we want to do is describe these data in a consistent way, in a standardized form which makes it possible for us to assess the data and to communicate the data to other people in a way where they immediately understand what we are giving them. So: standardized. For this purpose we are introducing some standardized representations of the data in terms of what we call numerical summaries, which are simply numbers describing the data, or graphical representations, which are simply graphs showing the data. There is one extremely important observation at this point in time, and that is that what we call descriptive statistics does not make any assumptions whatsoever. In this way, the picture we are giving is a true picture; there are no underlying assumptions, and in that way it is easier to communicate, because we don't have to communicate a lot of assumptions which would otherwise be involved in our descriptions.
We are simply showing the data and nothing else. Looking at the numerical summaries, the first type of measure you would use as a characteristic of a set of data would be what we call central measures, and I'm sure you have all heard about the sample mean from your high school experiences. The sample mean is denoted by the symbol x bar; the bar is very often used as a symbol for means. It is evaluated as the sum of the values of the n observations we are dealing with, divided by n in the end. This is what we call the sample mean. If we only had the option of representing a set of data with one value, the sample mean would probably be the most relevant for most situations. There are a couple of other central measures which we will come back to in the graphical representations, but I will mention them here because they are central measures and they correspond to numbers which can be quantified: the median, which is related to what you will learn to call the 0.5 quantile, and also the mode. The mode is the most frequently occurring value in a data set, and both can be obtained from a histogram; we will see this a little later. But remember the sample mean, the median and the mode: these are central measures. There are also what we call dispersion measures, and again I'm sure that you have all heard about the sample variance. The sample variance is established by summing up, over each of the data in the data set, the squared difference between the value of the i-th component and the sample mean value; outside the summation we divide by n. This is what we call the sample variance, and the square root of the sample variance is what we call the standard deviation.
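As a sketch of the measures just defined, here they are in plain Python. This is an illustration, not part of the lecture; note that the sample variance divides by n, following the lecture's convention, whereas many software packages divide by n - 1:

```python
from collections import Counter

def sample_mean(x):
    return sum(x) / len(x)

def sample_variance(x):
    # Squared deviations from the sample mean, divided by n (lecture convention).
    m = sample_mean(x)
    return sum((xi - m) ** 2 for xi in x) / len(x)

def sample_std(x):
    # The standard deviation is the square root of the sample variance.
    return sample_variance(x) ** 0.5

def sample_median(x):
    # The 0.5 quantile: the middle value of the ordered data.
    s = sorted(x)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def sample_mode(x):
    # The most frequently occurring value.
    return Counter(x).most_common(1)[0][0]

data = [2, 3, 3, 5, 7]
print(sample_mean(data))      # 4.0
print(sample_median(data))    # 3
print(sample_mode(data))      # 3
print(sample_variance(data))  # 3.2
```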
Now, the sample variance and the standard deviation respectively indicate how much the data are dispersed around the sample mean value. In a way this is very low-level learning, you could say, because these are just some numerical summaries, some variables being introduced, but it is really very useful to memorize these things, because we will be talking about mean values, variances and standard deviations throughout this course, and if you have to think too much about what this particular type of number was, it is going to disturb you later on. So consider this very little piece of information to be something really useful: simply learn what it is, make a lot of calculations in your Excel sheet and through the exercises, so that you don't really have to think about it. This is knowledge which is very nice to have in a ready and available manner. Now, the sample coefficient of variation, which we also often write as CoV, is simply the ratio between the sample standard deviation and the sample mean value, and again it is a very nice way of indicating the variability relative to the sample mean. Then we also have a couple of other measures indicating other characteristics of the data in a data set. The first one is what we call the sample skewness. You see it almost has the same appearance as the sample variance: we take each of the values of the components of the data set and subtract the sample mean value, but instead of taking it to the power of two we take it to the power of three; we divide by the sample standard deviation to the power of three, and outside the summation we divide by n. This is a measure of symmetry in a data set. And finally, the sample kurtosis is evaluated in the same way; instead of the power of three
we use the power of four, but otherwise it has the same shape. Now, the symmetry of the data set is an indicator of whether we have many values to one or the other side of the sample mean value, and the measure of peakedness is an indicator of how densely the data are gathered around the sample mean value. So these are the characteristics: central measures, dispersion measures, and then this symmetry measure and peakedness measure indicating how the data are located relative to one another. We also have what we call measures of correlation. Here you see what we call a scatter plot of joint observations of data, x's and y's; each of the points in this graph represents values which are observed jointly of a variable X and a variable Y. The variable X could be the temperature and the variable Y could be the humidity, so you can imagine that we have a lot of observations of temperatures and humidities over time, and now we plot them in what we call a two-dimensional scatter plot. If the data points we plot are located in a very dispersed manner, as we see in this graph, then the observation is that there appears to be no strong or apparent dependency or correlation in these data; there is no order between the observations. If, on the other hand, the data gather along a line, or in a more ordered cluster in the scatter plot, then we say that there appears to be some dependency between the observed values of the x's and the y's, and we also refer to this dependency through what we call correlation. That brings me to the so-called sample covariance. Remembering that we are observing values of two types of data simultaneously, the x's and the y's, the shape of this functional form is somewhat similar to the functional form which we use to evaluate the sample variance. We are summing up again over all jointly observed data points, so we have n x's
and correspondingly we have n y's. Now we are multiplying the difference of the individual observations of the x's minus the sample mean value of the x's with the corresponding values of the y's minus the sample mean value of the y's, summing up these products, and then dividing by n. This is what we call the sample covariance. Now, if you look at this sum of simple products, you can realize that the sum gets positive contributions in the case of data pairs which are at the same time low-low or high-high, that is, values of the x's and the y's which are both high or both low compared to the mean values of the x's and the y's respectively. Let's consider the case where we have high values: the x_i would be higher than the mean value of the x's, and if at the same time we have high values of the y's, then this term, and hence the product, would be positive. So when we are multiplying, we get something positive multiplied by something positive, and that is added up in the sum; we get a positive contribution in the case of high-high. The same applies in the low-low case, where the value of the x's is lower than the sample mean value of the x's and the value of the y's is lower than the sample mean value of the y's; that means we have negative times negative, which in total gives a positive value, and that then also gives a positive contribution to the sum. So high-high and low-low combinations give positive contributions to the sum, and therefore, if the data have this tendency, we get a high value of the sample covariance. Now this can be normalized by simply dividing by the sample standard deviation of the x's and the sample standard deviation of the y's, and in that way we establish what we call the sample coefficient of correlation. Because we have normalized
with respect to the standard deviations of the x's and the y's, this sample coefficient of correlation takes values in the interval from minus one to one. So in the case where we have consistent high-high and low-low combinations, the sample coefficient of correlation will be positive and high, maybe close to one. If, on the other hand, we have consistent combinations where high values of the x's are paired with low values of the y's, or the other way around, consistent low values of the x's paired with high values of the y's, then we always get negative contributions to the summation, and in those cases we get a sample coefficient of correlation closer to minus one. So in this way the sample coefficient of correlation shows how the data are jointly paired in terms of their values. If you have any questions (and I think I forgot to say this in the first two lectures), then at any point in time please put your finger up in the air so that we can talk about it. So, numerical summaries; that was a piece of cake. We are dealing with central measures: the sample mean value, the sample median and the sample mode; don't worry, I will come back to the median and the mode. The dispersion measures: the sample variance, which indicates the distribution of the data around the sample mean, and the sample coefficient of variation, which shows the variability relative to the sample mean value;
the sample skewness, which shows more or less to which side the data are located relative to the central value, the sample mean; the sample kurtosis, which shows how the data are gathered around the sample mean; the measure of correlation, namely the sample covariance, indicating the tendency for high-high, low-low, high-low and low-high pairs in two data sets; and then the sample coefficient of correlation, namely the sample covariance normalized, a value which is always located between minus one and one. These are not really too many new types of descriptors, and it would be very convenient for you to become really comfortable with these terms and also be able to calculate these types of descriptors without having to think too much about what each one was. Now let's look at graphical representations. Again, I might be repeating something you have already learned in high school, but never mind; we just need to be sure that you remember these things. First we are looking at some observations of traffic: these are observations of traffic flow from the Swiss roadway network. As is the case on most roads, we have traffic in two directions, and in this table you see the dates on which the data have been observed.
So these are the dates of the observations, and these are the observed numbers of cars on a given day in direction one and direction two respectively. In the first column you see the number of cars which were observed on these particular dates. This we call the unordered data; these are simply the data corresponding to the dates on which they were observed. Now, of course, if you have a table filled with numbers, you can always reorder them according to what you find convenient and appropriate, and when we are doing statistics it is very often convenient to order the numbers in ascending order, so that we have the smallest value at the top and then increasing values as we go down the column, or the other way around; that depends on what we find convenient. Ordering them relative to one another, as I've done here, with increasing values as we go down, I have also done for direction two. But the numbers in the ordered and unordered columns are of course exactly the same.
They are just reorganized; I'm not changing any data at all. The simplest possible type of graphical representation we can make is what we call a scatter plot, and here we have a one-dimensional scatter plot. When we were talking about correlation and covariances we were dealing with a two-dimensional scatter plot; the simplest graphical representation is really the one-dimensional scatter plot. In this plot, what we do is simply provide an axis indicating the range of the values which we have in our data set. You see the range goes from 3,000, corresponding to the smallest value in the two data sets, up to 12,000 cars per day, corresponding to a value a little higher than the absolute maximum value in the data sets, and then we plot the observations along this axis. That already gives us an impression of how the data are located in this interval, and we can immediately get a feeling for what the central points in the two data sets are; the eye is very strong at evaluating such characteristics. So these points are clearly close to what we could call central points. We can also immediately see which are the lower points in the data sets and which are the upper points. So it gives a feeling for the data, a much more convenient overview of the data in a data set than simply listing the values in tables; it brings something visual simply to plot them. Another very convenient way of showing the data in graphs is by using histograms. For histograms, we take the data of the table for the two directions, and now we group them into intervals, like this. I have taken the range of the data and split it up into subintervals, and these ranges here are the subintervals. The first subinterval goes from 3,000 to 3,500, and then it continues in equidistant intervals all the way up to 11,500.
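The grouping into equal-width intervals just described can be sketched in a few lines. The traffic counts used below are made-up values, not the actual Swiss data, but the interval scheme (17 intervals of width 500 from 3,000 to 11,500) matches the one in the lecture:

```python
def histogram_counts(data, lower, upper, width):
    """Count observations per equal-width interval [lower, lower+width), ...,
    and return the interval midpoints alongside the counts."""
    n_bins = int((upper - lower) / width)
    counts = [0] * n_bins
    for x in data:
        if lower <= x < upper:
            counts[int((x - lower) // width)] += 1
    midpoints = [lower + width * (i + 0.5) for i in range(n_bins)]
    return midpoints, counts

# Hypothetical daily car counts in one direction (illustrative only).
cars = [3200, 4100, 4300, 4350, 5200, 5250, 5600, 7100, 11400]
midpoints, counts = histogram_counts(cars, 3000, 11500, 500)
print(len(counts))              # 17 intervals
print(midpoints[0], counts[0])  # 3250.0 1  (the interval from 3,000 to 3,500)
```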
So the last interval is from 11,000 to 11,500 cars per day. Now, for each of those intervals we establish the interval midpoints; that is simply the geometrical midpoint of each of the intervals listed in the second column. Then we go in and count how many observations we have in each of those intervals; we count the number of observations and list them in this column. In the next column we calculate the frequencies, which correspond to how big a percentage the number of observations in each interval constitutes out of the total. In order to evaluate that, we first calculate the total number of observations; then we take the number from the column of observation counts, divide it by the total number, and multiply by 100 to have it in percentages. We do that for all the intervals. In the last column we have what we call the cumulative frequency, where we simply add up the frequencies from the frequency column, but now in absolute numbers and not in percentages, and you see that the cumulative frequencies add up to one. The next step is simply that we plot the frequencies and the corresponding cumulative frequencies in these plots. The frequencies are plotted for the two different directions of the traffic. And you see here... no, here, actually: in the first one we simply have the number of observations, and it goes up to 10, the maximum value in the column of the number of observations; on the x-axis you have the intervals with the number of cars. In the second graph we have the frequencies of the observations, again plotted against the number of cars in the intervals. The first we call a simple histogram, and the second we call a frequency distribution. Now, the mode of a data set corresponds to the value which is
most frequently occurring in the data set. So when you go into the data set, then in this histogram you can immediately see where the mode is located: it is the interval, or the data value, which is most frequently observed. Now I think I will stop and let you have a break; let's meet in 15 minutes. Ladies and gentlemen, we should continue. I forgot to tell you that when you are leaving the class today, please give us back these color sheets, because we are going to use them again. Recycling. And before we continue, I would like to make you aware that the formula which I showed you a few minutes ago for the sample covariance was written with a power of two. Please correct that: delete the power of two. In that way it will be consistent with what you find in the lecture notes. So that was a small mistake in the overhead; in this way it is correct. Now we have seen the simple histograms, and we have seen the corresponding distribution of frequencies. The last thing which we can do with these data organized into intervals is to plot the cumulative frequencies, and we have them in this plot here; we call this the cumulative frequency distribution. As you see, it starts basically at zero and it goes up to one. So that was the last of the histograms, and I'm sorry for repeating high school material, but we just need to make sure that everybody still remembers these things. Now, there is one question which is always relevant when we do histograms, and that is: how do we select the interval lengths? Depending on how we arrange the data into intervals, on how many intervals we subdivide the total range into, we will of course get different graphs. The sad thing is that there is no general rule on how to arrange the data into intervals; it has to be done with a certain feeling for the data. There are some suggestions which can be found in the
literature. This one is from the book of Benjamin and Cornell, which gives the total number of intervals k equal to 1 plus 3.3 times the logarithm of n, where n is the number of data points. In many cases this provides a good first idea of how many intervals we should subdivide the data into, but it doesn't always work. If we look at the traffic flow data, using this formula we would get a total number of intervals for plotting the data in the histograms equal to six, and as you see from this figure, in the first representations I showed you I was using seventeen intervals. If we use only six intervals, you see that the histograms look completely different; a lot of information is lost in the second histogram, but basically it is the same data. So we have to be careful to have a sufficiently fine subdivision of the data into intervals. It should not be too fine either, because if it is too fine then we may get intervals, even many intervals, without any data, and then again the information contained in the graphical representation is weakened; it is not so obvious what the graphs really show. So when you have a specific set of data, you probably have to experiment a little to see what happens if you modify the length of the intervals. Okay. Now I would like to give you another little exercise. Again it relates to the material we talked about at the last lecture, and just to give you a little hint: this relates to the theorem of Bayes. These exercises more or less always start with "let us assume". Let us assume that, based on your personal judgment and experience, you assess or evaluate the probability that the deadline for a given project will be overrun; that means that you are not able to keep the deadline, or the project will use more time than envisaged. You evaluate this probability to be equal to 0.01. So imagine that you are being informed about a project, and based on your
evaluation you would say whether this project is able to live up to the deadline. The probability that the deadline will be overrun is in this case very small: the probability is 0.01, a 1% probability that the project will not meet its deadline. Now let's further assume that you know... let's assume this is a fact, and speaking from my experience it probably is: the experience of the leader of the project is a very strong indicator for the success of the project, and it is also an indicator of whether a project will be able to keep its deadline or not. Then assume that you have collected a lot of experience over your many years as an engineer, and you have observed that in 5% of projects with deadline overruns the project leader is experienced, while in 90% of projects with no deadline overruns the project leader is experienced. These are two pieces of information which you have as background. This is what you bring as an engineer, together with your ability to evaluate, simply by being informed about a project, what the probability is that the project will not meet its deadline. Now you get one piece of information concerning the specific project, namely that the leader of the present project is experienced, and you would like to evaluate the probability that the project will overrun its deadline. So we have these three background pieces of information, and based on that you would like to come up with an evaluation of how this project will perform: will it meet its deadline, given that the project leader is an experienced guy? Experience here means that the engineer, the leader, has more than 10 years in business. You can first think a little about it; you can write up some numbers on a sheet while you evaluate this. In engineering we have many classifications of engineers... don't let me disturb you; I will just entertain those who have already found the solution. Typically an experienced engineer is an engineer
with 10 to 15 years of experience; until you have 10 years of experience, you are not really considered to be experienced. That is one of the definitions. But do you know what an expert is? Many engineers are actually experts. Normally the definition we use is that an expert is an engineer who is more than 50 kilometers away from home. Have you been thinking a little? I will give you some options in this particular case. I want to see the colors of my assistants as well, if you like. I mean, it is always a good idea to talk to the expert sitting beside you; you may not find many experienced engineers sitting next to you, but you may find some experts. I think we may want to see some colors; come up with the sheets. Let's see what we have. Okay, it seems very red, a little less yellow, and very little green. My impression is that red is dominating, but yellow is close. I can tell you that the true answer is 0.0006, meaning that the right answer was red. What you can look at is the application of the theorem of Bayes. If we call the event of overrun O, and the event of an engineer being experienced E, as we have here, then we can write up the Bayes formula. We would like to calculate or assess the probability that the project will overrun, that the deadline is not kept, given the information that the engineer is experienced. So this here is what we call the posterior probability of overrunning the deadline. And what we have here, P of O, corresponds to... now, somebody please tell me, what does that correspond to? It corresponds to what we call the prior probability. P of O is the prior probability of not meeting the deadline, what we assessed simply by looking at the project, getting our first impression. Now, the second term here in the numerator is the probability of having an experienced project leader given that the
project is not meeting its deadline. So this is the conditional probability of an experienced engineer given that the project is not meeting its deadline, and that term we usually call the... yes, thank you: the likelihood. And now, in the denominator, we are simply dividing by the probability of having an experienced engineer. So we are weighting the probability of an experienced engineer, given that the project is not meeting its deadline, by the prior probability that the project is not meeting its deadline, and adding up, in the denominator, the term corresponding to the probability of an experienced engineer given that the project is meeting its deadline (the complement of O, meaning that it does meet its deadline), multiplied by the probability that the project is meeting its deadline.

Now, if you introduce the numbers from the previous overhead: 0.01 was the prior probability and 5% is the likelihood, you see here 1% and 5%, so 0.01 and 0.05, divided by the total probability of the event of an experienced engineer, and then we obtain 0.0006.

So you see, the Bayes theorem, even in cases where we don't have a database, where we don't have any statistical information but only assessed probabilities, can provide some guidance in regard to the value contained in information. We can update our knowledge based on our personal, subjective observations. That is very interesting, and it is very useful. A different engineer might have assessed these numbers differently, but at least for the individual engineer it provides a consistent basis for updating your own knowledge, or say for quantifying the value of your experience and your assessments in a systematic way. Do you have any questions to this small exercise?

The Bayes theorem, I can tell you, is really very, very useful, and training a little with this theorem on examples is really a good idea. You should become really comfortable with its application. The trick lies in really understanding how to apply it.
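As a check on the arithmetic, here is a minimal Python sketch of the Bayes formula with the probabilities stated in the lecture (the variable names are mine):

```python
# Bayes' theorem:
# P(O|E) = P(E|O) P(O) / (P(E|O) P(O) + P(E|not O) P(not O))
p_O = 0.01             # prior probability of a deadline overrun
p_E_given_O = 0.05     # likelihood: P(experienced leader | overrun)
p_E_given_notO = 0.90  # P(experienced leader | no overrun)

# total probability of observing an experienced leader
p_E = p_E_given_O * p_O + p_E_given_notO * (1 - p_O)

# posterior probability of an overrun, given an experienced leader
p_O_given_E = p_E_given_O * p_O / p_E
print(round(p_O_given_E, 4))  # -> 0.0006
```

Note how the weak likelihood (only 5% of overrunning projects have experienced leaders) pulls the already small prior down by more than an order of magnitude.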
The trick is to understand how the available information may be applied in the context of the Bayes theorem; that is really the difficult thing. How does the available information fit in? Does it provide information relevant for assessing the prior probability? Does it provide information relevant for assessing the likelihoods? And how can you formulate the events in order to obtain the prior and the likelihoods? The formula itself, the theorem itself, is extremely simple; anybody can learn it by heart. But the transformation of the available information into the use of the Bayes theorem is probably what requires a little training. Therefore, even though you are able to remember the formula, that is probably not enough; you should train yourself in using the available information such that it fits into the application of the theorem.

Okay, we go back to our descriptive statistics, and what we want to look at now is what we call quantile plots. I introduce a definition, namely that the Q-quantile corresponds to the value in a data set which is exceeded by (1 - Q) times 100% of the data. So if I am talking about the 0.75 quantile, then this quantile will be exceeded by (1 - 0.75) times 100%, that is 25%, of the data. The 0.75 quantile in a data set is exceeded by 25% of the data in the data set; this is a way to understand the quantile of a data set.

The quantile plots are generated simply by plotting the data against their corresponding quantile values. The data we already know, so we only need to calculate the quantile values, and these are usually calculated from the ordered data set. This is where it is really convenient to have the ordered data set: with it, we can calculate the quantile value Q_i simply as the position i of the individual data point in the ordered data set, divided by the total number of data plus 1, that is, Q_i = i / (n + 1).
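The rule Q_i = i / (n + 1) can be sketched in a few lines of Python (an illustrative helper, not part of the course material):

```python
def quantile_values(data):
    """Pair each point of the ordered data set with its quantile Q_i = i / (n + 1)."""
    ordered = sorted(data)
    n = len(ordered)
    return [(i / (n + 1), x) for i, x in enumerate(ordered, start=1)]

# five observations: the ordered values 2, 4, 5, 7, 9
# receive the quantiles 1/6, 2/6, ..., 5/6
pairs = quantile_values([7, 2, 9, 4, 5])
```

Plotting the second element of each pair against the first gives exactly the quantile plot described here.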
Looking at the traffic data from before, we have the ordered data sets for the two directions, going down the columns in increasing order. Correspondingly, for each of these two columns of observed numbers of cars we can calculate the quantiles. The quantiles are not directly related to the values of the data in the data set; they are related to the order of the data. So the quantile values are the same for the two data sets: there is only one quantile value for each pair of observed numbers of cars in the two directions.

What we can now do is plot the data points as a function of the quantile values: on the x-axis we have the quantile value, and on the y-axis the number of cars corresponding to that quantile value. Of course we can do that for the two directions individually, and this is what I have here for direction 2; the black dots on this quantile plot correspond to the quantile values.

Now there are a couple of observations to be made, a couple of characteristics which are very useful. First of all, the median value. We already introduced the sample mean value and the mode, and now I am introducing the median value. The median value is the 0.5 quantile value. It means that the median is the value in the data set which will be exceeded by 50% of the data. So the probability of having a higher value is 50%, also meaning that the probability of having a value smaller than the median is 50%.

Do you feel comfortable with the concepts of sample mean, mode, and median? They are all central measures for a data set, but do you have a feeling for how they relate to one another? Are they more or less the same? Try to think a little about that, because I will ask you a little later.

Now, there is also what we call the lower quartile. The lower quartile in a data set is the value which corresponds to
the 0.25 quantile value, and the upper quartile value is the value which corresponds to the 0.75 quantile value. So for the lower quartile value, 75% of the data will exceed this value.

Then there is another very convenient way of representing data, namely by means of the so-called Tukey box plot. The Tukey box plot is again a standardized way of representing data, and it relates to the quantile values. Up along this axis here, you imagine that we have the data values. Now, in the box in the middle, the midline corresponds to the median, the upper edge of the box corresponds to the upper quartile value, and the lower edge to the lower quartile value. Then we have a line going up to a horizontal bar, and this value we call the upper adjacent value. That, again, is the largest value in the data set less than the upper quartile plus 1.5 multiplied by r, where r is the interquartile range, that is, the difference between the upper quartile value and the lower quartile value. Correspondingly, the lower adjacent value is the lowest value in the data set larger than the lower quartile minus 1.5 multiplied by the interquartile range.

What is really useful here is that this gives immediate information about where the data are located. We have the information about the median: 50% of the data are above the median and 50% below. From the interquartile range we know that 50% of the data are located within this interval, so that gives us an immediate impression of where the data are located. Then we have the adjacent values, which give us an indication of the dispersion of the data outside the interquartile range. Whatever values we have in the data set which lie outside the adjacent values are called outside values, and we plot them up individually. Normally there are not so many; most values should be contained between the upper and the lower adjacent value.

How can we do that in practice? You can also try to practice a little on the examples in the textbook.
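These definitions can be collected into a small sketch (a hypothetical helper; note that textbooks differ on how to interpolate a quantile that falls between two data points, so this uses the simplest nearest-rank convention):

```python
def tukey_box_stats(data):
    """Statistics for a Tukey box plot, using the quantile rule Q_i = i / (n + 1)."""
    ordered = sorted(data)
    n = len(ordered)

    def quantile(q):
        # pick the ordered value whose quantile i / (n + 1) is closest to q
        i = min(range(1, n + 1), key=lambda k: abs(k / (n + 1) - q))
        return ordered[i - 1]

    median = quantile(0.50)
    lower_q, upper_q = quantile(0.25), quantile(0.75)
    r = upper_q - lower_q  # interquartile range

    # adjacent values: largest value below Q3 + 1.5 r,
    # smallest value above Q1 - 1.5 r
    upper_adjacent = max(x for x in ordered if x <= upper_q + 1.5 * r)
    lower_adjacent = min(x for x in ordered if x >= lower_q - 1.5 * r)

    # outside values fall beyond the adjacent values
    outside = [x for x in ordered if x < lower_adjacent or x > upper_adjacent]
    return median, lower_q, upper_q, lower_adjacent, upper_adjacent, outside

# eleven observations with one extreme value
stats = tukey_box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])
```

For these eleven observations the extreme value 100 falls outside the adjacent values, so it would be plotted individually as an outside value.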
First of all, we need to evaluate the statistics, the numbers which go into the Tukey box plot: the lower adjacent value, the lower quartile, the median, the upper quartile, and the upper adjacent value. And then we need to list up the outside values. So we make a table like this, where we have the corresponding values for direction 1 and direction 2. The median for direction 2, for example, would be 5100, and around these medians you have the corresponding lower adjacent values, lower quartiles, upper quartiles, and upper adjacent values. In the lower part of the table, for direction 1 and direction 2, we have the outside values; they have to be plotted individually.

Now, I have done something which is very, very bad. Looking at the graph to the right on the overhead, can anybody give me an indication of what is really, really bad in this graph? It is something we often do. We are not supposed to do it, but we forget it now and then. I will simply explain what is in this part of the overhead: here we have the Tukey box plots corresponding to direction 1 and direction 2 of the traffic data. When we look at these figures, we get a graphical impression of where the data are located within the ranges of the axes, and when we have two such graphical representations next to one another, the obvious thing we do is make an immediate comparison between the data of the two data sets. And there is one thing which we really, really need to be careful with: if we want to communicate data to other people, and we want them to understand what is contained in the data sets, we need to be sure that the axes in figures which are aligned correspond to one another. What you see here in this figure is that the axes are different, and that is really bad, because in a way it is a kind of lying. You could see it as an attempt at not being honest, giving the person we are communicating with an impression which is actually
not contained in the data. And of course you can do that; you can very easily do that. It is like magic: you show something, and you keep away some other information. So it is manipulation. Of course I did not intend to manipulate, but rather to communicate this very important aspect of representing data in a good and honest way. You know, there is a saying which goes more or less like this: there are liars, there are damned liars, and there are statisticians. These are the three degrees of liars.

Then we have something which we call quantile-quantile plots, and these are relevant if we are considering two data sets, like the traffic in the two directions. We can plot, in one diagram, in one graph, the corresponding values in the two data sets which have the same quantile value. You remember, in the table to the right in this figure we have the quantile values for the two data sets. That means that for each quantile value there are two corresponding values in the two individual data sets. For the 0.03 quantile value in the table, you see that for direction one we have 3087 observed cars, and for direction two 3677. If we now plot up this pair of values, 3087 and 3677, jointly in a figure like this, and we do that for all the pairs in the table with the same quantile values, then we get what we call a quantile-quantile plot. That kind of representation shows you how the data are located relative to one another: what are the characteristics of the traffic flow in direction one relative to the traffic flow in direction two? And what you see from this figure here is that there seems to be a systematically higher traffic flow in direction two as compared to direction one.
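The pairing behind a quantile-quantile plot can be sketched as follows (an illustrative helper with made-up numbers, not the traffic data; it assumes two data sets of equal size, as in the lecture's example):

```python
def qq_pairs(x, y):
    """Pair the values of two equal-sized data sets that share a quantile.

    Since the quantile Q_i = i / (n + 1) depends only on the position in
    the ordered data set, pairing by quantile means pairing the sorted
    values position by position.
    """
    if len(x) != len(y):
        raise ValueError("QQ pairing needs equal-sized data sets")
    return list(zip(sorted(x), sorted(y)))

# every second value lies above the first: the second data set
# is systematically larger, as with the two traffic directions
pairs = qq_pairs([3, 1, 2], [30, 10, 20])
```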
We also have a plot which is called the mean versus difference plot, where on one axis we plot the mean value of the corresponding data points in the two data sets, the value (Y_i + X_i) / 2, simply the mean value, plotted against the difference between the two data sets, Y_i - X_i. And we do that for all the data which we have in our set of observations. The information contained in a plot like this is related to the information contained in a quantile-quantile plot. What we see here is again a systematic difference in the data: up to a mean traffic flow of around, let's say, 6000 cars per day, there seems to be a systematically higher traffic flow in direction two as compared to direction one. From this value and upwards, we have a nice correspondence between the mean and the difference.

So, in summary of the graphical representations: we have the one-dimensional scatter plots, illustrating the range and the distribution of a data set along one axis. This is more or less the first graphical representation, just to see the data. We have the histograms, where we have to subdivide the data into intervals. The selection of the interval size is a little problematic; we have to try to get a nice representation of the data by selecting the intervals appropriately, and we may have to iterate a little to do that, but one or two tries is normally sufficient to obtain a nice graph. Using the histogram we get an immediate idea of where the central value is located (we can directly observe the mode of the data), we also see the range of the data, and we get an idea about the symmetry of the data: how are the data distributed around, say, the mode? Then we have the quantile plots, which illustrate immediately the location of the median and also indicate the distribution and the symmetry of the data. We have the Tukey box plot, which specifically indicates the location of the median and of the upper and lower quartiles, the values exceeded by 25% and 75% of the data respectively. From the Tukey box plot we also get an idea about the symmetry and the dispersion of the data. And then we have the quantile-quantile plots and the mean versus difference plots.
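The mean versus difference construction can be sketched in the same style (illustrative numbers, not the traffic data):

```python
def mean_difference_pairs(x, y):
    """For paired observations (x_i, y_i), return the points ((x_i + y_i) / 2, y_i - x_i)."""
    return [((xi + yi) / 2, yi - xi) for xi, yi in zip(x, y)]

# a constant positive difference across all mean values signals a
# systematic shift between the two data sets
points = mean_difference_pairs([1, 2, 3], [2, 3, 4])
```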
I have the impression that some of you are packing your stuff, but it's way too early, so please be ready and please follow me carefully. The QQ plots and the mean versus difference plots are very convenient when we want to see how the data in two data sets are located and dispersed relative to one another.

And now I hope that you are really ready for the last exercise of today. It's a very easy thing, I guess, because it relates to understanding the difference between the sample mean value and the median value. Are you all listening carefully? I see a guy with glasses sitting three tables from me, and he's not even listening to what I'm saying. He was smiling, but now he's not smiling so much. But you can keep on smiling, it's okay.

I was addressing the question: what is really the characteristic of the sample mean relative to the characteristic of the median value? I want you to consider this small example, where we have a group of persons; I could have selected a subgroup of you. If we look at the ages of the persons in this subgroup, I have here the ages of nine persons, and first of all I would like you to give me the mean value. Oops, hang on. Now, the mean value. This can be very fast; you already know how to evaluate the mean value. So come on, the colors! The colors are very consistently green, I would say, which was a very good answer, so we don't need to discuss that much. Now, the median. You don't need to think so much. Come on, more colors. It's very consistently yellow. Whoa, now we can see that green is equal to yellow. That's interesting... no, don't make conclusions like that. But it is the right answer.

Now I have put one more data point into the data set: I have included my own age. Come on, I would like to know, what is the mean value now? Come, you're using your coffee break. I want more colors. More, more. It seems to be very green: 30.5, which was a
very nice guess. And now I want to know the median. Come, we need more colors. Red: 25.5, which is completely correct. So the conclusion is that whereas the mean value of a data set can be very sensitive to just one additional number in the data set, the median is in general very insensitive. So don't get the confusion in your head that the median and the sample mean are basically the same, because they are not. Thank you very much for today.
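The sensitivity discussed in this last exercise is easy to reproduce; the ages below are invented for illustration (the lecture's actual numbers are not reproduced here):

```python
import statistics

# nine hypothetical ages
ages = [22, 23, 23, 24, 25, 25, 26, 26, 27]
before = (statistics.mean(ages), statistics.median(ages))

# add one much older person, say the lecturer
ages.append(60)
after = (statistics.mean(ages), statistics.median(ages))

# the mean jumps by more than three years, while the median is unchanged
```

One extreme observation drags the sample mean noticeably while leaving the median untouched, which is exactly the point of the exercise.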