All right, so we're back. Let's start with the lecture. Like I told you, today we will be doing descriptive statistics: very basic statistics that you can use to describe the data that you have and how it fits what you expect. First we will cover univariate statistics, that is, statistics based on a single variable. We will talk about central tendencies: what is the mean, what is the median, and what is the mode of the distribution you're working with. We will look at dispersion: how can I use R to calculate things like range, quantiles, variance, and standard deviation. I will say a few words about outliers, because outliers are always difficult to deal with, and what counts as an outlier is up to you. And then I also want to talk a little bit about the shape of a distribution, because in biology distributions tend to be normal distributions, but not perfectly normal; there's always some skewness or kurtosis in there. Furthermore, I want to give you a very quick introduction to boxplots, histograms, and images and heatmaps, and how you can use R to make them. Boxplots can be univariate, histograms are definitely univariate, and images and heatmaps generally tend not to be. We will have a whole lecture about plots, but I wanted to give you some tools for making plots now, and those tools will of course also be used in the assignments. The assignments for today are really, really short, so if you didn't finish the assignments from last week, you will have plenty of opportunity this week to catch up. And you have the answers directly online, so you can cheat a little bit and look at the answers that I did, if, of course, you didn't finish it yourself.
All right, so first off, statistics is the study of the collection, analysis, interpretation, presentation, and organization of data. On an exam, this might be one of those questions: what are the five key activities in statistics? Statistics means I'm collecting data, I'm analyzing data, I'm interpreting it, I then present it to other people, and of course the organization of the data itself is also something that a statistician is involved in. Univariate analysis means analysis done on a single variable, like body weight or the length of a tail, while bivariate analysis generally concerns two variables and the relationship between them: how does body weight influence tail length, or how does body weight influence height. So today, only univariate statistics, only single variables that we are going to look at and organize. So a question to you guys. When we talk about the mean, in German this is called the Mittelwert, how many different means are there? Because people always say, I calculated the mean, and then they just give you the number; in many scientific papers, people never mention which mean they used. All right, so first guess from General Gulak: two. Anyone have more? Anyone have less? Are there more than two means? Is there just one mean? Is the mean always the mean? Alexander says four. Lydia says three. All right: average, weighted average, arithmetic and geometric. So already we have a couple. The median is not the mean; the median is something completely different. And there's actually only one median. So although the mean can have very different meanings, there is only one median. All right, so the answer is that there are three.
So the first one is the one that everyone knows: the arithmetic mean. It's just the sum of all of the numbers divided by the number of numbers. Mathematically, you write x bar, the mean of x, equals the sum of x1 to xn divided by n. The next one is the geometric mean. The geometric mean is very similar to the arithmetic mean, but instead of summing the numbers up, you multiply them together and then take the nth root, n being the number of numbers. As an example, for the five numbers on the slide, the arithmetic mean is 42 while the geometric mean is only 30. Mathematically, the geometric mean is the product from 1 to n of the numbers, raised to the power of 1 divided by n; I hope everyone knows that taking the nth root is the same as raising to the power 1/n. Furthermore, we have the harmonic mean. The harmonic mean is again very similar to the arithmetic mean, but instead of summing up the numbers and dividing by the number of numbers, we take the number of numbers and divide it by the sum of 1 over each number. In our case, the harmonic mean of the numbers that we have is 15. The way you write it down is x bar equals n times the sum of the reciprocals, to the power of minus 1; a power of minus 1 is just 1 divided by. So, three different means. When do we use them? The geometric mean is often used when comparing different items that have different numerical ranges. The arithmetic mean I'm not going to explain.
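All three formulas translate directly into R one-liners. The slide's actual numbers aren't in the transcript, so the vector below is a made-up example of my own:

```r
# Hypothetical sample; the lecture's slide numbers are not shown here
x <- c(2, 8, 32)
n <- length(x)

arith <- sum(x) / n          # arithmetic mean: sum over count = 14
geom  <- prod(x)^(1 / n)     # geometric mean: nth root of the product = 8
harm  <- n / sum(1 / x)      # harmonic mean: n over the sum of reciprocals = 32/7
```

Note how the geometric mean (8) sits below the arithmetic mean (14), and the harmonic mean (about 4.57) below both; that ordering always holds for positive numbers.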
I think that everyone knows when you use the arithmetic mean: you sum things up and divide by the number of numbers. But the geometric mean is very useful in things like finance, or when you have numbers with very different ranges. For example, say I have two companies and measurements on them in two different ways: an environmental sustainability score which ranges from 0 to 5, and a financial viability score from 0 to 100. If we calculated the arithmetic mean, a company scoring 5 on environmental sustainability and 0 on financial viability would be treated very unfairly, because the first measurement only has a range of 0 to 5 and the second a range of 0 to 100. The financial viability would count 20 times as heavily as the environmental sustainability, and we don't want that; we want both measurements to count equally in the mean. So this is when we use the geometric mean: it gives us a single figure of merit, a single number, which weighs the two measurements equally. The harmonic mean, also called the subcontrary mean, is for situations where you have rates or ratios; there the harmonic mean provides the truest average. This is one of the examples that I often use. We have a car that travels an unknown distance at 60 kilometers per hour, and then the car travels the same distance at 40 kilometers per hour. What is the average speed? You might reason: well, it traveled at 60 and then at 40, so it averaged 50 kilometers per hour. But that's not true, because the key here is the same distance. Say it first drove 60 kilometers.
So it drove an hour to do those 60 kilometers, and then it drives another 60 kilometers at 40 kilometers per hour. That second leg takes much longer, because it's still 60 kilometers, but now at 40. The harmonic mean gives you the real average speed of the car across the whole trajectory. Of course, if the question had been posed as: a vehicle travels a certain amount of time at speed x and then the same amount of time at speed y, then the average speed would be the standard arithmetic mean. But because here we have the same distance, we are dealing with ratios: 60 kilometers at 60 kilometers per hour, then 60 kilometers at 40 kilometers per hour. The difference is subtle, but it comes down to the fact that the same distance at a lower speed takes a longer time, so it lowers your average speed across the whole trajectory. In R, you only have the arithmetic mean built in. If you want to calculate the mean, you use the mean function. It takes a vector of numbers, and there are two important parameters that you can set. One of them we probably already saw before: na.rm, which indicates whether NA values should be stripped before the computation. Otherwise, if you have a single missing value, R will tell you that it cannot calculate the mean, unless you specify na.rm = TRUE, telling R to leave out the missing values. The default, na.rm = FALSE, is probably the truest behavior, because you can't say what the mean is when there is a missing value: the missing value could be anything, and since you don't know it, you also don't know the mean.
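A quick sketch of both points, the car example and the na.rm behavior; the height values below are invented for illustration:

```r
# Same distance at 60 and at 40 km/h: the harmonic mean gives the true average speed
speeds <- c(60, 40)
2 / sum(1 / speeds)            # 48 km/h, not the naive 50

# mean() refuses to guess when a value is missing, unless na.rm = TRUE
heights <- c(1.62, 1.75, NA, 1.58)
mean(heights)                  # NA (default na.rm = FALSE)
mean(heights, na.rm = TRUE)    # 1.65, computed over the three known values
```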
But if you want to get an estimate anyway, then of course you can set this to TRUE. Besides that, there's the trim option: the fraction of observations that should be trimmed from each end of x before the mean is computed. You can, for example, leave out 5% of the data on the lower end and 5% on the higher end, which is done a lot when calculating means, because you don't want very high or very low values having a big impact on your mean. Question: does trim also work when you have two high values at one end and one extreme value at the other? No, it trims the same fraction at the bottom and at the top. If you don't want that, you can do it yourself: pre-process your vector of measurements and leave out the ones that you think are outliers. But trim is very useful when you, for example, work on a matrix. If you have a matrix with 10,000 rows and you want to calculate the mean for each row while discarding the top 5% and the bottom 5%, then you use trim, because you're not going to make a plot for every row, see how many outliers there are, and remove them manually. Trim is there for when you do 10,000 or 100,000 mean calculations in one go. All right, the median, already mentioned before, is the number separating the higher half of the data sample from the lower half. The median can be found by arranging all values from lowest to highest and picking the middle one. In case you have an even number of values, there is no middle value: you don't have 50% below and 50% above, the 50% point is exactly between two values. The median is then defined as the arithmetic mean of the two middle numbers. So in R, again, you have a median function.
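A sketch of trim and of the even-length median on a made-up vector:

```r
x <- c(3, 5, 6, 6, 7, 8, 9, 200)   # one obvious outlier at the top

mean(x)               # 30.5, dragged up by the 200
mean(x, trim = 0.125) # drops the lowest and highest value, so mean(5,6,6,7,8,9) ~ 6.83
median(x)             # even n: arithmetic mean of the two middle values 6 and 7 = 6.5
```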
So you can call the median function on a vector and it will calculate the median. And again, there's the na.rm parameter to remove missing values, because, just as with the mean, if there's a missing value the median is officially not defined, but you can say: give me a median anyway. The mode is the value that appears most often in a set of data. The numeric value of the mode is the same as the mean and the median when we're talking about a true Gaussian distribution: if you have a perfect normal distribution, then the mean is equal to the median, which is equal to the mode, the value that occurs most often. But this only holds when you have a perfect Gaussian. If you have a slightly skewed or highly skewed distribution, then the mean, the median, and the mode will differ, and then it is up to you how you want to present your data: the mean or the median. When data is highly skewed at one end, people go for the median. When there is skew on both ends of the distribution, people will often refer to the mode as the truest average of the distribution. So again: the mode is the value that appears most often in a set of data. One caution here: base R does have a function called mode, but it returns the storage mode of an object (like "numeric"), not the statistical mode, so you have to compute the statistical mode yourself, for example by tabulating the values. So, when we are talking about dispersion, then we also have to talk about the range of the data: the difference between the largest and the smallest value. In R, we have a range function, and this range function gives the minimum and the maximum value back. So if we want the official mathematical range of the data, we have to say diff(range(x)), where x holds our measurements.
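Two quick demos of those last two points. Note that the statistical mode has to be computed by hand, since base R's mode() reports an object's storage mode rather than its most frequent value; tabulating with table() is a common trick:

```r
x <- c(3, 5, 5, 7, 9, 5, 3)

# Statistical mode: tabulate the values and pick the most frequent one
counts <- table(x)
as.numeric(names(which.max(counts)))   # 5, which occurs three times

# Range: R returns c(min, max); diff() gives the mathematical range
range(x)        # 3 and 9
diff(range(x))  # 6
```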
And this is because when you ask R for the range, it gives you the minimum and the maximum, and we want to know the difference between these two numbers, because that is the real range of our data. So you can describe your data by saying: my data has a mean of five, the median is 5.3, the mode is 5.7, and the range is from one to 15. By giving these numbers, people who are relatively good with numbers can already guess what kind of distribution you have and roughly what the histogram would look like. But there's a different way of describing what your data look like, and that is quantiles. Quantiles are an extension of the median. The median is the middle number: it divides the data into the lowest 50% and the highest 50%. With quantiles, you do the same thing: you order your data from the lowest to the highest value and then divide it into q essentially equal-size subsets, and those are the q-quantiles. What I show you here are the four quantiles, because I'm dividing my data into four groups: from 0 to 25%, from 25 to 50, from 50 to 75, and from 75 to 100. So quantiles are the data values marking the boundary between consecutive subsets: the value below which the lowest 25% falls, then the median, then the value below which 75% of the data falls, and this gives you a very good overview of what your data look like. There are actually nine different types of quantile computation; we won't go into them, R just provides one by default, but there are eight other ways of calculating quantiles. And of course, we can look at the quantiles of a uniform distribution.
A uniform distribution means that every number between zero and one has an equal chance of being drawn, and then you will see that the quantiles are very close to the numbers you would expect: the 0% quantile is 0.004 because we didn't draw an exact zero, the 25% quantile, meaning the point where the lowest 25% of the data ends, is 0.27, 50% of the data ends at 0.5, and 75% ends at 0.75. Testesaurus asks whether quantiles give the median of the four ranges. So, the median is the two-quantile, because you are dividing your data into two subsets: the lowest 50% and the highest 50%. You can also divide your data into the lowest 33%, the middle 33%, and the highest 33%: that's the three-quantile. What we're looking at here is the four-quantile: the lowest 25%, then up to the median, then another 25%, and then the highest 25% of your data. And you can of course divide your data into as many quantiles as you want: four, seven, eight, whatever you like. But in general, quantiles are there for dealing with normal distributions, because with normal distributions you usually want to know the lowest 5% and the highest 5% of the data. In statistical testing, the 5% threshold comes from the alpha value. For example, I have a distribution of test scores on random data, and I want to know below which cutoff a real measurement is significantly smaller. So I compare my real measurements to the 5% quantile, which means q = 20: I divide the data into 20 equal parts, so the first part is the lowest 5%, the second part runs from 5% to 10%, and so on. The quantiles show you what the normal distribution looks like.
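The uniform example can be reproduced like this; my draws differ from the lecture's, so the numbers only roughly match the ones on the slide:

```r
set.seed(1)
u <- runif(1000)                      # uniform draws between 0 and 1

quantile(u)                           # quartiles: roughly 0, 0.25, 0.5, 0.75, 1
quantile(u, probs = seq(0, 1, 0.05))  # 20-quantiles, including the 5% and 95% cutoffs
```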
And generally we're only interested in the lowest 5% and the highest 5%, because when we are doing simulations and want to figure out where the 5% threshold lies, we can use the q = 20 quantiles to calculate it for us. It's just dividing your data into equal groups and then giving the boundary numbers for these groups. And based on these numbers alone: if you didn't tell anyone that you drew numbers from a uniform distribution, but only gave them the four quantiles, then someone who is into numbers and statistics would say directly: ah, you drew from a uniform distribution, because the lowest 25% ends at about 25% of the total, the lowest 50% at about 50%, and so on. That's what quantiles are useful for: they describe the shape of your data, and describing the shape of your data is something you need to do for others, for example in your paper, so readers can judge whether the statistical test you used is valid for the data that you're looking at. All right, so quantiles divide your data into equal subgroups. Then we also have the spread of the data. The spread is related to the range: the range is the difference between the lowest and the highest number in your data, but the variance is a measurement of the spread of a set of numbers. If your variance is zero, it indicates that all the values are identical: if I say that I have a mean of five with a variance of zero, then people know that all the thousand measurements I did are exactly five. Variance is always non-negative; you can't have negative variance. So variance is a measurement of how spread out your measurements are around the mean of your data. In R you have the var function.
So you just have a function called var, you put your numbers into it, and it gives you the variance. There are two types of variance to keep apart. The first type is the population variance: if you have measured all available samples, for example, imagine that I have a mouse population of 500 mice and I measure all 500 of them, then I can calculate the population variance, and I use big N, which here represents 500, the total population. There are no more; I cannot measure anything more. If I only took a finite sample, say out of the 5,000 mice that we have in our mouse house I only measured 500, then I'm calculating the finite sample variance, which means there are more measurements I could have done but didn't, because it's just a finite sample from an effectively infinite population, or a population which is bigger than the sample that I took. The population variance is denoted sigma squared; the finite sample variance is denoted s squared, and this often goes wrong in papers. If you are writing a paper about humans and you write that the sigma squared of your distribution is 0.5, then a mathematician will say: no, that is wrong, because you did not measure all humans on the planet, only a very finite sample of humans. Jimmy Grump, thank you for following. So there's a difference between the population variance and the finite sample variance. Of course, if your finite sample is big enough, if small n approaches big N, then there will not be a difference. But keep in mind that the difference exists. The definition, in both cases, is based on the average squared deviation from the mean. Question: is the population variance the true value?
Yes, the population variance is the true value: because you measured the whole population, you can't add any more measurements, so it can't change at all. If you measured all humans on the planet and the average height was one meter 60 with a spread of 20 centimeters, then that is simply what the distribution looks like. Next question: what does it mean if the variance is bigger than the arithmetic mean, because of outliers? Nothing special by itself: the variance just describes how spread out the numbers are around the mean. But the numbers you report do tell people whether you did things correctly. Imagine that you measured human height, one meter 60 on average, and you report a spread of two meters: that would imply that there were humans with a negative height. So variance is nothing more than an idea of how spread out your measurements are, whether they are very close to the mean or very far from it. All right, next question: what are we trying to calculate when we work with a finite sample? With a finite sample, what you want to calculate is an estimate of the variance in your measurements, and of course, the more samples you measure, the closer the finite sample variance gets to the real population variance. When computing a finite sample variance, you have to keep in mind that this estimate comes with a 95% confidence interval, because you cannot be sure that the population variance is equal to your finite sample variance. From the difference between the finite sample and the population, you can estimate how right you are likely to be.
So when I calculate variance on my population of 500 mice and I only use 10, so my finite sample n is 10, then I can estimate how big the variance could be: it could be much smaller than I initially estimated, and it could be much bigger. The finite sample is just the set that I used; it doesn't have to be representative. If I have 500 mice and I only measure 10, then the finite sample is not necessarily representative of the 500, but I can get an idea of how representative it is and what the boundaries will be for the population. That's the idea behind finite sample versus population: if you measure the whole population, you know the truth, and at that point you don't have any error around it anymore. Normally, when you calculate variance, you say: I have a population with a mean of three and a variance of 0.5, but the real variance is somewhere between, for example, 0.44 and 0.56, because you only took a finite sample. But if you talk about sigma, about the population, then there is no error in your variance calculation. So the finite sample should ideally be a representative sample, but it doesn't have to be. I hope that's clear. And these are the formulas: you sum up squared deviations from the mean. I don't need you to know them by heart, and I'm not going to ask you for the formulas either. I just put them here to show you that the only big difference is that the finite sample formula divides by n minus 1, while the population formula divides by N, the real population size. So in one case you divide by the number of samples in the whole population, and in the other by the number of samples that you measured minus one, and that has to do with degrees of freedom in your data.
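Here is a small numeric check of the n versus n minus 1 difference, plus the standard-deviation rule of thumb; the vector and the height simulation are my own illustrations, not the slide's:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

var(x)                    # 32/7 ~ 4.57: R's var() divides by n - 1 (finite sample, s squared)
sum((x - mean(x))^2) / n  # 4: the population variance sigma squared divides by n instead
sd(x)                     # the same as sqrt(var(x))

# Rule of thumb for normal data: about 95% lies within plus or minus 2 standard deviations
set.seed(1)
h <- rnorm(1e5, mean = 165, sd = 7)   # simulated human heights in cm
mean(abs(h - 165) < 2 * 7)            # close to 0.95
```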
All right, so dispersion: variance is one thing, but people generally like to report standard deviations. A standard deviation is a measurement used to quantify the amount of variation or dispersion of a set of data. A standard deviation of zero indicates that all data points are equal to the mean; the closer to zero, the closer the points sit to the mean. In R, you can use the sd function to calculate the standard deviation, or take the square root of the variance. If you want a more robust picture, there's also the average absolute deviation, but generally in papers people just tend to write: we measured humans, humans on average are one meter 65, and the standard deviation that we measured is such-and-such. The standard deviation is related to the sample that you used, and it gives an indication of what the normal distribution looked like: about 68% of the data lies within one standard deviation of the mean, and about 95% within minus two to plus two standard deviations. So again, you're dividing your normal distribution into regions. Anyway, the important thing here is: you can just use the sd function or the var function in R to calculate it, and mention it in your paper. All right, outliers are something that also relates to this, because we have our measurements, and sometimes there is an observation point that is very distant from all the other observations. It can be that there is real variability in the measurement: we could have a very heavy-tailed distribution, which means that our original distribution is not really a normal distribution. Something like this happens when we talk about human height, where we set a threshold and say that we do include people with dwarfism: of course they are part of the human population, so they should be measured.
But of course, if you use your finite sample variance to approximate the real variance of the population, then it's very unfair if, out of 100 people measured, by chance 10 of them have dwarfism, because in the whole human population it is not 10% of people who have dwarfism. In this case, these 10 measurements seem like outliers, but they are not, because the variability in the measurement is real: a person with dwarfism really was only one meter 10, very far away from the rest of the distribution. These measurements are really hard to deal with, because normally you want to sample from your population, and if you know that, say, one in 150,000 humans has dwarfism, then strictly you would have to measure at least 150,000 humans before you are "allowed" to include one person with dwarfism. But in practice you would probably measure 500 or 1,000 people, which is already a lot of work, and having one or two people with dwarfism in there means you have one or two measurements which look like outliers but are not: they are real measurements, real people, you can look them up, and they really are one meter 20, nowhere near the 160 that is expected. Of course, the most common reason for outliers is experimental error. Someone is measuring mice in a lab during dissections, so it's all bloody, and they are writing down things like tail length, and for one mouse they put the decimal comma wrong: instead of writing that the tail length is 7.3 centimeters, they wrote down 73 centimeters, because they forgot the comma or put it in the wrong place. That is experimental error. And how do you deal with outliers? That is very subjective, and whether an observation even is an outlier is also subjective.
Because something that I might think is way too far from the mean of the distribution, other people might find acceptable: well, it's only two or three standard deviations away, so it's not that crazy. So it is a very subjective subject, and here I want to introduce one term for dealing with outliers. Generally, outliers are replaced by missing values. However, there was a statistician in the first half of the 20th century, Charles P. Winsor, who came up with a technique that is named after him: Winsorizing. So say I make 100 measurements and I see that two of them are way, way too big, but they are not a comma error, because even if I shifted the comma, they would still not fit the distribution, or they would go from being massively above the distribution to massively below it. What Winsor said was: if a measurement is much, much bigger than the biggest other measurement that I have, I'm just going to replace its value with that biggest measurement. So I strike out 73 centimeters, and since the maximum that I otherwise measured is 12 centimeters, I just change 73 into 12. Scientifically, this is a questionable practice, but it is an allowed one. If you have your measurements and one of them just looks totally off, you can say: well, we Winsorized our data, we Winsorized two numbers, and that is fine; everyone will accept that in scientific papers. I always find that really, really funny, and that's why I like telling this story: we always say that data is sacred, but it's kind of not.
You can just Winsorize your data and say: well, we found two outliers and we Winsorized them into the distribution, and that is perfectly fine, allowable, and a very common scientific practice. Another way to deal with an outlier that looks very weird is to remove the value and replace it with the mean of the distribution; people sometimes call that Winsorizing too, although strictly speaking Winsorizing replaces an extreme value with the nearest remaining value. Question from the chat: in economics, you just cancel outliers? You mean you remove them, set them to missing values, just put your thumb on it and pretend you never measured it? Yes, in biology that is also the standard, but you don't have to, because sometimes you don't want to introduce a missing value. For example, principal component analysis requires your data to be complete: you can't have missing values when you do a principal component analysis. So if you want to do one, you have to deal with the outlier, and generally people just Winsorize it in: if it's much larger than the largest other number, they make it the largest number, or they give it the value of the mean, and if it's much lower, they set it to the minimum value they measured, or again Winsorize it to the mean. Winsorizing is a perfectly acceptable scientific practice; you just have to mention it in your paper. That's why I'm talking about it for five or six minutes: I love the term Winsorizing and I love using it in papers. Every time that I can sneak it into a paper and say: well, we Winsorized the outliers away, I do. And it's not cheating, because you're honest: you say that you Winsorized, you tell them how you Winsorized, and then it's perfectly reproducible. And if it's reproducible, it's science.
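A minimal Winsorizing sketch; the function name, the percentile cutoffs, and the mouse-tail numbers below are my own choices for illustration, not from the lecture:

```r
# Clamp anything below the 5th percentile up to it, and anything above
# the 95th percentile down to it (one common flavor of Winsorizing)
winsorize <- function(x, lower = 0.05, upper = 0.95) {
  q <- quantile(x, c(lower, upper), na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])
}

tails <- c(7.1, 7.3, 6.9, 7.0, 73)   # the misplaced-comma mouse
winsorize(tails)                     # the 73 is pulled down toward the rest
```

Whatever cutoffs you pick, the key point from the lecture stands: write down in the paper exactly which values you changed and how.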
So you just have to write down what you did. And even if what you did is something as drastic as saying, well, our measurement was 93, but we replaced it with 72, then that's valid, it's acceptable. All right, enough about Winsorizing. Winsor's life story is also pretty funny to read up on. He was not the best scientist in the world, and the fact that he is still remembered for skewing your data in exactly the way you want it to be skewed, and writing it down, is quite interesting. All right, so talking about skewness of data. I just mentioned that an outlier can be a real measurement, right, if you measure humans. "Take the mean instead of outliers, which is calculated without outliers"? What do you mean by that, Alexander? Could you elaborate on your question a little bit? Yes. All right, you type. I'm going to go to the next slide, and once you have your question typed out, we will discuss it. But yeah, no, it's perfectly fine to Winsorize some of your data beforehand. So just make a histogram, you see five outliers, you push them into the distribution. No problem, no questions asked. Everything's fine. Just write down what you did and which numbers you changed. Yeah, so here we talk about the fact that an outlier can be true, right? We can have a dwarf in our dataset. Okay, yeah, just send me an email, no problem. And so if you have a heavy-tailed distribution... oh wait, this is not the kurtosis. So the kurtosis is this, right? Here we have three normal distributions. Only the middle one is a truly normal distribution; this is called a mesokurtic distribution. But we can have normal distributions which are much more peaky, right? So the variance is much less. And then we call it positive excess kurtosis, which is also called leptokurtic. So leptokurtic means that the data more or less follows a normal distribution.
But the distribution doesn't have the variance that we would expect. A platykurtic distribution means that there's negative excess kurtosis, which means that we have a normal distribution which someone kind of pushed down on, right? So the spread of the data is much, much wider than we would expect based on a normal distribution. You can actually calculate this in R using the psych library, which has a function called kurtosi(). So kurtosi(x) on your measurements will tell you if there is zero, positive, or negative excess kurtosis, and then you can write in your paper: we measured human height and found a normal distribution with positive excess kurtosis, right? It's just another way of telling people what your distribution looked like. We also have skewness. And skewness, again, comes in if you have a lot of people with dwarfism in your sample, or a lot of people who suffer from gigantism. This is a measurement of the asymmetry of your distribution, right? Here we have a perfectly nice normal distribution. But now we see that there are a lot of measurements which are smaller than what we would expect based on the normal distribution. So we can have a negative skew and we can have a positive skew. So the long tail is this one, and we have a fat tail, which is this one. And the skew can also be zero, which means that your distribution is symmetric, like a normal distribution. Grisopida says, maybe this is what he meant: the mean is not calculated if there are two extreme outliers, because in that case you would trust the median. Yeah, the median: in the case that a distribution is not a normal distribution but highly skewed, or has a lot of outliers, then the median is considered the truest average, right? And here I'm using the word mean for the arithmetic mean. But the average is kind of the middle of your distribution, right? So look at income, right, how much people earn in a population.
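In R that looks roughly like this (a small sketch, assuming the psych package is installed; kurtosi() and skew() are the function names psych uses, and the data here is simulated):

```r
library(psych)  # install.packages("psych") if you don't have it

set.seed(1)
x <- rnorm(1000)   # simulated measurements from a real normal distribution
kurtosi(x)         # excess kurtosis: should be close to 0
skew(x)            # skewness: should also be close to 0

y <- rexp(1000)    # a right-skewed sample, e.g. income-like data
skew(y)            # clearly positive
```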
And then of course, if you would calculate the mean, it would look very positive, because if you have a couple of millionaires, right, they earn a thousand times what an average person earns. So that would shift the mean up tremendously, because they earn thousands of times more. Like if you have Jeff Bezos in your country, that will make the mean income much higher. And then if Jeff Bezos moves away, all of a sudden the whole thing moves back, right? Because his income is so much more than the rest of the people's. Yeah, and of course, in this case it's a real measurement. You can't Winsorize it in. You can't say, well, Jeff Bezos, you only earned a hundred thousand, because that's not true, right? The measurement is real, so there's no way of saying I'm going to ignore his income or make his income less; it's a real measurement and it's fixed. So Winsorizing would not be appropriate in this case. But if we talk about the average of a distribution, so what we understand in everyday English as the middle of the distribution, then indeed, when there are some extreme values, like a couple of millionaires or billionaires in your population, these have a large effect on the mean but a very small effect on the median, because the median is the boundary between the lowest 50% and the highest 50%. And it doesn't matter if someone in the highest 50% earns 10,000 times more than someone else, because the median will still be the truest average. So is this clear? We have a long tail and we have a fat tail, negative skew and positive skew, and skew can be zero, meaning no skew at all. A long tail means that the tail goes on forever and ever, and a fat tail means that the tail is very broad, so it doesn't go on very far, but it is very big.
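To put numbers on that Jeff Bezos effect (the incomes here are made up):

```r
incomes <- c(30000, 32000, 35000, 38000, 40000)
mean(incomes)    # 35000
median(incomes)  # 35000

# Now one billionaire moves into the country
incomes2 <- c(incomes, 1e9)
mean(incomes2)   # about 166.7 million: dragged up massively
median(incomes2) # 36500: barely moves
```

One extreme but real measurement shifts the mean by orders of magnitude while the median stays where most people actually are.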
So again, you can use the psych library in R; it has a function called skew(), and this will tell you if there's negative skew or positive skew. I think it will also tell you if it's a long or a fat tail, but the long and the fat tail are not that interesting, because they're not really used further, while skew is something that you can compensate for. So if you have a skewed normal distribution, you can still use parametric statistics in some cases, if the parametric statistics that you are using can deal with skew. The same thing holds for kurtosis. If you have a distribution which is relatively normal but kurtotic, so pushed down or pushed up, right, so that the standard deviation is too much or too little, then sometimes you are still able to use things like parametric statistics, which give you more power. And if the statistic that you are calculating deals with kurtosis and skew, then you can just give it the skew and the kurtosis numbers, and the statistics will be adjusted accordingly. I do love the word platykurtic, by the way. It always makes me think of platypi, and I love Phineas and Ferb. So there you have it. Anyway, so kurtosis. I think it's a good time to take a break, and I forgot the recording. Why does no one tell me that I should start the recording? Now I'm sitting again tomorrow for an hour, listening to myself redoing the recording. It's horrible, it's horrible. I blame you guys, I blame you guys. And I can, because I'm just going to cut this out when I put it on Moodle. So all right, I will do a short break. I will be back at 4:10 and then we will continue. The break is some animal doing funny things. Cows, cows. So I will be back in like 10 minutes. And yeah, yeah, you're sorry. Yeah, I'm sorry too. Sorry for you guys, you just have to wait longer now before the lecture. "Is this an actual lecture for uni?" Yeah, it is. They're paying me to do this.
"So I just downloaded the one-hour clip from your Twitch recording." Yeah, I'm downloading the whole thing and then cutting it, but that takes some time. Anyway, I have someone in my office with a question as well, so I will be taking a short break. No, we can't Winsorize this away. All right, you just stumbled onto the stream. So welcome to the stream. That's good. We do a little break, funny animated GIFs, and I will be back at 4:10. So enjoy the animated GIFs and see you in 10 minutes. So I'm back, I had my break. I hope you enjoyed the funny GIFs. And again, welcome, Fail Whale. Nice name, nice name. Yeah, learning some data analysis can't hurt, that's definitely true. I also put all of the old recordings on YouTube. All right, so welcome back, also to the people that are watching this on Moodle or on YouTube. Last part of today's lecture. I think we have another 50 minutes left and still 12 slides, so I think we will be done a little bit earlier today. All right, so the applications. I already talked about this a little bit, but many statistical models and tests assume that you have a normal distribution in either the input or in the error term, so after fitting some parameters. For example, the Shapiro–Wilk test of normality can tell you if your distribution is a normal distribution, and it is built into R. If you have a vector of numbers, you can call shapiro.test() on that vector, and it will give you a p-value for the null hypothesis that your data come from a normal distribution, so a small p-value means the data are probably not normal. Besides that, you have D'Agostino's K-squared test, which is a goodness-of-fit normality test based on sample skewness and sample kurtosis. It is provided by external packages.
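shapiro.test() lives in base R's stats package, so no extra installation is needed; here is a quick sketch on simulated data:

```r
set.seed(42)
x <- rnorm(100)   # data drawn from a real normal distribution
shapiro.test(x)   # typically a large p-value: no evidence against normality

y <- rexp(100)    # heavily right-skewed data
shapiro.test(y)   # tiny p-value: the data are very likely not normal
```

Remember the direction: the null hypothesis is normality, so small p means "not normal", not the other way around.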
So if you have a distribution which is still kind of normal, but has a certain skew or a certain kurtosis, then this test can even tell you if the distribution that you have is good enough to do a certain statistical test. These tests are right at home in R, because R is the language for statistical computing. The advantage of having a normal distribution is that you can use a more powerful test. So you get more statistical power when your distribution is a normal distribution, simply because the underlying assumptions of the test are then true. That means you can use a more powerful parametric test instead of having to switch to non-parametric testing, which generally uses the ranks of the numbers and not the real numbers. So non-normal data has the drawback of forcing you to use non-parametric statistics. All right, so a little bit about plots. We are going to have a whole lecture about plots and making them beautiful. That will be not next week, but the week after that. And if everything goes well, we will also have a guest lecture for 30 minutes to an hour. Misha, a friend of mine, will talk about his work where he uses R to create real-time plots to monitor closed ecological systems. So a closed ecological system is a tube, and in this tube there is water, there are plants, and there are little animals. And this system is at an equilibrium, or that is the goal: to keep this system at an equilibrium. Of course, keeping it at an equilibrium is not good enough, because you want to add energy to that system to produce food. And his final goal is to have a closed ecological system which produces net food or net energy, so it can be used for things like space travel.
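As for those external packages: one implementation of D'Agostino's K-squared test (an assumption on my part, so check which package your course actually uses) is dagoTest() in the fBasics package; the moments package also has agostino.test() for the skewness component on its own.

```r
# install.packages("fBasics")  # if not already installed
library(fBasics)

set.seed(7)
x <- rnorm(500)   # simulated data to test
dagoTest(x)       # omnibus K-squared test combining sample skewness and kurtosis
```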
Hey, you can imagine that if you are on a spaceship to Mars, then during the six months that you're traveling, you need to grow your own food, but you want to do this in a closed ecological system, because of course you don't want any outside interference with it. So he uses R together with Arduinos to measure this system in real time and then create plots and see what's going on. He measures things like CO2, and he will tell you much more about it himself. So if everything goes okay, next week we have the week off because it's a holiday, and then on the 20th we'll have first a lecture about plots, so everything about plots, and then Misha at the end to talk about his work on creating real-time updating bar plots and graphs and these kinds of things. But in R, creating a plot can be done using the plot function, which of course is very logical. What the plot function does when you call it with a vector of numbers is that it creates a plot window and figures out, based on the numbers that you gave it, what the range of the data is, so the minimum value and the maximum value, and it puts this on the y-axis; the x-axis then just runs from one to the number of values, and it creates a dot plot of your data.
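A minimal example of that default behaviour (the data is made up):

```r
x <- c(3, 7, 1, 9, 4)

# Default: y-axis spans range(x), so 1 to 9; x-axis is the index 1 to 5
plot(x)

# The same data with a few optional cosmetics
plot(x, type = "b",  # "b" draws both points and connecting lines
     main = "Default plot of a numeric vector",
     xlab = "Index", ylab = "Value")
```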