the recording. Welcome back everyone. Also if you're watching this on Moodle, welcome back. So, the one sample t-test, relatively easy, right? You're just taking measurements and then testing whether these measurements are equal to a certain number that you are interested in, that is your hypothesis. So when we talk about two sample t-tests, we're comparing two groups. We have the independent, unpaired samples, and this is the real original Student's t-test. What this means is that we want to, for example, look at the effect of a medical treatment. So we recruit a hundred people, we randomly assign 50 to the control group and 50 to the treatment group, and preferably the two groups are equal in size and have the same variance. If the control group and the treatment group show a large difference in variance, you have to compensate for that. In R, this means that you set the var.equal parameter to FALSE; the test is then slightly less powerful, but it deals with the difference in variance. If we have repeated measurements, then we are doing a paired sample test, right? A paired sample t-test is when we have, for example, a hundred individuals, all of them get measured before they get the treatment and all of them get measured after they get the treatment. So when we do a before and after treatment measurement, this is called a repeated measurement t-test, and a paired sample t-test is on average more powerful because it reduces or eliminates the effect of confounding factors. You can imagine that if we have a hundred people and we randomly assign them to a control and a treatment group, it might be that our control group has slightly more people who smoke compared to the treatment group, or that there are slightly more females in the control group than in the treatment group.
So to get rid of this, we can do this paired sample test, where of course the people that smoked before the experiment generally still smoke after the experiment, right? And if we have a hundred people, then of course the number of males and females before and after treatment is the same. So if you can design your experiment in such a way that you can do a paired or repeated measurement test, it is always preferable to go for the repeated measurement test. Although in some cases you have to do an independent, unpaired test. For example, in a phase three trial for a medicine, you can't, or you're not allowed to, use paired samples. Although paired samples are better and give you more statistical power, it could be that the before and after treatment measurements have some issue. So in medical studies you are generally forced to use the independent, unpaired sample test, and that is the original Student's t-test. So like I said, many tests, or almost all statistical tests, have assumptions. The t-test assumes that your input data follows a known distribution; such tests are called parametric tests. So if there is a normality assumption in your statistic, then this is called parametric statistics or a parametric test, right? The t-test assumes that the input data for both groups, so the before and the after group, or whichever two groups you define, follows a normal distribution. So if you do a two-sample t-test, the assumption is that both groups, when you plot a histogram, follow a normal distribution. The assumption is that each of the two populations being compared follows a normal distribution, and you can test this in R. So if you have 100 measurements and you want to know whether these 100 measurements follow a normal distribution, you can use, for example, the Shapiro-Wilk test. This is called shapiro.test in R.
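As a quick sketch (with made-up data), checking normality with shapiro.test might look like this:

```r
# Made-up example: check whether 100 measurements look normally
# distributed using the Shapiro-Wilk test (shapiro.test in R).
set.seed(1)
x <- rnorm(100, mean = 10, sd = 2)   # drawn from a normal distribution

res <- shapiro.test(x)
res$p.value   # a large p-value means no evidence against normality
```

A small p-value (below 0.05) would tell you the data deviates significantly from a normal distribution, so you should switch to a non-parametric test.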
You just give it your measurement values and it will give you a p-value telling you whether your data is significantly different from normal or not. The variances, like I said, are assumed to be equal between the two populations. If they are not equal, so if in one group you have a variance of five and in the other group a variance of two units, then you are not allowed to do a Student t-test; you have to do a Welch t-test. And the Welch t-test is the default in R. If you do have equal variance in both groups, you can set the var.equal parameter to TRUE, and then you are doing a real Student t-test. Again, the Welch t-test is the default in R. Furthermore, the t-test assumes that your samples are randomly drawn from the population that you're interested in and that they are independent. And independence here is a difficult concept. Generally you don't want people who are related to each other, right? It would be very weird if the people in your control group are me, my uncle, my nephew, my aunt, and all of my close family members, and in the other group it's all random people drawn from the population. So the three assumptions are: normal distribution, samples are random, and samples are independent of each other. And of course these two things can bite each other, because making sure that your samples are independent means that you are not really drawing at random, right? If I had a population, for example in Berlin, and I randomly selected 10,000 people, then of course there would be brothers and sisters and fathers and so on among them. So real randomness is of course not always possible. But you always want to make sure that the samples you draw are a more or less random draw from the population and that the samples themselves are not related or entwined with each other in such a way that they are really independent.
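To make this concrete, here is a minimal sketch with invented measurements showing the Welch default, the classic Student's t-test via var.equal, and the paired variant:

```r
# Invented measurements for a control and a treatment group.
control   <- c(5.1, 4.9, 6.0, 5.5, 5.3, 4.8)
treatment <- c(6.2, 6.8, 5.9, 7.1, 6.5, 6.6)

# Welch t-test: the default in R, does NOT assume equal variances.
t.test(treatment, control)

# Classic Student's t-test: assumes equal variances in both groups.
t.test(treatment, control, var.equal = TRUE)

# Paired (repeated measurement) t-test: before/after on the SAME
# individuals, so element i of both vectors must be the same person.
before <- control
after  <- treatment
paired_res <- t.test(after, before, paired = TRUE)
paired_res$p.value
```

The paired version is mathematically the same as a one-sample t-test on the differences, which is why the pairing of elements matters so much.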
Independence also means, for example, that you don't want all the people in the control group to be two meters tall and all the people in the drug group to be one meter fifty, right? Then they are not independent, because there's a big difference in height between the two groups. That of course has nothing to do with relatedness, but independence is a very difficult concept and is something that you cannot always guarantee, and if you enforce independence, you take away part of the randomness. So it's a balance between these two things. If your normality assumption doesn't hold, then you have to do a different type of test. If your distribution is not a normal distribution, you have to do a non-parametric test. The definition of a non-parametric test is that it does not assume a certain distribution of your data, be it a Gaussian distribution or a Poisson distribution. These methods often rely on ranking your data: instead of using the real measurement values, they use the rank of the measurement values. Individuals get ranked from lowest to highest, you do that in both groups, and then you test whether the average rank differs between the groups. These are called rank-based methods. So non-parametric tests assume nothing about the distribution of the data. They might assume other things, right? The assumption of randomness is fundamental to both types of tests: no matter whether you do a parametric or a non-parametric test, if you did not sample randomly, neither test will be valid. So if you want a non-parametric test similar to the t-test, you can do, for example, the two-group Mann-Whitney U test. The two-group Mann-Whitney U test is just a non-parametric t-test, in a way.
Somewhat confusingly, the Mann-Whitney U test is actually called wilcox.test in R. You give it your measurements and a two-level factor A. So you would write it down as a formula: my measurements are determined by the group, y ~ A. And A is a factor with two levels, for example case versus control, or male versus female. That is one way to call it, but you can also use the x, y form, which is very similar to the t-test: you give it group number one and group number two as separate vectors. So you give the function two vectors. You can also do a paired test, right? If your data is paired, so you measured all people before the experiment and all people after the experiment, you can do the same thing: you provide the measurements of the first group, the measurements of the second group, and then you add the parameter paired = TRUE. Of course, now you have to make sure that the first element in vector x corresponds to the first element in vector y, because the function assumes that the ordering in x is the same as the ordering in y. So when you do a Wilcoxon test with paired = TRUE, make sure that the first measurement in x corresponds to the first measurement in y, the second to the second, and so on. So that is the non-parametric variant of the t-test. If you check your data with the Shapiro-Wilk test and it says the distribution is not normal, then you are forced to switch to the Mann-Whitney U test, which is called wilcox.test in R. What if you have more groups? The t-test only works on two groups, but you might have three groups: for example, people under the age of 20, people between 20 and 40, and people above 40. Then you can do a Kruskal-Wallis test. This is a non-parametric test, also called a one-way ANOVA by ranks. So it's kind of a linear model already.
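A minimal sketch of the three ways to call wilcox.test, using invented measurements:

```r
# Mann-Whitney U test: in R this is wilcox.test().
group1 <- c(12, 15, 11, 19, 14, 13)   # e.g. control measurements
group2 <- c(21, 25, 18, 24, 22, 20)   # e.g. case measurements

# Two-vector form, analogous to t.test(x, y):
wilcox.test(group1, group2)

# Formula form: measurements explained by a two-level factor.
y <- c(group1, group2)
A <- factor(rep(c("control", "case"), each = 6))
wilcox.test(y ~ A)

# Paired version: element i of the first vector must correspond
# to element i of the second vector (same individual).
wilcox.test(group2, group1, paired = TRUE)
```

The formula form and the two-vector form are the same test; they report the same two-sided p-value.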
It also uses the linear model way of writing things down. The null hypothesis here is that all groups are from the same population. So it won't tell you which group is different; it will just tell you that at least one of the groups is different, right? A Kruskal-Wallis test you do with kruskal.test, by taking your measurement values and a factor, and this factor can of course have many different levels, not just two, but also three or four. When this test is significant, you have to fall back on Mann-Whitney U tests to find out which group is the one that is different. So the Kruskal-Wallis test is your statistical test, and it has to be followed up by a post-hoc test. The post-hoc test for the Kruskal-Wallis test is often the Wilcoxon test, so the Mann-Whitney U test, right? Because the Kruskal-Wallis test only tells you that there is a difference, that one of the groups is different. To figure out which group it is, you have to do pairwise testing of every group versus every other group using the Mann-Whitney U test. All right. Now, what if we are dealing with a blocking factor? To get rid of noise or variance in your data, you can do two things: you can randomize, or you can block. When you use a blocking factor, for example, I have the potato yield of different types of potato plants. So A is my grouping factor, the same as the A before, right? I have different groups of potatoes, for example different varieties. And they have been measured across different fields. Field here is my blocking factor, right? Because I put potato type 1 in field number 1 and I put potato type 2 in field number 1. And then in field number 2 I put potato type 2 and potato type 3. And in field number 3 I put potato type 3 and potato type 1, right?
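As a sketch with invented measurements for the three age groups mentioned above, the Kruskal-Wallis test plus its post-hoc follow-up could look like this (pairwise.wilcox.test is the convenience function in base R for running all the pairwise Mann-Whitney U tests with correction):

```r
# Invented measurements across three age groups.
y <- c(10, 12, 11, 14,  20, 22, 19, 23,  30, 28, 31, 29)
g <- factor(rep(c("under20", "20to40", "over40"), each = 4))

kruskal.test(y ~ g)   # H0: all three groups come from the same population

# Post-hoc: pairwise Mann-Whitney U tests (with multiple-testing
# correction) to find out WHICH group differs.
pairwise.wilcox.test(y, g, p.adjust.method = "bonferroni")
```

With these clearly separated groups the Kruskal-Wallis p-value comes out well below 0.05, which is exactly the situation where the post-hoc pairwise tests are needed.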
So the idea is that by having the same type of potato in different fields, you get rid of some of the variance which might be introduced by the field as a kind of random factor. We then do a Friedman test, and the Friedman test allows us to compensate for, or to include, a blocking factor in our test statistic. So we can now say that y, our potato yield, is determined by the group to which the potatoes belong, for example the potato variety that we have planted, and B is then, for each potato, the field in which it was planted, right? So we can take one field, divide it into five sections, put group number 1 in section number 1, and so on; in this way we randomize out some of the variance instead of having all of the potatoes in one single field and then just trying to analyze the result. All right, so those are the kinds of tests that I wanted to discuss today. We will be discussing the real linear models in the coming weeks, and we will discuss linear models in great detail. We will not just look at linear models, but also at generalized linear models, and at linear mixed models, where you include random effects. But for now I just want to stop here and talk about some other kinds of measurements which are used a lot. One of them is correlation. I think everyone should know by now what correlation is: correlation is a measure of dependence between two variables. Is one variable dependent on the other variable? In R, the cor function computes correlation. We already used it, but I just wanted to mention it again. So the thing that people always say, right, is that correlation does not equal causation. If you find that two things are correlated, it does not mean that there is a causal connection between the two. However, I would say that finding a causal effect without finding correlation is almost impossible. So saying that correlation does not equal causation is true.
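A minimal sketch of the Friedman test with invented yields. One caveat, added here as an assumption about the R function rather than something from the lecture: friedman.test expects an unreplicated complete block design, meaning every group appears exactly once in every block, so this sketch plants every potato type in every field:

```r
# Hypothetical yields of 3 potato types, each planted once in 4 fields.
yield  <- c(5.2, 4.8, 5.5, 5.0,   6.1, 6.3, 6.0, 6.4,   4.0, 4.2, 3.9, 4.1)
potato <- factor(rep(c("type1", "type2", "type3"), each = 4))  # grouping factor A
field  <- factor(rep(c("f1", "f2", "f3", "f4"), times = 3))    # blocking factor B

# Formula notation y ~ group | block: field is the blocking factor.
friedman.test(yield ~ potato | field)
```

Within every field, type2 out-yields type1, which out-yields type3, so the test comes out significant here.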
But the opposite, causation implies correlation, is also almost always true, because it is very hard to have a causal system where the causality does not introduce some kind of correlation structure between the two variables that you are looking at. Of course, correlation itself is not causation. Why do we still bother with calculating correlations? Because correlations are very useful: they can indicate a predictive relationship that can be exploited in practice. Imagine that I know that stock prices go up when the weather is bad and stock prices go down when the weather is good. Then I know that there is a correlation between the temperature outside and the price of a certain stock on the stock market, for example ice cream stocks. There might or might not be a real causal relationship there, but if I know this, I can still invest my money smarter than without knowing it. So you can exploit correlations in practice, and they are exploited all over the place, because if you know that two things are correlated, then a prediction of one of the variables indirectly tells you what will happen to the other variable. That is why correlations are used so often: not because they imply causation, but because they describe a predictive relationship that can be exploited. So there are three types of correlation that R provides; I will only describe two of them. Pearson is the default in R. However, Pearson correlation is again parametric, so it assumes a normal distribution. It does not do any transformation on your data; it just uses the numbers that you have as they are. So it is fast, but it is sensitive to outliers. If your distribution is not a normal distribution, you should not be using Pearson correlation. Instead, when you don't have a normal distribution, you should go for Spearman correlation. Spearman correlation uses a rank-based transformation: again, instead of using the measurement values, we use the ranks of the measurements.
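A small invented illustration of that outlier sensitivity: one extreme value drags the Pearson correlation around, while the rank-based Spearman correlation is unaffected because the ordering is untouched:

```r
x <- 1:10
y <- c(1:9, 100)   # the last value is a huge outlier

cor(x, y)                        # Pearson (the default): pulled by the outlier
cor(x, y, method = "spearman")   # rank-based: the ranks match perfectly, r = 1
```

The ranks of y are simply 1 through 10 regardless of how extreme the last value is, which is why Spearman is the more robust choice for non-normal data.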
It is a little bit slower to compute, but it is more robust. And of course, correlation is not the be-all and end-all, because as you can see, a correlation of 1 can have many different shapes. A correlation of 1 can correspond to a very strong relationship, a relatively weak relationship, or a relationship where the second variable only gradually increases. All of these are distributions where you plot one variable on the x-axis and the other on the y-axis. And not all relationships are caught by correlation. For example, if your data, when you plot one variable versus the other, looks like a ring, then there is a definite relationship between these two variables, but the correlation will be zero. So correlation is a useful measure, but it will not catch a lot of the structures that you might be interested in, right? If you plot two variables against each other and it looks like this kind of O-ring, then you know that there is some kind of relationship between these two variables, but this relationship is not caught by correlation at all. Correlation just looks at whether an increase in one variable leads to an increase or a decrease in another variable of interest. That works really well, but there are situations where there is a relationship that correlation is not able to pick up. So, an example of correlation: how do I compute it? Imagine that I have ice cream sales and temperature, and when the temperature goes up, ice cream sales go up. How do I compute the correlation? Well, this is the formula for correlation: you have A and you have B. The first thing you do is calculate the mean of the temperature and the mean of the sales. Then, with your means in hand, you subtract the mean from each original measurement value.
So I take the temperature and subtract the mean of the temperature: I take 14.2 and subtract 18.7, giving minus 4.5, and I do this for all my measurements. I do the same thing for the sales, and this is my B component here in the formula. Then I calculate what happens when I multiply A with B, when I square A, so multiply A with A, and when I multiply B with B. So I calculate all three of these quantities, and then I sum them all up: I calculate the sum of A times B, the sum of A squared, and the sum of B squared across all of my measurements, and then I have my three numbers of interest. I have my sum of A squared, my sum of B squared, and my sum of A times B, and then I can just do the computation: I take the A times B component and divide it by the square root of the A squared component times the B squared component, and this gives me my correlation. In this case, the correlation in the small example of ice cream sales versus temperature is 0.95. You don't have to remember the formula, but I do want you to understand what's happening in it. For example, on the exam I could give you the formula and then give you temperatures and sales, and then you should be able, without a calculator, to calculate the correlation coefficient. I'm not saying that I'm going to do that on the exam; I'm just saying that I want to keep it as an option, but I don't want you to memorize the formula by heart. If you need to calculate something during the exam, the formula will be given, and you just need to be able to interpret it. So we have the summation, we have the square root, and you have to see that here we compute xi minus x-bar. That means that you calculate the mean of a group and then subtract the mean from each group member.
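As a hedged sketch, here is the whole calculation in R. The slide's full data set is not shown here, so these temperature and sales numbers are invented, chosen so that, like on the slide, the mean temperature is about 18.7 and the first deviation is about minus 4.5:

```r
# Invented ice-cream example: 12 days of temperature and sales.
temp  <- c(14.2, 16.4, 11.9, 15.2, 18.5, 22.1,
           19.4, 25.1, 23.4, 18.1, 22.6, 17.2)
sales <- c(215, 325, 185, 332, 406, 522,
           412, 614, 544, 421, 445, 408)

a <- temp  - mean(temp)    # subtract the mean temperature from each value
b <- sales - mean(sales)   # subtract the mean sales from each value

# r = sum(a*b) / sqrt(sum(a^2) * sum(b^2))
r <- sum(a * b) / sqrt(sum(a^2) * sum(b^2))
r

cor(temp, sales)   # the built-in function gives the same number
```

The hand computation and cor() agree exactly, which is the point of the exercise: the formula is just sums of products of mean-centered values.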
That's what this means, and that's why the summation from i = 1 to n is there. So you sum them all up: that is the summing up of the whole column when you multiply A times B, or when you multiply A times A. I hope that's clear and not too hard, right? It's something that you could easily do, and you don't really need a calculator for it. If anything, I will choose the numbers in such a way that you can just add them up in your head without having to use a calculator. Alright, so Pearson correlation, from the slide before, right? There is no transformation; the variables have to be linearly related for r to be exactly one. It is fast, but it's very sensitive to outliers. Spearman correlation is a rank-based transformation, which means that the variables do not have to be linearly related; they have to be monotonically related. Monotonic means that an increase in one goes together with an increase in the other. It's slower but more robust. To show you the difference between a linearly related variable and a monotonically related variable: here are two variables x and y, and what we see is that every time x increases, y also increases. So if we take a step in x, we also take a step up in y. These two are monotonically related, which means that the Spearman correlation in this case will be one. They are not exactly linearly related, because a step from 0 to 0.2 in x is an increase from minus 12 to about minus 2.5 in y, while an increase from 0.2 to 0.4 gives a much smaller increase. So linear relatedness means that an equal step in one causes an equal step in the other, while monotonically related just means that increasing one will increase the other, but the size of the increase doesn't have to be the same. That is the difference between Spearman correlation and Pearson correlation. All right, so we talked a lot about statistical tests, and we talked about correlation a little bit.
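A small sketch of this difference, with invented x values and a logarithmic y, so the relationship is strictly increasing (big jumps first, then smaller ones) but not a straight line:

```r
x <- seq(0.2, 2, by = 0.2)
y <- log(x)   # strictly increasing, but the step size keeps shrinking

cor(x, y)                        # Pearson: clearly below 1
cor(x, y, method = "spearman")   # Spearman: exactly 1, the ordering matches
```

Spearman only looks at the ranks, and the ranks of y line up perfectly with the ranks of x, so it reports a perfect monotonic relationship even though the linear fit is imperfect.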
So now let's go back to the microarrays. We want to test our gene expression data, the measurements that we got for all of the genes in the genome, and we now want to say whether, for example, there is a significant difference. Imagine that we have gene A measured in gonadal fat but also in hypothalamus. Now we can ask the question: is gene A significantly differentially expressed in gonadal fat compared to hypothalamus? We always agree that our p-value should be lower than 0.05, because we agree that an observed difference in phenotype is significant if it happens at most one in 20 times by chance. That is our convention in biology. However, what do we accept as significant when we are performing many of these tests? Commonly we test something like 20,000 genes. As the number of comparisons increases, it becomes more and more likely that the groups being compared will appear to differ in at least one attribute just by chance. If I test a single gene, then with a threshold of 0.05 I have a 5% chance of being wrong. But if I test 100 genes, then for each of these genes I have a 5% chance of being wrong. So after doing 100 tests, the tests will have told me about five times that there is a difference while there actually isn't, because that is just the error rate we accepted. So what can we do to preserve this one-in-20 threshold, or alpha level of 0.05? We need to compensate for the number of tests that we perform. The simplest correction you can come up with is the Bonferroni correction, which just says that instead of my p-value being lower than 0.05, my p-value now needs to be lower than 0.05 divided by the number of tests that I did. For a normal microarray experiment, that means that the p-value needs to be smaller than 2.5 times 10 to the minus 6.
That's just 0.05 divided by 20,000. And of course this is much more stringent, because with a threshold of 2.5 times 10 to the minus 6, only about once in 400,000 tests will we make an error, that is, say that something is significantly different while it actually is not. So this is called multiple testing. Microarrays measure the expression of 20,000 genes, so we do literally thousands of statistical tests, especially if we have more than two groups. So there's a chance that we make a type I error. A type I error is a false positive: we call a gene significantly changed, but it was just by chance. We can control this using the Bonferroni correction, right? We just take 0.05 divided by 20,000 instead of 0.05. But there is also the type II error, a false negative, which means that we miss a gene that really is significantly different, but we say that it is not. And we can minimize this type II error by using the Benjamini-Hochberg false discovery rate. So how do we do this in R? Imagine that I have my p-value calculated and I want to adjust it. I can use the p.adjust function. So I say: adjust the p-value that I have obtained, using the Bonferroni method, and I did 10 tests in total, right? It then corrects the p-value to what it should be, given that I did 10 tests instead of one. So 10 is the number of tests that were performed on the genes we measured, and adjusted p-values below 0.05 are considered significant. What this will do is multiply your current p-value by 10, so it will go from 0.0015 to 0.015. And that means that we can now use our standard 0.05 threshold again, right? It just adjusts the p-value.
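A minimal sketch of the p.adjust call described here, using the lecture's 0.0015 value plus a small invented vector of p-values:

```r
# Adjust a single p-value with Bonferroni, for 10 tests in total:
p.adjust(0.0015, method = "bonferroni", n = 10)   # 0.0015 * 10 = 0.015

# More typically you adjust a whole vector, e.g. one p-value per gene:
pvals <- c(0.0001, 0.0015, 0.02, 0.04, 0.3)       # invented p-values
p.adjust(pvals, method = "bonferroni")
p.adjust(pvals, method = "BH")   # Benjamini-Hochberg false discovery rate
```

The BH-adjusted values are never larger than the Bonferroni-adjusted ones, which is exactly why BH rescues some of the false negatives that Bonferroni throws away.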
Of course, instead of Bonferroni, we can also use Benjamini-Hochberg, which is called "BH", and then it will use the Benjamini-Hochberg correction procedure. This will not minimize the number of false positives as aggressively, but it will minimize the number of false negatives. All right. So that's more or less it for today. If you want to get some free microarray data to do some statistical tests on, to get a matrix of measurements, for example for healthy lung tissue and lung cancer tissue, and you want to start looking into which genes might be involved in developing lung cancer, then there are two major databases available. The first one is the Gene Expression Omnibus, which is run by NCBI. You can just type "gene expression omnibus" into Google and it will take you there. It contains around 25,000 experiments, and they've stored around 600,000 microarrays. They only provide storage and retrieval; there is no quality control on this data. So everyone can upload their microarray data there, which is perfectly fine. But of course, what some people call x, other people might call y. You might have a microarray done on similar tissue, but some people call this gonadal fat tissue and other people will call it white adipose tissue. So there are differences in naming, and that is not harmonized in the Gene Expression Omnibus. But they provide a massive amount of microarray data. Besides that, you also have ArrayExpress, which is run by the European Bioinformatics Institute. They have around 24,000 experiments stored, with around 700,000 arrays. But the nice thing is that they have something called the Gene Expression Atlas: around 5,500 experiments of curated, re-annotated archive data. So they have their whole data set available for you to download, to look at, and to play with.
But they also have a core subset of this data which they manually went through to make sure that lung cancer tissue really is lung cancer tissue, and that if a microarray was done on a certain type of mouse, it was really done on that type of mouse. So they spent a lot of money and a lot of man-hours going through their data, re-annotating and curating it, to make sure that everything is correct and that there are no missing annotations and such. So the Gene Expression Atlas is probably one of the most valuable free data sets that you can get to work with in genetics or bioinformatics. And there are a lot of high-scoring publications still hidden in this data. Just by going through that data, downloading different data sets and comparing them, you can still make novel findings which people have never found before, or which people have not published yet. And the nice thing about the EBI itself is that they do storage and retrieval, but there are also basic analysis tools which you can run more or less online, to see, for example, the differences between biological conditions or the differences across experiments. This is really useful because beforehand you can get an idea of how good the microarray data is, and whether the gene you want is on the array. So you don't have to download gigabytes of data before you can play with it; you can already play with it on the website of the EBI. But again, the Gene Expression Atlas is a very valuable resource, and there are probably still some Nature and Science papers hidden in this data, because it's literally so much data that no one actually takes the time to go through all of this old data. But there are some really good gems still hidden in there. All right, so that's it for me for today. We finished 20 minutes early, so I could have taken a little bit more time. If there are any questions, just throw them in the chat.
I will be here for a couple more minutes, so just AMA, ask me anything, questions about the statistics that we discussed. I also had two more things written down. I got a mail from Super David; no idea if Super David is here. But I got a mail that he wanted, let me look it up, a lecture on... loading, loading, loading. So I already offered you guys a couple of times that if you have a nice data set that you want me to make a lecture about, or a very specific topic that you're interested in, you should just send me an email. So he actually sent me an email and made some topic proposals for a lecture that we can have, and I just want to mention them: things like iteration methods for determining sample sizes, using a curriculum package for experimental design, for example how to analyze Latin square or other types of designs, tools for experimental design, and perhaps a topic about AI or TensorFlow would be interesting. So again, if you have a very specific topic that you're interested in and think, oh, that's a good topic for a three-hour lecture, or I really want to know more about this, then definitely let me know. Because I generally have two lectures at the end which are up to you guys. So you can say, well, I'm interested in this. And one of them is already taken, because one student already sent me a data set about fish; I'd have to look that up, and I'm not going to do that now. He wanted me to look at the data set, and he was perfectly fine with me using it as an example data set. So that already takes up one lecture. But if you still have ideas, and of course things might or might not overlap with Super David's email, then just let me know. So hey, you're more than welcome to say, well, I really want to learn something about machine learning, or I'm really interested in random forests: how do they work, and how can we use them?
Or: I'm working on bacteria and I need to do predictions, or polycistronic versus monocistronic transcription; it doesn't matter to me too much what it is. As long as I have some knowledge in the field, I'm more than willing to spend a couple of hours to make a nice presentation for you guys. And the more ideas you give me, the more interesting it will become, because of course, if five people send in the same idea, then it's much more likely that I will do a topic on that. So there are already like four ideas in his email. But if you have an idea, like I want to know more about x or y, then definitely send it to me; I will just collect all of them, and at a certain point I will decide: okay, most people want to know about machine learning, or they want to learn about artificial intelligence, and then we will have a topic about that. So think about what kind of master project you are going to do, or: I'm now in my PhD and I have a PhD data set, and I want to apply this very specific method that I want to learn more about. Then let me know, just send me an email, throw it in the chat, and let me know what you guys want to see. All right. So, no questions in the last hour. Are we all still awake? Just do a whoop-whoop in the chat. Still awake. Everyone's awake. Everyone's asleep. Even my moderator fell asleep. Moderator, where are you? See, everyone's sleeping. Ah, Skrita's still awake. That's good. Okay, my moderator is also still here. Oh, whoop, whoop. Awake. Okay, good, good, good. So everyone's still here. Everyone understood. So if on the exam I ask you what the difference is between likelihood and probability, you can write that down for me. And if I ask you to calculate a correlation and give you the formula... some more people awake? Probably, probably, probably. That sounds like a kind of statistical answer.
There's like a 5% likelihood that you will be able to write down the correct answer. Yeah, you didn't pay it. Of course, you're my moderator; you're just here to kick people from the chat that are annoying. You don't have to do the exam. You get your Leistungspunkte for free if you want to; I can just write you a Leistungsschein. It's already the second time that you've been attending the lecture, so that's pretty nice. And a scary looking, you don't say, Privatia 1995, that's a scary looking emoticon. So we also have the Denny one, right? So the pandemic, me. All right, so for the guys watching on Moodle, I will.