Testing, one, two, three. Can people hear me? I hope so; if not, that's an issue. Data analysis using R, lecture number six. We're going at a very good speed, which is the normal speed. Good. Still two minutes to go, so I'll just do my radio thing and say hello to everyone. Hello Nicolaus, hello Misha, hello Davide, hello my moderator. Today we have something special: during the first break we have gifts of my own cat. So not random gifts from the internet, but unique gifts from my own kitty cat. Why the ghost? Oh yes, the ghost will come back; that's going to be a surprise, but it is all related to the lecture. You might think that ghosts are not related to doing statistical tests, but there is actually a reason for the ghosts. I hope Han Solo is informed. I did not inform him, and I don't have the time now to jump into the chat and tell him, but he might drop by, and if not I will just tell him later. Han Solo? No, not from Star Wars; it's one of the students from a previous course who has been instrumental in promoting the course among his fellow students over the last couple of years. All right, let me switch to the lecture layout so that people can see me as well, because it's two o'clock. So welcome, welcome everyone, and thank you again for being here; this would not be possible without you. Today we will talk about statistical testing, but as always I first want to welcome you. I always think it's amazing that people take time out of their day to watch me, not just people who follow the course, but people from everywhere. I've been doing a little bit of promotion on Facebook, telling people there is this free R course you can join and an old R course you can watch, and it's actually been going very well. Statistics-wise we have improved, and we actually reached 1,000 subscribers.
And I never, ever in a million years thought that would happen. I had never expected to end up with a thousand people clicking the subscribe button and liking the videos, so I'm really, really honored by that. You befriended a lot of spammers? Yes, yes, and I am very, very sorry for that. I'm sorry to all my real Facebook friends who might get spammed by people from India, Bangladesh, Nepal and so on, though I did have some good interactions with some of the, let's call them spammers. But yes, I'm very sorry for that. So with that out of the way: like I said, everything that you see here will come back, except for the standard little figure, although the ghost will come back in some way. I also just wanted to explain why we don't have a little sun but a moon: we are going to use microarrays as an example, and a microarray is like a fishing expedition, so I drew the moon and a little guy fishing with a piece of DNA. That's my preferred way of visualizing how microarrays work. But with that out of the way, let's start with the first slide, which is of course a look at my answers for lecture number five. I hope everyone was able to do the assignments. The assignments this time were tough, not so much because of the plotting, but because there's a little bit of genetics knowledge involved as well. If you weren't able to do them, I will do them live now, since every time I ask whether I should just show my answers and go through them or do them live, people ask me to do it live. I'm perfectly fine with that, and it adds a little bit to the stream as well, because people can see me typing. For that I actually have to move my drawing board somewhere else so that I have a little bit of space and can reach my keyboard.
So we'll do the answers. Does this work? Yes, this seems to work, which means you should be seeing Notepad now. For the assignments there were three data files that people had to load in. I already prepared a little bit: I added a header to the file, which you should always do. Write down what the file is going to contain, and write down who you are, so that people know who the author is. And then, of course, every script that I write starts with setting the working directory, just so that other people who use it know they have to change the working directory. And if I use libraries, I put the library calls first, so that people know: I need to have these libraries installed in R, and then I have to set my working directory somewhere where I have read and write access so that I can load in files. Good, so the assignments were about plotting. The data in genotypes.txt is a subset of the data we collected using the MegaMUGA mouse genotyping array. For the different advanced intercross line individuals, the weights were measured at several days, which can be found in the phenotypes.txt file. The genetic map of the markers used is found in map.txt. Then there's a notice saying that this data is from a current analysis and is not public yet; that notice is actually outdated, because the data has been published now. So the first question is to load in the data. Of course we have three files, so I'm just going to write read.table three times, duplicating the line. The first file is called genotypes.txt, the second phenotypes.txt, and the third map.txt. I'm going to store these in variables called geno, pheno, and map. All right, let's save it, copy, and go to R.
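The loading step described here can be sketched like this; tiny stand-in files are written to a temporary folder so the snippet runs on its own (the real files live in the assignment's data folder):

```r
# Minimal sketch of the loading step. The file contents below are made-up
# one-line stand-ins; in the real script you setwd() to your own folder
# where genotypes.txt, phenotypes.txt and map.txt are stored.
dir <- tempdir()
writeLines("m1\tA\tB",        file.path(dir, "genotypes.txt"))
writeLines("ind1\t10.2\t12.5", file.path(dir, "phenotypes.txt"))
writeLines("m1\t1\t3.2",      file.path(dir, "map.txt"))

setwd(dir)                         # in the real script: your own working directory
geno  <- read.table("genotypes.txt")
pheno <- read.table("phenotypes.txt")
map   <- read.table("map.txt")
```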
Let me see, because the assignments are on top of my streaming window, which makes it a little bit difficult; for some reason the R window is not on top. All right. Oh, that's something else, something else I still had open. So: no such file. Let me check whether this is actually the proper working directory. Right, I actually forgot to add the assignment5.data folder to the path. So let's go back to Notepad, add that to the path where the files are stored, copy it, go back to R, and paste it in. All right, so now we look at geno. Let's just ask for the dimensions first: 3000, which seems OK-ish. Let me actually check the file to see whether the headers and the row names are loaded correctly. Seems OK-ish, but again there's this X in front. Because R likes column names to be proper variable names, it prefixes numeric names with an X. If we look at the pheno file, you see that the phenotype file's names don't start with an X, because those are row names. Row names do not get fixed, but column names do get converted into proper variable names. We have to make sure that doesn't happen, so for geno we can say check.names = FALSE: just read in the names as they are. Then we go back to R, load it in, go up through the history, and take the code that we already had. So now the names here are proper: they are exactly as they are in the file, and they match between the genotypes and the phenotypes. Let me actually look at the phenotype file again: yes, the names seem to be matching. And of course we also want to check the map file, so let's look at the first ten lines of that. That seems to be proper as well.
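The check.names behaviour can be demonstrated with a tiny in-memory file; the numeric column names 666 and 777 here are made up for illustration:

```r
# Demonstration of check.names: column names that are not valid R variable
# names (here, purely numeric ones) get an X prefix unless you opt out.
tmp <- tempfile()
writeLines(c("\t666\t777",     # header row: one field fewer -> first column = row names
             "m1\tA\tB",
             "m2\tB\tA"), tmp)

fixed <- read.table(tmp, header = TRUE)                      # default: names "fixed"
asis  <- read.table(tmp, header = TRUE, check.names = FALSE) # names read as-is

colnames(fixed)   # "X666" "X777"
colnames(asis)    # "666"  "777"
```

With check.names = FALSE the column names match the genotype/phenotype files exactly, which is what makes the name-based lookups later on work.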
We can see that the map is unordered: chromosome 1, 7, 4, 14, and so on. So we might need to sort this as well, using the order function. But for now let's continue with question number two, because question number one was to load in the data, and we did that: the data is loaded in and seems to be kind of okay. Let me check my old answers to see if I'm forgetting anything. No, all seems fine. Alrighty, next step: basic curves. Let's analyze some of the data, starting with the phenotypes. The columns named d followed by a number stand for the day of the weight measurement. The column WG2 is the litter size; it's called WG because the German word for litter size is Wurfgröße. The 2 stands for the litter size after two days, since in the first couple of days newborn mice tend to die or get eaten by the mother, which is a very common phenomenon. If you breed mice and you have a litter, the litter tends to be around 12, and when you look a couple of days later there are only eight little pups left, because the mother tends to eat the weakest ones. That's perfectly normal; it's just natural selection. Be sure to set up the drawing area of the different plots beforehand using plot, then add the data using the points function. Question 2a is: create a global growth curve showing the mean weight for all animals combined; try out different plotting types and select the one that you like best. Because that's always the case with plots: you might like your plots in a different way than I like mine. But first things first, I want to get the different measurement days from the phenotypes. If we go to R, we can actually see that we have measurements from day 21 all the way to day 70, so let's deal with that first.
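A small sketch of that ordering step; the column names chr and pos are assumptions, since the real layout of map.txt isn't shown here (note it's order(), which returns a permutation of row indices, that you want for reordering a data frame, not sort()):

```r
# Made-up map with unordered chromosomes, mimicking the situation described.
map <- data.frame(chr = c(1, 7, 4, 14, 1),
                  pos = c(10, 5, 3, 8, 2),
                  row.names = paste0("m", 1:5))

# Reorder rows by chromosome, then by position within chromosome.
map.sorted <- map[order(map$chr, map$pos), ]
map.sorted$chr   # 1 1 4 7 14
```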
So I'm first going to define a variable, and this variable is going to be days: the days at which we measured. This is just going to be paste0 of the letter d and a sequence, because we measure them every week, starting at 21 days and going all the way to 70 days, so every seven days. If I do this and go to R, we now have days, and days is d21, d28, and so forth. And we can now use days to index the phenotypes. If I say pheno, give me for the first 10 individuals the days at which we did the measurements, then we see that we now have a numeric matrix: we get rid of the WG2 column, the Wurfgröße column, and any other columns that we might have measured. That's good. All right, so now we want to do some plots. The first one is easy, because we just want the weights, and then we want to calculate the mean, of course for each of the different columns. Let me just go to Notepad. So what do we want to do? I'm going to say onlyweights is what I just did: pheno with only the day columns. And then I'm just going to say plot onlyweights, but that would try to plot the whole matrix, which it cannot do. I want the means, so I'm going to apply to onlyweights, over the columns, the function mean, and I'm going to directly say na.rm = TRUE: if a mouse doesn't have a measurement, it should not be put into the mean. Something like this. So if we go to R and copy-paste this in, then we hope to see a plot, and this looks like a pretty standard plot. There are a few things wrong with it, though: the x-axis is labeled Index, so 1, 2, 3, 4, while this is actually day 21, day 28, and so on. So we have to fix the x-axis.
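The days vector and the column-wise means can be sketched like this, with a simulated weight matrix standing in for the real pheno data (the name onlyweights and the d21..d70 column names follow the lecture; all numbers are made up):

```r
# Build the measurement-day column names: "d21" "d28" ... "d70".
days <- paste0("d", seq(21, 70, 7))

# Simulated stand-in for pheno[, days]: 50 animals, 8 weekly weighings,
# with the mean weight rising from ~12 g to ~40 g over the experiment.
set.seed(1)
onlyweights <- matrix(rnorm(50 * 8,
                            mean = rep(seq(12, 40, length.out = 8), each = 50),
                            sd = 3),
                      nrow = 50, dimnames = list(NULL, days))

# Column means, skipping missing measurements.
means <- apply(onlyweights, 2, mean, na.rm = TRUE)
length(means)   # 8, one mean per measurement day
```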
And here we see the body weights of the mice: a normal growth curve. They grow very fast at the beginning, and the older they get, the slower they grow. Generally this is a standard growth curve. So let's go to Notepad and try to fix our plot a little bit. I'll say xaxt = "n", so don't draw an x-axis. The y-label is going to be weight, in grams, and for the x-label I want to say that this is time in days. That should already look a little better. Then I'm going to add my own axis: axis on side 1, so at the bottom, at positions 1 to 8, and what do I want to plot there? The days. All right, let's go back to R and improve the plot a little. Now we have the days on the axis. The other thing I want to fix is that these axis numbers are rotated; I want to have them the normal way. We can do that using the las parameter, so when we go back to Notepad, let me put this on two lines to keep it a little shorter, and I'm going to set las, the rotation of the axis labels, to 2. That should be fine. And I want to give it a main title as well: main = "Mean body weight". Let's also try out a different plot type: type = "b", so both. Show me the circles, but also draw a line through them. All right, let's go to R and do the plot. Here we see more or less the plot that we expect: mice grow very quickly in the beginning. Oh, let me press enter for the axis call. And now we have a plot which looks reasonably okay. From this we can learn that at day 70 the average mouse is around 40 grams, and at day 21 it's slightly above 10 grams.
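The cleaned-up plot call described here might look like this, using stand-in means rather than the real data:

```r
# Sketch of the improved plot: suppress the default x-axis, label the axes,
# rotate tick labels with las = 2, and use type = "b" (points plus lines).
days  <- paste0("d", seq(21, 70, 7))
means <- seq(12, 40, length.out = 8)    # made-up stand-in for the column means

plot(means, type = "b", xaxt = "n", las = 2,
     xlab = "Time (days)", ylab = "Weight (gram)",
     main = "Mean body weight")
axis(1, at = 1:8, labels = days, las = 2)   # our own x-axis with the day names
```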
And it grows very quickly at first, and then the growth starts subsiding. Good, so that was question 2a. Let me go back to Notepad and label it question 2a, and then we have question 2b: add to this plot lines showing the variance or standard deviation at each measurement day, or error bars that show the same thing; see the sketches below. Don't forget to add a legend to describe the colors, and add good figure axes and a main title. So just make the plot a little better: because we have all of these measurements at each of the days, we can now start computing things like standard deviations. Let's just decide that instead of the variance or the standard error, we're going to add the standard deviation. First off, I'm going to take the means out of the plot call, because I don't want the apply and so on inside the plot function. I'm going to calculate the means and put them in their own variable, and we can actually already put this in 2a; then here I can just remove this part and say plot the means. For 2b, I'm just going to duplicate this line. Oh, I just changed my keyboard layout because I pressed two keys at once. So: Ctrl+D to duplicate, label it question 2b, and calculate the sds, the standard deviations. Instead of applying the mean function to the columns of the weights, I'm just going to apply the standard deviation function, sd. All right, let's go to R and throw this in so it does the computation. Now, I still didn't fix that little axis error, so let me do that as well. If we look at the means, we see the mean body weights, and if we type sds, these are the standard deviations. So now we can do two things, because we can make a really nice plot ourselves and add all of the lines that we want, and I showed you two different ways of doing this.
So in the assignments there are two little graphs showing the two options, and let's just make the first graph first: the one with the error bars. The error bars are actually a little bit sneaky, because we can use a little trick here. First, we want to set up our own plot window. I'm going to say plot of c(1, length(means)) on the x: so the x-axis goes from 1 to the number of means that we have. For the y, we saw that the weights run from around 10 grams up to 40 grams; if we go back to R, we can see the maximum mean is 39.8, with a standard deviation of 6.6. So if we extend the y-axis to around 45 or 47, the error bars should fit. Let's say the y goes from 5 at the lowest to 47 at the top. The main title is again mean body weight plus/minus standard deviation. I'm going to say type = "n" because I want to make an empty plot window, and I'm going to directly disable the x-axis as well, because I want to draw the x-axis myself. I'll add the axis directly in a second step like we did before, and I'm also going to copy the labels for the y-axis and the x-axis, and flip the tick numbers as well. So let's just copy-paste this: we now have more or less the same command, but now I'm setting up my own limits. The x runs from 1 to the number of means that we have, and the y-axis goes from 5 to 47. Why 5 to 47? I could have done this a little differently as well, but this is perfectly fine for now. All right, let's go to R, and now we have an empty plot.
So now the only thing we have to do is draw three lines to create the plot: draw a line, and then at every point add the standard deviation into the plot. Let's do that. I'm first going to say points and draw the means with type = "l", because I want a line plot, and let's see how that looks. We go back to R and do the same plot again: this is what we saw before, just the growth curve. Now I want to add the error bars. The error bars use a little trick, and I hope you figured it out, because if you had gone through most of the options for plotting, you would have seen that there is a type "h". So let me go to R and actually make this plot using type = "h": now you see that it uses these vertical bars, which is what h (histogram-like) stands for. Instead of drawing a line, which is type "l", you can use type "h", and it will draw a bar from zero up to each value. So to make error bars, we can say: I'm going to draw the means as a line, then plot points at the means plus the standard deviations with type = "h" and make them, for example, orange, because I like orange. And then I'm going to plot points at the means minus the standard deviations, also as type "h", but make them white, the same color as the background. Now let me show you what happens: all of a sudden I have error bars, because it draws an orange line from zero all the way up to the mean plus the standard deviation, and then plots a white line on top of it from zero up to the mean minus the standard deviation. It's kind of tricky, because it's over-plotting, but now we see the error bars.
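The over-plotting trick can be sketched like this (simulated means and sds; the colors follow the lecture):

```r
# type = "h" draws a vertical bar from 0 up to each value, so plotting
# mean + sd in orange and then painting over mean - sd in white leaves
# only the error-bar segment visible.
means <- seq(12, 40, length.out = 8)    # made-up stand-in values
sds   <- rep(3, 8)

plot(c(1, length(means)), c(5, 47), type = "n", xaxt = "n",
     xlab = "Time (days)", ylab = "Weight (gram)",
     main = "Mean body weight +/- SD")
points(means + sds, type = "h", col = "orange")  # bars up to mean + sd
points(means - sds, type = "h", col = "white")   # paint over up to mean - sd
points(means, type = "l")                        # the growth curve itself
axis(1, at = 1:8, labels = paste0("d", seq(21, 70, 7)), las = 2)
```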
Of course, we could have made the second type of plot as well, using more or less the same structure. We could have said: points at the mean plus the standard deviation, color for example blue, and instead of type "h" we draw a line again, and the same thing on the minus side, also as a blue line. And now when we go to R and copy-paste this in, we get this kind of funnel curve around the mean. Most of the measurements should fall between these lines, because it's the mean plus and minus a single standard deviation: for normally distributed data, roughly 68% of the measurements should lie between these curves. All right, so that's how you make these types of plots. It's a bit of a trick, because R uses the painter's model: if I draw a line and then draw another line on top of it using the background color, the new one goes on top. So it's the painter's model which we're kind of abusing here, but it helps us draw these standard deviations. And of course I could just disable the plotting of the orange bars, and then the plot would look exactly like the one I asked you for in the assignment. Good. Question 2c: create a new plot that shows the median at each measurement day and, in light gray, all the measurements of the different individuals. I didn't give a little example for this one, but in this case we want to show the medians, so we're going to do the exact same thing as before. I'm just going to duplicate this line, drag it down, add a little bit of extra space for you, and write: question 2c. Instead of means I'm going to call these medians, and I'm going to apply the function median to the columns of onlyweights. So there are my medians, one for every measurement day.
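The funnel variant described at the start of this passage, again with stand-in values:

```r
# Funnel plot: the +/- 1 SD band drawn as blue lines around the mean,
# instead of the h-type bar trick. Values are made up for illustration.
means <- seq(12, 40, length.out = 8)
sds   <- rep(3, 8)

plot(means, type = "l", ylim = c(5, 47),
     xlab = "Time (days)", ylab = "Weight (gram)",
     main = "Mean body weight +/- SD")
points(means + sds, type = "l", col = "blue")   # upper edge of the funnel
points(means - sds, type = "l", col = "blue")   # lower edge of the funnel
```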
So I'm just going to do the exact same thing that I did before: set up the plot myself, in this case with x from 1 to the length of the medians, and plot the medians as a line with lwd = 2, a little bit fatter, because I want to plot all of the other individuals behind it. All right, so how does this look? Let's plot the mean as well, so that you can see the difference between the mean and the median; that's also going to be lwd = 2, color red. Wait, I'm forgetting something: I didn't add the legend. I hope people understand that you can just use the legend function to add a legend, but let me quickly do that on this plot. I can say: give me a legend, and the legend should be in the top left, because the curve rises to the right, so I have some space in the top left to plot a legend. I'm going to say: I have the mean, the mean plus standard deviation, the mean minus standard deviation, and the plot also contained the standard deviations themselves. Then I have to add the colors: I think the mean was black, the lines surrounding it were blue, there was another blue line, and the last one, the standard deviation bars, was orange. Then we close the whole call and give it a little bit of layout. So: the mean is black, the mean plus standard deviation blue, the mean minus standard deviation also blue, and the standard deviation itself orange. Just to show you that you can easily add a legend to the R plot window. Of course, I also have to say lwd = 1: if I go to R, I have to tell the legend that I want line symbols. I think this should work, but now it's over-plotting, because of the painter's model again, so I just have to remake the whole plot.
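A minimal sketch of that legend call (colors and labels as described in the lecture; the plotted curve is a stand-in):

```r
# legend() with lwd = 1 draws line symbols next to each label,
# matching the line colors used in the plot.
plot(seq(12, 40, length.out = 8), type = "l", ylim = c(5, 47),
     xlab = "Time (days)", ylab = "Weight (gram)")
legend("topleft",
       legend = c("mean", "mean + sd", "mean - sd", "sd"),
       col    = c("black", "blue", "blue", "orange"),
       lwd    = 1)
```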
So now you can see the legend annotating all of the different lines. It's not the most beautiful plot in the world, but it's functional, and that's what we're first aiming for. All right, so let's go to the medians. The medians are more or less the same, so let me show you: just compute the medians and make the exact same plot. You can see that at the beginning the mean and the median are exactly the same, and that continues for more or less the whole curve. That means that the body weights of this whole population of mice are probably relatively normally distributed, because in a normal distribution the mean equals the median equals the mode. If these lines were very different from each other, I would probably have concluded that it might not be a normal distribution, or that there might be some outliers in there. But since we see this, we learn something: body weight is probably normally distributed. Of course, I should check this using box plots or violin plots, but in this case I'm not going to put the effort in; I just wanted to show you that they are more or less similar. All right, so the idea now is to plot all of the individuals. I'm going to say: for x in 1 to the number of rows of onlyweights, because that's where we have our data, plot the individual as a line, lwd = 1, color light gray. And I don't want to plot the medians inside the loop; I just want to plot row x of onlyweights, the individual body weight measurements, for each individual. Then at the end, I'm going to plot the median on top of it with lwd = 2, so it's a little bit stronger.
And I hope that that's visible. All right, let's copy it, go to R, and see how that looks. For some reason it didn't draw anything. So what is wrong with onlyweights[x, ]? It's probably not a numeric vector, or light gray is not visible. Let me just print the values: like this, and it doesn't plot anything. So I'm just going to wrap it in as.numeric, just to make sure, and now I can see that yes, with as.numeric it works. R did not interpret the values as a plain numeric vector, and that's because a single row taken from this data is actually a little one-row structure, not a vector: you can see it still has column names and a row name. And if you try to plot that, R cannot do it; for some reason it doesn't error out either. So let's go back to Notepad and add as.numeric, to make sure that every time we plot a row we really convert it to numeric values. Back to R, we plot those, and now we see something like this: a massive spread. These are all the individual growth curves. You can see here a mouse which at day 35 probably became sick or something and didn't grow much; you actually see that some of the mice lose weight at the end of the experiment. And we also see that the maximum range of our plot is way too small, because there are mice which at day 49 are already heavier than 47 grams. So let's go back and update the plot window to go up to 65 grams, to see if all of the values fit in. We go to R, make the plot again, and now we see all of the individual growth curves, with the median growth curve on top. Of course, we can do the same little trick again.
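The loop over individuals might look like this with simulated data; note that onlyweights here is a plain matrix, where a row already is a numeric vector, so as.numeric is shown as the safety net it provides when the data is a data frame instead:

```r
# Plot every individual's growth curve in light gray, then the median on top.
# as.numeric() strips the row/column-name structure that points() cannot
# handle when the row comes from a data frame.
set.seed(2)
onlyweights <- matrix(rnorm(50 * 8,
                            rep(seq(12, 40, length.out = 8), each = 50), 4),
                      nrow = 50)
medians <- apply(onlyweights, 2, median, na.rm = TRUE)

plot(c(1, 8), c(5, 65), type = "n",
     xlab = "Time (days)", ylab = "Weight (gram)",
     main = "Individual growth curves")
for (x in 1:nrow(onlyweights)) {
  points(as.numeric(onlyweights[x, ]), type = "l", lwd = 1, col = "lightgray")
}
points(medians, type = "l", lwd = 2)    # median curve on top, a bit fatter
```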
And so we can just add, and I'll show you this one too, a duplication of this: in this case I could say show me the medians plus the standard deviation and the medians minus the standard deviation, one drawn in red and the other painted over in white. So we're just adding the standard deviations around the median values, and if we now go to R, the plot looks like this: you can see that indeed most of the measured individuals fall within one standard deviation, some individuals are much, much bigger, and some individuals are smaller or start losing weight. Good, so we learn something again; we are just looking at our data, visualizing it, trying to get a grasp of what's going on in the body weight measurements that we have. All right, but that was question 2c: we've created a plot that shows the median at each measurement day and, in light gray, all the measurements of the different individuals. Every individual is in this plot, and we see the medians. "But now you can see the white lines." Yes, yes, and that's just the way it is, because this is not a perfect trick. Of course we could do it differently: we could really use the lines function to plot them, which is one possibility. The question didn't ask for it, but you can actually get rid of them, because instead of doing points with medians plus and minus, you can just use lines. If you look at the lines function: we have the x and y coordinates of the points that we want to join. So let's go to Notepad. We can say: for x in 1 to the number of columns of onlyweights, do a lines call, and for the x coordinates I'm going to say c(x, x).
Because I want to start at x and stay at the same x: I want to draw a vertical line, not a horizontal one. Then the y positions of my line are going to be the median at position x plus the standard deviation at position x, and the second y position is going to be the median at x minus the sd at position x. This does it better, because now you don't have the little white line on top; and I can't use type = "h" here, because I'm using the lines function. I still want to make them red. So in this case, if we go to R, it now looks exactly the same, but you don't see the little white bit at the bottom, because now I'm really specifying: draw this line from here to here, for each of the columns. And the from-where-to-where is: I'm at position x, and it starts at the median plus sd, so I'm drawing the line from top to bottom; on paper it draws each segment from the plus to the minus. So I hope it's clear that you can fix it that way as well. All right, multiple plot windows. We're now going to split up the different Wurfgröße groups: we're going to see whether animals that at birth come from a large litter, so with a lot of brothers and sisters, have a different growth curve from animals born with only a few brothers and sisters. Just to get an idea of what's going on in the data. So: divide the animals into two groups, Wurfgröße smaller than 10 and Wurfgröße larger than or equal to 10, and create a multi-plot that shows the growth curves per group, in either the style of 2b or the style of 2c.
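The lines() version, sketched with stand-in medians and sds:

```r
# One explicit vertical segment per day, from median + sd down to
# median - sd, so nothing has to be painted over in white.
medians <- seq(12, 40, length.out = 8)   # made-up stand-in values
sds     <- rep(3, 8)

plot(medians, type = "l", lwd = 2, ylim = c(5, 47),
     xlab = "Time (days)", ylab = "Weight (gram)",
     main = "Median body weight +/- SD")
for (x in 1:length(medians)) {
  lines(c(x, x),                                   # same x twice: vertical line
        c(medians[x] + sds[x], medians[x] - sds[x]),
        col = "red")
}
```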
So I'm going to be lazy and take the style of 2b, because if you look at the code for 2b, it's easier to plot: it doesn't have for loops or anything. All right, so let's answer question 3a. First I'm going to split the individuals. I'm just going to take my pheno, take the column WG2, because that's the number of litter mates you have, and say larger than or equal to 10: bigL, for big litters. I'm going to ask which animals these are and then select them by name: rownames of pheno, which, like this. So I'm going to take the row names, and I use the row names because I think it's very important to select by name instead of by index; people generally refer to things by name. Then I'm just going to do the same thing negated, so all of the other individuals, though you could also write it as smaller than 10, which is perfectly fine as well; and these are, of course, the smallL, the small litters. All right, so we've defined two groups; let me go to R and show you how that looks. When we paste it in, we have the big litters, and then we have the smallL, and those are these animals. There are many more animals from small litters than from large litters, but that's just the way it is. All right, so now we want to make two plots, so we say op <- par(mfrow = c(1, 2)): one row, two columns. And then I'm going to calculate the medians; I'll call it median.big, the medians for the big-litter group: apply to onlyweights, taking only the individuals that are in bigL.
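The group split might look like this; the simulated pheno here only has the WG2 column, and the individual names are made up:

```r
# Split animals by litter size (WG2), selecting by row name rather than index.
set.seed(3)
pheno <- data.frame(WG2 = sample(5:14, 20, replace = TRUE),
                    row.names = paste0("ind", 1:20))

bigL   <- rownames(pheno)[which(pheno[, "WG2"] >= 10)]  # large litters
smallL <- rownames(pheno)[which(pheno[, "WG2"] <  10)]  # all the others
```

Selecting by name keeps the code working even if rows are later reordered or filtered, which is the point made in the lecture.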
So I'm just going to say take the big litters and then just do it like this, and then apply to this matrix, to the columns, the median function, and then say na.rm = TRUE, right? If there's no measurement, don't use it. And then I'm going to say the medians of the smalls. And then I'm just going to say small.l, and I'm going to duplicate this, and I'm also going to do the SD calculation, because I also need the standard deviations for the big animals. I'm going to duplicate this one, and this is going to be the standard deviations of the small animals, right? So just compute what I need, and now I'm just going to do my plot twice. So I'm going to be lazy and just plot the first one. So I'm going to say plot 1 to the length of the medians. Doesn't matter, because there are always eight in this case. Normally you would say the medians of big, right? And then we would say... so we just copy the plot which we had, add the axes, do these, and just do the whole thing. Why not? And in this case, we want to plot not the means but the medians of big. So we're just going to change this, and then we're going to do the standard deviations of the big ones as well. And then I'm just going to duplicate the whole thing, and at this point I should realize: hmm, I'm duplicating a lot of code now. I probably want to make a function, right? Because look at the amount of code duplication that I have, right? You can see that this is exactly the same as what I already did here. This code and this code are exactly the same. So I'm not going to do that. At this point I'm going to say: no, I'm duplicating way, way, way too much code. So what am I going to do? I'm going to write a little function. So I'm going to say plot.s2b, which is a function, right? It takes onlyweights. So it takes the weights. So this is the weight matrix, weight.m. And then I want to select on smaller or larger than 10, right?
So that's going to be one of the parameters that I need to define, right? So I'm going to just say fun is... well, not fun. I'm just going to say, how did I do that in the original assignments? No, instead of taking the whole weight matrix as input, I'm just going to take the sub-matrix, right? So I'm just going to input these things. That might be smarter. So I'm just going to say weight sub-matrix, right? And then this is going to be my function. And then I'm going to take all of the code that I had for 2B, put it in, and do a little bit of tab work to align everything. And now I'm just going to say: well, instead of using this one, calculate the two things that I need, right? So I need the medians and I need the standard deviations. So I'm just going to calculate medians and standard deviations, and instead of doing it on this one, I'm going to do it on the weight sub-matrix, right? And now I'm going to have medians here, because for some reason we switched from using means to medians, but the rest should all still be fine, right? So now I have my little function. So now I want to call plot.s2b. And what do I want to plot? Well, I want to plot the onlyweights for the big litters. And then I want to do the second plot. So the second plot is going to be the onlyweights for the small litters, right? And I can delete all of this code, which I don't need anymore. And I can just keep the op <- par(mfrow = ...) line. And then this is going to be my answer, right? So I write a small function. The small function just contains the code from 2B. So that's the way. And instead of using the whole onlyweights, I'm just going to use a weight sub-matrix. All right. So let's see if this works, and if I did not make any mistakes, because this is a lot of code to write in one go. Okay, so now it says: could not find function "sds". So let me see why it says that, because we apply this sds with na.rm = TRUE.
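The refactoring being described, one plotting function called once per group instead of two copies of the same code, can be sketched like this. Function and variable names are illustrative reconstructions, not the exact assignment code:

```r
# One function that takes a weight sub-matrix and plots median +/- SD.
plot.s2b <- function(weight.sub, main = "") {
  medians <- apply(weight.sub, 2, median, na.rm = TRUE)
  sds     <- apply(weight.sub, 2, sd,     na.rm = TRUE)
  plot(seq_along(medians), medians, type = "o", main = main,
       xlab = "Time point", ylab = "Weight",
       ylim = c(min(medians - sds), max(medians + sds)))
  for (x in seq_along(medians)) {
    lines(c(x, x), c(medians[x] + sds[x], medians[x] - sds[x]), col = "red")
  }
  invisible(medians)   # return the medians so the caller can inspect them
}

# Synthetic data: 30 animals, 8 time points; first 10 are the "big" group.
set.seed(1)
weights <- matrix(rnorm(8 * 30, mean = 25, sd = 3), nrow = 30)
op <- par(mfrow = c(1, 2))
m1 <- plot.s2b(weights[1:10, ],  main = "big litters")
m2 <- plot.s2b(weights[11:30, ], main = "small litters")
par(op)
```

Calling the function twice replaces two near-identical blocks of plotting code with one definition.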
So that should, in theory, work. Let me see if this is really what I think it is. Yes, that is really what I think it is. And why can't I apply? So let me call this one just the weight sub-matrix, right? So I'm just going to say the weight sub-matrix is this. And now I can go to Notepad++ and just step through the code as if it were the function. All right. So now, why does that not work? Ah, sds. That's what I did wrong, right? Because I have sds, and sds is, of course, my standard deviations. What I actually meant was the sd function, to compute the standard deviation. So I'm going to go back to Notepad++ and fix this little error. So I'm just going to call the sd function here. All right. So let's try this again. Go to R, and now we see that there are two plots, which is kind of what we wanted. I'm going to resize it a little bit and plot it again. So now we can see here that we have the... and now I don't know which one is which. So I'm going to go back to Notepad and add a main argument, main = main, right? Because I want to change the title based on what I'm looking at. So here I'm going to say the first one is big litters and the second one is going to be small litters. All right, so copy it, go to R, paste it into R, look at the plot. So we see that the big litters and the small litters look relatively similar, right? But what we can kind of see is that the big litters actually tend to end up being a little bit smaller compared to the small litters, right? So you do see that at the end, there does seem to be a little bit of a difference in the body weight, right? So it does seem to matter how many brothers and sisters you had. Of course, this might be the case also at the beginning, but there it's not as clear, right? So what we could do to make sure that we really figure this out is say that this function is going to return my medians, right? So that I know exactly what the median is at each of the days.
So I'm just going to return the medians, so that when I do the plot, it also prints the values that it plotted onto the screen. So now you can see that, indeed, the big litters start with an average weight of 9.6 grams, but the small litters on average are almost a gram heavier. And at the end of the curve, it's very clear: individuals who have a lot of brothers and sisters on average end up being 36 grams, individuals coming from small litters end up being 40 grams. So 4 grams of difference. And that is a big difference, right? That's like 10% of the body weight at that moment. So you see that the number of brothers and sisters that you have influences your body weight even 70 days after you were born. So there is a big influence: when you are a mouse and, when you're born, you have to fight with your brothers and sisters to get milk, you tend to end up being smaller than when you only have very few siblings to fight with. So there's a big, big effect. And this is the reason why the mother sometimes eats a couple of the young, or tends to: to give the other ones a better chance of survival. If she had to raise 12 little mice, then every little mouse would have to fight all of the 11 other siblings to get milk from the mother. But if there are only 8, then there's less fighting. So they have easier access to the milk, which in the end makes them stronger and heavier and gives them a better start. So there's a real biological reason for this happening, and we can see it in the data. "Happened almost every time." Yeah, if you ever bred mice, or many other rodents, then there is this kind of selection by the mother, saying: well, you are kind of the runt of the litter, and you'd better not be here. And you can clearly see the difference here in the plots as well, although in the plots it's not as clear as when you look at the values.
And that's not to say that mice from a big litter cannot be 40 grams, right? It's just to say that on average, they're like 10% smaller. All right, so that's 3A, right? So you see that at a certain point, when I'm copying and pasting too much code, I decide it's probably worth it to write a function, right? And I call this function twice, but it does save me around 14 lines of code. So I'm not going to duplicate 14 lines of code; it goes into a function, and I just call the function twice. So instead of having 28 lines of code, we end up with 14 lines of code plus two calls to the function. So it's a big saving of space in the script. Good. 3B: create a four-by-two notched box plot, one for each time point, showing the combined data of individuals in the two groups. Label the groups correctly in the plot. What do we learn about the effect of litter size on body weight? Make sure the axes are readable and the individual box plots have a main title which mentions the day. All right. So we have eight different time points, right? And reading it back, I understand that it's a little bit complex, and the way that I formulated it is not the best way of formulating it, right? Because this question could have been formulated more simply: for each measurement point, make a box plot showing the small litters versus the big litters. But "create a four-by-two notched box plot, one for each time point" is a very roundabout way of formulating it. But the thing is, when you are working as a programmer, you have people coming to you that are non-programmers, right? So they will describe their requirements in these roundabout ways as well, right? The question is formulated in a way that you have to read it twice to figure out what the person really wants. But that is, of course, what a programmer is forced to do as well.
Because as a programmer, or as a bioinformatician, you have to interpret the needs of someone who does not know programming, right? So they won't tell you: I want to have a histogram with blah, blah, blah. No, they will say: ah, could you make a plot that's kind of similar to this? And they sometimes will draw a little graph for you, and then you have to interpret what they want. But yeah, in this case, we want to do more or less a box plot comparison, right? So we want to show eight box plots, four by two. So four rows, two columns, or four columns, two rows. And then it should be clear. So let's just do that. So we are going to say... oh yeah, we are in Notepad. Very good. So instead of one by two, I now want to have two rows, four columns. And then I'm just going to say for x in 1 to the number of columns of onlyweights. What do I want? Well, I want to create two box plots per panel. So there are two ways of doing this: we can do this using a formula, or we can do this using a subset. So let me first show you guys the formula, because we had these big.l, small.l here. But I'm not going to take out the which; I'm going to just take out this part. So let's go to R, because I want to define a formula. So I have the WG2 column, and then I want to know: larger than or equal to ten. And now if I say as.numeric, I get zeros and ones. So one means that the litter was greater than or equal to ten, and zero means that the litter was smaller than ten. So I can now use this to do a model. So I could say, from the phenotypes, take the day 21 column and then model it by this vector of zeros and ones, because we now have groups. And then I could say: well, do a boxplot. So when I do it like this, it now makes a boxplot for me saying zero and one. So this is the zero group.
So the zero group is the one for which this was false, and the one group is the one for which it was true. So these are the small litters, and these are the big litters. Of course, this plot looks a little bit annoying, because the axes are not okay; the labels need to be flipped, and all of these things. But I can do that. So I can just say main = "Day 21". So then it already... oh, let me close the window. So this is a little bit better. So now I know which day I'm looking at. I want to have it notched, so notch = TRUE. Then it starts looking like this. Then I want to say xaxt = "n", right? Because I don't want the default x-axis. I want to have the x-label being litter size. So now the litter size. I want to flip the numbers here, right? So that they are nicely readable, which is comma las = 2, right? So now the numbers are the correct way around. And then I want to say the y-label is body weight, and that is measured in grams, right? So this is more or less the plot that I want. I still need to add the axis. So I'm going to say axis(1, at = ... at = c(0, 1), right? Because this was group zero and this is group one. Well, group zero is actually called small and the other one is called big. And that's wrong, because the boxes are actually at one and two. All right, so let's do the box plot. Do this. So this is how I want my box plots to look. So I'm just going to make a plot and then continue and continue. All right, so now I can take this code, right? I can copy this, go back to Notepad, and put it into the for loop. So I'm going to say: well, for each of the columns of onlyweights, instead of saying the pheno column, I'm going to say onlyweights at column x is modeled by the as.numeric of this. And that's perfectly fine, because this doesn't change, right?
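The boxplot call being built up step by step here can be sketched in one piece. The data are synthetic and the labels are illustrative:

```r
# Synthetic data for one measurement day.
set.seed(7)
litter <- sample(5:14, 60, replace = TRUE)   # litter size per animal
weight <- rnorm(60, mean = 30, sd = 3)       # body weight at that day

groups <- as.numeric(litter >= 10)           # 0 = small litter, 1 = big

# Formula interface: weight explained by group, with the cosmetic fixes
# from the lecture (notch, no default x-axis, rotated tick labels).
boxplot(weight ~ groups, notch = TRUE, xaxt = "n", las = 2,
        main = "Day 21", xlab = "Litter size", ylab = "Body weight (g)")
axis(1, at = c(1, 2), labels = c("small", "big"))  # boxes sit at 1 and 2
```

Note the same off-by-one the lecture runs into: the two boxes are drawn at x positions 1 and 2, not 0 and 1, so the axis() call must use at = c(1, 2).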
It's not different, and I can actually call this slightly differently, because I'm going to do this Ctrl-X and call it groups, right? So groups is this, and this is answer 3B, right? So I'm going to define my grouping variable. Then I'm going to take column x out of onlyweights, and then I'm going to split that by groups, and then the main is not going to be "Day 21"; the main is going to be the colnames of onlyweights at the column x that I'm currently looking at, right? And then this can go to the next line, because I want to notch them, don't do the labels, don't do the axis, and now it should go through all of the different days, and at each of these days it should make a boxplot for me showing the small litters versus the big litters. All right, let's see if this works. So go to R, and: object 'group' not found. Yes, because it's groups. So let me fix that quickly in Notepad. I'm going to copy and paste this back in. And now here we see that we have boxplots for each of the different days, and we can see that at the beginning there is a significant difference in body weight between the two groups. At day 28 there's still a significant difference, and this difference stays significant more or less until the end of the curve. So we know that at each time point that we look at, there is going to be a significant difference between animals born from large litters compared to individuals born from small litters, with the individuals from small litters being heavier than the animals born from big litters. Good, so that was question 3B. And of course I could have done the subset, right? I don't have to do it using this formula. I could just make two boxplots and then use add = TRUE, right? I could set up the plot window myself, add the first boxplot, add the second boxplot. But this is just a little bit quicker, and this is this way of noting down formulas, right? This is a formula in R.
This is something that we will start using more and more, especially when we do statistical testing, because that is where we want to do group comparisons. So if I do a t-test, I can specify the t-test the same way as I'm specifying my boxplot here. So this tilde here means that this is a y-explained-by-x kind of structure. Good, for the subsets, let me see. Yeah, subsets work more or less the same way. This was the code that I had for the original subsets. So let me just copy-paste it in for you guys. So this is 3B from the answer sheet, right? So I did this before. So what I'm doing here is splitting onlyweights into a small-litter subset and a large-litter subset, and then I do the plot.s2c or plot.s2b. But then here for this 3B, what I do is just say: make a boxplot of the small-litter subset for this day and the large-litter subset for this day, and in the main title substitute the "D" by "Day ", right? So instead of having a D and then a number, it says day, space, number; notch = TRUE in the same way as we did before, and then here it has the axis. But here the trick is to subset the onlyweights matrix into two separate matrices, and then just make the two boxplots in one panel. So that's using the subsets. Anyway, the answer sheet will be online as well. But this is when you use the formula operator. Good. I don't think we actually discussed the formula operator yet, but it's something that we are going to do today. But this is an alternative to doing the subsets and then plotting the subsets. Good. So 3B. Then we go to image. Okay. So using the data from the genotypes.txt file, first use the data from map.txt to subset the genotypes for a single chromosome. After you create a subset for e.g. chromosome one, create an image for the chromosome showing the different SNPs in the order in which they occur in the map, right? And then there are all of these steps that you need to do.
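The claim that a t-test takes the same formula syntax as the boxplot can be shown directly. The data here are synthetic, constructed so the two groups really differ:

```r
# Two groups coded 0/1, with a real mean difference between them.
set.seed(3)
groups <- rep(c(0, 1), each = 25)
weight <- rnorm(50, mean = 30 + 3 * groups, sd = 2)

# Exactly the same "y ~ group" formula used for boxplot() works here.
res <- t.test(weight ~ groups)
res$p.value   # small, since the group means genuinely differ
```

So once the grouping vector exists, switching from visual comparison (boxplot) to formal testing (t.test) is a one-line change.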
And because of the fact that, when we look into R at the map, right? So if we look at the map, then we see that the map is not sorted. And this happens a lot: you get data which is not in the exact order that you want. So there is a sort function. So we are just going to use the sort function to sort the map, right? So what we are going to do is say: well, we have this map, so, hashtag 4 for the comment. So we have this map, right? And we want to sort this based on the position. So when we look into R, we see that the base-pair position column is called Mb_NCBI38. You don't have to worry about what it exactly means, but this is the physical position (in megabases) and this is the position in centimorgans. So genetically speaking, genes are located on the genome at a certain position, a certain number of base pairs from the start of the chromosome. But centimorgans are a recombination-based distance, right? But that's all genetics, and we don't really want to deal with it. We just want to first sort on this column, right? Because we want to sort all of the markers in ascending order. And then of course we are going to subset and take out chromosome one or chromosome two or chromosome three. But as soon as we sort by this column, everything will be sorted from smallest to largest. And then we can just take out chromosome one. And of course, if this is sorted based on this column, taking out chromosome one will also give us chromosome one sorted. So let's go to Notepad. So I'm just going to say map, take this Mb_NCBI38 column, sort it, and set index.return = TRUE, right? So return me the indexes, because I want to use the indexes. Just say index.return = TRUE, right? Because I want to have the ordering: I want to have the numbers sorted, but I also want to know for each number where in the sorted data it belongs, right?
Because it could be that the first element, the first number that we see here in R, like 135 million, would actually, if you sorted everything, be the 200th number. And this number here, 16 million, would be the 25th number, right? So that's what index.return does. It doesn't just give me the sorted order back, but also gives me the index at which each of the numbers was, right? So let me just show you guys how this sort function looks, right? So first, I get back all of the positions sorted from lowest to highest. And of course, there are a bunch of markers; oh yeah, there were 3,000 markers, so this runs for a little bit. But once it has done all of the marker positions, then you see here that, interestingly enough, this first number that we saw is actually number 414 in the sorted data. And the second one is 1280. That doesn't seem right in a way, because the first number seemed to be much bigger than the second number, right? You would expect this one to be much later than the first one, which it is not. So I'm just going to make sure that they are numeric values. So I'm going to say as.numeric, right? Just in case R is messing something up. So I'm just going to call this order, or ordered, right? So now if I look at ordered$ix and I look at the first 10... yeah, it still doesn't make sense. So let me see... if I take the map... it'll be fine. It'll be fine. R probably knows better than me how to sort stuff. Oh, you're here, very good. You're just in time, just in time. You almost made it, or you almost missed the custom-made GIFs of Oscar. But let's finish ordering the map just so that we have that done. So I'm going to sort these things.
I'm going to return the index, and I'm going to call this ordered, and then I'm just going to say map, use ordered$ix, so the indexes, and then I'm going to store this back into the map, and then I'm going to see if the map is really sorted in order. And just for my own peace of mind, I want to know that it is indeed true. So now if I look at the first 10 elements of the map, then I do see that this column seems to be sorted in order. And now I understand what ix was telling me: it says that the smallest number of your data set is at position 414, the second smallest number of your data set is at 1,208. But you can see that it just sorts all of the numbers, and now I can take out a single chromosome, because I stored this back into map. So now if I say map, Chromosome == 1, and I take those rows out of the map, then these would be all the markers on chromosome 1, and they would now be sorted in the correct order, from smallest to largest. So that's what we are looking for. Good, so since Han Solo is here, we will do a little break. So we're going to take a break of around 10 minutes. I will run outside, have a quick cigarette, drink a little bit of coffee; not for you guys, but I will drink coffee for myself. Please take a little break. There will be music again, and like I said, we will have animated GIFs of my own cat. Unfortunately, there is a very limited number of GIFs, so they will loop and loop and loop, but I don't have like 20 animated GIFs of my own cat. I had a couple, but some of them were not fitting the format, so I had to switch them around a little bit. But with that out of the way, let me start some music for you guys, and then we will start the first break, and I will be back in 10 minutes. A very unexpected end of the song. All right, I hope you liked the custom GIFs. They're unique, right?
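The map-sorting steps from before the break can be sketched as one small example. The map here is synthetic; only the column names (Chromosome, Mb_NCBI38) follow the lecture:

```r
# Tiny unsorted map: chromosome and physical position per marker.
map <- data.frame(Chromosome = c(1, 1, 2, 1, 2),
                  Mb_NCBI38  = c(135.2, 16.4, 90.1, 3.7, 12.5),
                  row.names  = paste0("marker", 1:5))

# sort() with index.return = TRUE gives $x (sorted values) and
# $ix (where each sorted value sat in the original vector).
ordered <- sort(map[, "Mb_NCBI38"], index.return = TRUE)
map <- map[ordered$ix, ]      # reorder the whole map by position

map[, "Mb_NCBI38"]            # now ascending: 3.7 12.5 16.4 90.1 135.2

# Because the map is sorted, subsetting one chromosome stays sorted too.
chr1 <- rownames(map)[map$Chromosome == 1]
```

The key point from the lecture: reorder the full map once, and every per-chromosome subset taken afterwards inherits the correct marker order.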
I've never done custom GIFs of my own cat before, so I would like to thank my moderator for making them. So, perfect. Everyone loves cats, and Oscar's the best. All right, so let's continue with the assignment, right? So we now ordered our genetic map, and after we've ordered our genetic map, the assignment says: iterate through the chromosomes and select the SNPs, e.g. using row names, from the genotype matrix that are on the chromosome, convert the genotypes to numeric values, then open the device for plotting, use the chromosome name to create unique file names, create the plot, and so on. So we're just going to write out the images, right? So in the end, what we want is: we want to say map, and from the map we want to take the Chromosome column; for example, which ones are equal to one? And then we want to take the row names of the map which are on chromosome one; I'm going to call this onchr. And then we can use this to select from the genotypes the markers which are located on chromosome one, in the order in which they were sorted, right? So we already did this, so the first marker on chromosome one is called UNC01-blah-blah-blah, and then the second marker is called like this. So let me redo the ordering... and now, Chromosome is written with a capital probably, right? That's probably the error. Yes, Chromosome is written with a capital. So Chromosome, capital, and we go back to R. Right, so onchr: these are the markers which are on chromosome one. There are 193 markers, and indeed it starts with this one, and then the next one, so it seems to be okay. Right, so we then want to take those from the genotypes, right? So let's just look at the genotypes quickly, which I stored in geno. So here we see that some individuals are A, some individuals are B, and some individuals are H, right?
So A means that you have two chromosomes from parent A, H means that you have one chromosome from parent A and one chromosome from parent B, and B means that you have two chromosomes from parent B. So that's the way the data is structured. And now we want to make an image of this, but to make an image, we need to have numeric values. So to make numeric values, I am going to do a little trick, because I'm going to go through each of the rows of this matrix, right? And then I have to transform the As. Why do I select only 1 to 10 in the genotypes? Because there are 3,000 markers and 100-plus individuals. So it would blow up the whole screen, right? If I would just say: give me the first 10 rows, then it would already start scrolling and scrolling, right? There are many, many individuals, and many, many markers. So I'm only showing you a very, very tiny part of the data, because the whole data is just too big to show. And this is only the first 10 rows; in total there are like 3,000 rows. So then it would be like 30 times longer waiting until you see everything. All right, so the markers on the chromosome. So from geno, right, take the markers which are on the chromosome, right? And then this is my geno.chr, right? So my genotypes on this chromosome. And then I'm going to apply, to geno.chr, over the rows, a function of x, right? So x is going to be the current row that I'm looking at. So I'm going to say as.factor of x with levels = c(...), because I want to have the As being parent 1, I want to have the Hs being the middle group, and I want to have the Bs being the last group. And after I convert it to a factor, I'm going to say as.numeric, and now it will convert the As, which are the first level, to the number 1. The Hs will be converted to 2s and the Bs will be converted to 3s. So let me see if this works, and I'm only going to do this for, again, a little piece.
So 1 to 10, 1 to 10, right? So let me show you guys how that looks. So if we do the subset, right? Then let me first show you the first 10 markers, the first 10 individuals; it looks like this. And then I'm going to use my little apply function, and this apply function complains about the levels argument. Why is that wrong? Well, I have to use factor instead of as.factor. So I've just changed the as.factor to factor. factor is a function which takes values and converts them to factors, using the levels you give it, in that order. as.factor just converts them as is, with its own default level ordering, and it doesn't take a levels argument. So this should then work. Right, so we have to check, of course, if it worked. So let me also output the first 10 values for this. So in my mind, I wanted to convert the Hs into 2s. That's okay. In this case, the B got converted into a 1, which is wrong, and the A... oh, hey, that's interesting. So you see that it actually flipped it around. Yeah, that's okay. So by going through the rows, the markers which used to be in the rows are now all of a sudden in the columns. Well, that's an easy fix. We can easily fix that by just saying: transpose the whole thing after we do this. So turn the rows and the columns around. All right, so let's make sure that it's okay. So let's print the data as it is and then do the transpose. So let's see if it works. So here we have h, a, h. So this would be 2, 1, 2. This is 2, 1, 2. And here we see b, b, b, b, b. So that should all be 3, 3, 3, 3, 3. That also seems to be okay. a, h, h, h. So 1, 2, 2, 2. And that is 1, 2, 2, 2. So it does seem correct. For those of you not looking at the R window: it's just checking if the conversion went okay. So here we see h, a, h, and that got converted into 2, 1, 2, which is correct. And here we see b, b, b, b being converted to 3, 3, 3, 3, just like I wanted.
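Both fixes discussed here, factor() with explicit levels and the final transpose, can be shown on a tiny matrix. The genotype matrix is synthetic; the A/H/B coding follows the lecture:

```r
# Tiny genotype matrix: 3 markers (rows) x 3 individuals (columns).
geno <- matrix(c("A", "H", "B",
                 "H", "H", "A",
                 "B", "A", "H"), nrow = 3, byrow = TRUE,
               dimnames = list(paste0("marker", 1:3), paste0("ind", 1:3)))

# factor() (not as.factor()) accepts a levels argument, so "A", "H", "B"
# map to 1, 2, 3 in that order. apply() over rows puts its results in
# columns, hence the t() to flip the matrix back to markers-in-rows.
geno.num <- t(apply(geno, 1, function(x) {
  as.numeric(factor(x, levels = c("A", "H", "B")))
}))

geno.num["marker1", ]   # 1 2 3
```

Without the t(), marker1's values would sit in the first column instead of the first row, which is exactly the flip the lecture ran into.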
All right, so now we know that it works for this small subset. So now we're just going to do it for the whole thing. So let's go back to Notepad and remove the 1:10, 1:10 here, and we're just going to convert everything. And then I'm going to say that this is geno.num, the numeric genotypes. All right, and now I can just do the image, right? So I can say image(geno.num), and then it should plot my numeric genotypes and give me a heat map. So this is then a heat map of chromosome 1. Let me actually do this full screen. So you directly see that there's something missing, right? We don't know if these are the markers or if these are the individuals. But we can see that there is some structure in there, right? So for chromosome 1. In this case, we might want to transpose it to see if that makes a better plot, right? Because we know that we have a certain number of individuals and a certain number of markers. But let me actually figure out which one is which. So I'm just going to say image, 1 to the number of rows of geno.num, and then 1 to the number of columns of geno.num, and then geno.num as the values, right? So now we can see that we have 100, like 200-ish. So I now know that these are the markers, because in total we have... The legend is also missing. Yes, there's no space for the legend in this plot either, right? But we can now see that these are the markers and these are the individuals. Because if I ask for the dimensions of, for example, the phenotypes, then we see that we have the phenotypes measured on 350 individuals. So I know that these are the individuals and these are the markers, right? So if we look a little bit, then we can see that this individual inherited a part from one parent, then there is a recombination, and then we continue on being from the other parent. So it's a way of visualizing large genomic data, right?
So in this case we can look at it and it looks really good. But what we actually want to know is if these markers are linked to each other, right? Because in genetics there's something called linkage. And of course we can take geno.num and we can calculate, for example, the correlation of geno.num, right? And then I'm going to say use = "pair", because if there's an NA I don't want to be bothered by it. And then I'm going to call this geno.cor. So just to check if marker one is correlated to marker two more strongly than it is to marker 20, right? Because for every marker in the genome: if markers are close together, they should be correlated, and if they're far apart, they should not be correlated. So now I'm going to make the same image, but not of geno.num; I'm going to look at the correlation of the markers to each other, and let's see if this works. And this is not correct, because these are the individuals against the individuals, right? Because I have 300-something individuals and only 190 markers. So I'm going to say: calculate the correlation on the transpose of geno.num, because I want to do the correlation not on the individuals; I want to do the correlation on the markers. And then it gives me a warning message, which is perfectly fine. And now we start seeing a plot like this, right? So we see that for some markers it cannot calculate the correlation, probably because of missing values, but we do start seeing that there is a pattern, right? So there is a high correlation between markers which are close together, and the correlation is low when markers are far apart. Actually, we could do this a little bit better, because I generally want to look at the absolute correlation, because it could be that something is inverted. And when we look at the absolute correlation, it starts becoming really clear that this is indeed a genetic map, right?
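The marker-correlation step can be sketched as follows. The genotype matrix is random (so no real linkage pattern will appear); the point is the shape of the computation, with illustrative names:

```r
# Hypothetical numeric genotype matrix: 20 markers (rows) x 50 individuals.
set.seed(9)
geno.num <- matrix(sample(1:3, 20 * 50, replace = TRUE), nrow = 20)

# cor() correlates COLUMNS, so transpose first to correlate markers,
# and use = "pair" handles NAs pairwise instead of failing.
geno.cor <- cor(t(geno.num), use = "pair")
dim(geno.cor)   # 20 x 20: marker-by-marker correlations

# Absolute value, as in the lecture, so inverted markers show up too.
image(1:20, 1:20, abs(geno.cor),
      xlab = "Marker", ylab = "Marker")
```

On real data from one chromosome, the diagonal blocks of high absolute correlation are the linkage structure the lecture describes.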
So we can see these blocks where markers are relatively highly correlated to each other, and sometimes there are bigger structures, but this is the genetic inheritance across chromosome 1. So that's nice that we can do something like that. But this is way beyond the question, because the question was just to show the different genotypes, which means plotting not `genocor` but the numeric genotypes that we're looking at here. So we're just going to do that. We created the code to do it, and I'm going to paste the code from R back into Notepad, and now I'm going to say `for(chr in c(1:19, ...))`, because a mouse has 19 chromosomes, plus an X chromosome and a Y chromosome. Then I'm going to do the whole thing, and instead of showing the image here I'm going to write the image out to disk. So I'm going to say `png()`, the filename is going to be "Geno_chromosome" with `chr` pasted onto it, and then ".png", so I'm just going to make a PNG image. Then I do the plot, and then I call `dev.off()` to save the plot to disk. That should look pretty okay-ish. And here the chromosome needs to be not 1, because that was just the one I used for testing; I'm going to say: do it for the current chromosome that I'm looking at.
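The per-chromosome PNG loop might look like this. A sketch only: the subsetting of the genotype matrix to the current chromosome is elided, and the real code plots only the markers on `chr`:

```r
# Simulated stand-in for the full genotype matrix.
set.seed(42)
genome <- matrix(sample(c(1, 2, 3), 190 * 350, replace = TRUE),
                 nrow = 190, ncol = 350)

for (chr in c(1:19, "X", "Y")) {    # mouse: 19 autosomes plus X and Y
  # One file per chromosome: Geno_chromosome1.png ... Geno_chromosomeY.png
  png(filename = paste0("Geno_chromosome", chr, ".png"))
  image(genome)                     # real code: only the markers on `chr`
  dev.off()                         # close the device so the file is written
}
```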
Alright, so now I have to show you guys a window which I have not prepared, so let me add a new window capture. I want to capture this window: this is my hard drive, where I'm going to save all of the PNGs. So I'm going to run the code in R, and if everything goes well we should see PNG files being created here. I'm going to go to R, copy in the code, and now it will start creating my PNG files. It's putting them directly on OneDrive, so it's also uploading them to the cloud, but it will just make all of the images one by one. And it's done. So now we can look at these individual images by zooming in a little bit, and here you start seeing all of the different chromosome images. So that was the assignment: show the genotypes for all of these. And of course we could define our own colors, because right now, like Petter said, we have different colors and we don't know what the colors mean. So let's define our own colors just to make sure. I'm going to close my window capture and the R window. I want to do the image, and in this case I want my own colors, so I'm going to say `breaks = c(...)`: we have three genotype codes, right? So I'm going to say go from 0 to 1.5, then from 1.5 to 2.5, and then from 2.5 to 3.5. And then I'm going to say `col =`: everything in the first range, the ones, I'm going to make blue. Then the heterozygotes, the individuals which are coded 2, I'm going to make them... not yellow, yellow is a bad color, I'm going to make them green. And the other ones are going to be orange. So now I define the colors myself. Now if we do the image, and I'm just going to do it in R, I'm not going to plot them all out to disk again... That seems a little bit off; that doesn't seem like what I want. Oh right, because we're looking at the last chromosome: chromosome Y, which is a very, very small chromosome. So I'm going to say chromosome is 1 again and plot the first chromosome. Instead of writing out the PNG, I'm going to paste this into R like this and make the image, and now it should use my own colors. And I have to close this window as well. So now we know exactly what is what: the green colors are where individuals are heterozygous, the blue colors are where the inheritance comes from parent one, and the orange colors are where the inheritance comes from parent two. This is just for you guys to play a little bit with the `image` function. And yeah, we could put in a legend, which is perfectly fine, so let me actually do a legend just so that we have one. The problem, of course, is that the legend will always be on top of some of the data, but we can just say: add a legend to the top right; I want to have in the legend A, H and B, and then `fill =` the color vector that I had defined here. And now you see that it does a little legend here. It's very poor because it doesn't have a background, so I'm going to say `bg = "white"`, so that it has a non-transparent background. So now we have an image with a little bit of a description: there is a little legend in there as well. Of course it hides some of the data, but that's fine in this case. Good. So question number five, the chromosome plot; that was a big one. From the data, create a chromosome plot showing the locations of the markers on the chromosomes; use the code that was given as an example during the lecture, because in the lecture I showed you guys how to make a chromosome plot. But I'm going to code one live for you guys now
just because we can, right? We have different data on different chromosomes, and we have already ordered our map, so that's good. So the first thing that I need to know is: what is the longest chromosome? The longest chromosome is going to come from the map: I take the base-pair positions, which are stored in the Mb NCBI38 column, and I ask for the maximum value, because this is the position of the last marker on the chromosome. So this is my `max.y`, my maximum y value. If I go to R, what does my maximum y value look like? It's 194 million base pairs. So now I know that on the x-axis I need to have 19 chromosomes plus X and Y, and on the y-axis I need to go from zero, or one actually, because the first position in the genome is called one, all the way to 194 million. So let's set up a plot like that. I want to have a plot: on the x-axis, go from zero to, well, 19 plus 2 is 21, so I'm going to go from zero to 22 just to have a little bit more space. And on the y-axis, instead of going from zero to 200 million, why not just go to `max.y`? So I'm going to write `max.y` and multiply it by 1.1, to give 10% extra space on top. Then I'm going to say `type = "n"`, the x label is going to be "Chromosome", the y label is going to be "Position", and I want the x-axis not plotted, because I have to put my chromosome names in myself: 1 to 19 fits, but then there's still X and Y. And I'm going to say `las = 2`, because I want to flip around the labels on the y-axis, to start off with. So let's make this plot; this is hashtag number 5. Let's go to R and make an empty plot. So now here we see our positions, which are written in a weird scientific notation, and here we see our chromosomes. I don't like that weird scientific notation,
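The empty canvas described above can be sketched as follows. The `map` data frame is simulated here; in the real session the base-pair positions come from the Mb NCBI column of the marker map:

```r
# Simulated marker map: chromosome assignment plus base-pair position.
set.seed(1)
map <- data.frame(chr = sample(c(1:19, "X", "Y"), 500, replace = TRUE),
                  pos = runif(500, 1, 194e6))   # stand-in for Mb NCBI38
max.y <- max(map$pos, na.rm = TRUE)

chrs <- c(1:19, "X", "Y")
plot(c(0, 22), c(0, max.y * 1.1), type = "n",   # empty plot, 10% headroom
     xlab = "Chromosome", ylab = "Position", xaxt = "n", yaxt = "n")
axis(1, at = 1:21, labels = chrs)               # chromosome names by hand
ticks <- seq(0, max.y * 1.1, by = 25e6)         # a tick every 25 Mb
axis(2, at = ticks, labels = round(ticks / 1e6), las = 2)  # shown in Mb
```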
so I'm going to disable the y-axis as well, because I want to have normal numbers on it. I'm going to do my own axes. The x-axis: `axis(1, at = 1:21)`. What do I want to plot there? Well, 1 to 19, then chromosome X, then chromosome Y. And just to make my life a little bit easier I'm going to store this in a variable called `chrs`, for chromosomes, so I can use it instead of having to type out the vector every time. Then I'm going to say `axis(2)`, the second axis. We have 200 million positions there, so let's put a tick every 25 million base pairs. I'm going to say `at =` a sequence from 1 to `max.y` times 1.1, going by 25 million, and that needs to go inside the `seq` function like this. So I'm just defining where I want my axis ticks. And now, what do I want to show there? I want to show 25, 50, 100 and so on, because showing 25000000 doesn't make sense; why not just use 25? So I'm going to take the same sequence as before, but now divide it by a million, that is, 1e6. Alright, so now my plot will start looking like this when I throw it into R, and it's still an empty plot. So here we see chromosome 1, 2, 3. Because I started the sequence at 1, it always has this 1 on the end, so it's only one base pair off; I'm just going to ignore that and round it down to 0 digits behind the comma. So `round` the sequence to 0 digits, and that is going to be my second axis. And I'm going to say `las = 2` because I want to have the labels the other way around; the way I want it is slightly different from what R does, but I want to get rid of this 0.0001 stuff. So we're cheating a little bit: these numbers are one base pair off, but that doesn't really matter on 25 million base pairs. So here we see our position. I have to
now update my axis label, of course, because my position is in megabase pairs, and that's going to be like this. Good. So now I want to plot the chromosomes, and I want to plot where the markers are. So I'm going to say `for(chr in chrs)`: for each chromosome in the chromosomes that I defined, I want to do something... Well, no, I'm going to do this differently. I'm going to say `for(x in 1:nrow(map))`, because the map contains the positions. So I take the map at row x, the chromosome column: this is the x position where I need to plot it, but of course the chromosome could be X. If I just go to R and ask, then this marker that we're currently looking at is located on chromosome 3, so I want to plot it at 3. But how do I figure that out in such a way that it also works for X? I need to figure out which number this chromosome is in my chromosome vector: which of the chromosomes is the current chromosome that I'm looking at? And then it will tell me: this is number 3. And if this thing here were X, it would tell me that X is at position 20, and a marker on the Y chromosome would be at position 21 in my chromosome vector. So now I know for sure that it will plot at the correct position, and I can do this using `which`, saying: which of the chromosomes is the current chromosome that I'm looking at? That gives me one number, and that is my x position. So I'm going to say that `xpos` is this, and then my y position is going to be the position of the marker, so `ypos` is `as.numeric` of the map at row x, the Mb NCBI38 column. And now I can just say `points(xpos, ypos, pch = "x", col = "red")` or something
like that, right? Good. So I make my plot, I put the axes, and then I go through each of the rows of the marker map and plot a little red x at the position where the marker is located. So let's see if this works... "'x' and 'y' lengths differ". Okay, so what is `xpos`? Alright, that's an interesting one, because the current position that I'm looking at... I'm thinking: the mitochondrial genome! I totally forgot about the mitochondrial genome; there is also a mitochondrial genome. So let's just add it: "M", then this needs to go to 22, and I'm going to make my plot go to 23 then. I totally forgot that there are also mitochondrial markers, and of course that won't work, so I have to add mitochondria to my chromosomes as well. Interesting that... no, here I'm just ignoring the mitochondria, so I didn't do a plot for the mitochondria, but that's fine. Alright, so let's see if this works then. So now it will plot a little red x at each location in the genome, and I could actually make this a little bit nicer by saying `cex = 0.2`, so make these massively big x's a little bit smaller. If I go to R, the x's should now be a little bit smaller. Trust me, they're still x's, but they're really small. And now we can start seeing: oh, that is interesting! On our genetic map we have a little gap here at like 25 to 40 megabases, and here on chromosome 10 we actually have a very big gap: there are no markers from around 30 megabases all the way to like 60 megabases. There's no resolution there, because we don't have a marker. But this is how you do it, and this is similar to what was in the example, although again a little bit different, because plots are very creative: you can do anything that you want. We can also play with colors, because we are just looping over `x in 1:nrow(map)`, but we could use `xpos`, so we could do some magic, use different colors, and make it look a
little bit more interesting, right? So we could just say 0.4, to make it a little bit nicer. Say we have two colors: if our chromosome number is even I want it red, and otherwise black, for example. Then we can use our trick again: `xpos %% 2 == 0`, the remainder after Euclidean division. I think that's this one, let me see: `xpos %% 2`... 1, then let's make it 2, and then FALSE... yeah, so that's the Euclidean remainder; just checking a little bit that this works. Then I can say `as.numeric`, so I'm just asking: is it even or odd? And then this is going to be a little tricksy, because I say even-or-odd plus 1. Of course I'm selecting from a color vector where this is number 1 and this is number 2, but the test gives me back 0s and 1s, because something is either even or odd, so it's TRUE or FALSE, 1 or 0. For selecting from a color vector like this, where this is number 1 and this is number 2, I just have to add 1 to it. So I'm going to use this even-or-odd vector, and now if we plot it, some chromosomes should be red and some of them black... That's wrong: I took the Euclidean division but not the Euclidean remainder. So let me fix it like this, and now this is how it should be: chromosome 1 is red, chromosome 2 is black, 3 is red, and so on. So we can add all kinds of additional information to it. Alright, perfect. Then there was still one additional question, which I will upload the answer to, but I won't go through it with you guys; that's left as an exercise. So again, they were heavy, the exercises this time. I used 123 lines of code, and 123 lines of code is a beefy exercise. I know that these exercises are beefy and you have to spend some time on them; I already spent like one hour and 45 minutes on the assignments,
but these are there for you guys to practice, because if you don't practice you will never get good at programming. That's why the exercises are long. I'm trying to take you by the hand by having relatively big descriptions, but I hope everyone was able to do it, or at least tried. It's not about finding the correct answer; the assignments are there for you to practice. Get stuck, send me an email, or google around. They are challenging for a reason: not because I like you guys to fail, but because I want you guys to learn. Great. Oh, welcome! Good. Alright, that took a long time, almost one hour 45 minutes, like I said, for the assignments, but I think this was useful: you guys can see how I think about these things, how I build up my own code, and how I try to reason my way through the exercises. I hope that it was interesting for you guys as well. And for today, we can take a break first, a seven, eight minute break, and then we will start with the lecture for today: statistical testing. I will quickly show you guys what we are going to do. We are going to talk about microarrays, not so much because I want you guys to know stuff about microarrays, but because microarrays are one of these techniques which... well, they're not really novel anymore. They were novel when I started doing bioinformatics, but nowadays they have been in the field for like 10, 14 years, so they're not that novel anymore. But microarrays are good because they hit on all of these little questions in statistical testing. When you do microarray analysis you have to talk a little bit about experimental design. I'm going to explain to you guys what microarrays are; of course I'm not going to ask you questions about microarrays during the exam, that makes no sense because it's a programming course, so you should be able to pass a
programming course without knowing anything about microarrays. But I'm going to explain to you guys how they work, and then we're going to talk about things like normalization, which is a programming thing. Then we will talk about probabilities, about the Student's t-test, about things like the normality assumptions that come with the Student's t-test. We will talk a little bit about correlation, a little bit about multiple-testing issues, and then I'm also going to show you guys where you can get free microarray data, in case you want to do more analysis, or more practicing of statistical testing to show whether genes are differentially expressed or not. So with that being said, this is the overview of today. I'm going to go back to the intro slide and we will do the second break. The second break is going to be tigers, and again we will have a little bit of music. I will do a quick cigarette, probably, and get myself a new cup of coffee. So take a little break, relax your mind a little bit, get yourself a cup of coffee or a cup of tea, I don't mind, and I will see you guys in around 9 to 10 minutes. I hope I'm back before the song ends; I think the song is very short, so I hope I make it back in time. We'll see. Anyway, let's continue and start the music. ... Made it back before the music stopped! Someone put the elevator on the wrong floor, so I had to really run for it. Hi, so we're back, so let's do the lecture. Like I said, we are going to talk about experimental design, the questions that you can answer when you use microarrays. We will talk about things like normalization of microarray data, then probabilities: what is a t-test, normality assumptions, correlations and such, a little bit about multiple testing to make sure that you know what is significant and what's not, and I also wanted to show you guys the microarray data. When I talk about experimental design and project planning, I always love this quote of Ronald Fisher,
so the quote is like this: "To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of." And that is really true. If you are doing science and designing experiments, consult a statistician at the earliest time point. Have someone check your numbers, because you don't want to do an experiment and in the end figure out that with 10 more samples you would have been able to prove what you wanted to prove, but because you did not do those 10 samples you lack the statistical power. So make sure to always ask someone more knowledgeable than you to check your statistics, to make sure that it works. You don't want to end up in a situation where you spend a lot of money gathering data, do the whole analysis, and then figure out that your p-value ends up being 0.06, just borderline not significant. So, microarrays. Microarrays are little glass plates that are used to measure the expression of genes. When we do these kinds of experiments we generally want to know things like: which genes are differentially expressed between healthy tissue and diseased tissue? We take 50 cancer samples and 50 healthy samples, and we want to see which genes differ between the healthy tissue and the cancer tissue, to perhaps find biomarkers that we can use to detect cancer, or which can be targets for intervention studies. We can also, for example, do tissue profiling, so we can ask: which genes are expressed in which tissue? Can we find, say, 10 genes such that by measuring these 10 genes we can figure out if the tissue we're looking at is liver, or lung, or brain, or some other tissue? That might be an interesting question. And of course there is also the question: does the gene expression change after we administer a drug or
treatment? If we are thinking about a big drug intervention study where we give 500 people a drug and 500 people a placebo, we might be interested in what the drug actually does on a molecular level: which genes does it turn on or off? So that is one of the other questions that we can ask, and of course all of these questions have different design considerations. But first I want to talk a little bit more about microarrays. There are two fundamentally different types of microarray. There are one-channel microarrays, which provide intensity data for each probe or probe set, indicating a relative level of hybridization with the labeled target. That means that you just get an intensity: you have these little spots on glass, like you have here (this one is a two-color array, but a one-color array looks very similar), and each spot just has an intensity value. It's either fully on, or off, or somewhere in the middle, and based on that we know whether a gene is highly expressed, medium expressed, or not expressed at all. Two-channel arrays are arrays on which you hybridize two samples together; that's what we see here on the side. For example, we have cancer cells which we grew in culture and normal cells which we grew in culture. We isolate the RNA, then we take the RNA and make cDNA using reverse transcriptase. We label the cancer sample red and the healthy, normal tissue sample green, then we combine the two in equal, or equimolar, amounts and put them on the microarray. Then we can see which genes are highly expressed in the cancer cells, because those will show up as red spots, versus genes which are highly expressed in normal cells, which will show up as green spots. So if we are asking the question what is different between healthy tissue and diseased tissue, then we are
generally dealing with two-channel microarrays. But for a question like which genes are expressed in which tissue, we would probably use a one-channel array, because we just want to know how highly a gene is expressed in different tissues. So we would do one-channel microarrays for lung, one-channel microarrays for brain, and then see which genes are indicative of, highly expressed in, lung tissue versus brain tissue, for example. So the question that you have determines which microarray is best suited to answer it. The simple workflow, in a little bit more detail: we have our sample, from which we purify the RNA or the DNA, depending on what we want to look at. Generally we want RNA, because we look at gene expression. So we purify the RNA, we do a reverse transcription in which we take mRNA and create cDNA, and then we label our cDNA using a certain dye, either green or red. Once we have that, we hybridize the sample to the glass slide, wash the excess off, and put it into one of these machines with a laser that scans: it hits each dot with the laser and reads the intensity value. In the end, what we get on the computer is an intensity ratio: some genes are highly expressed and other genes are lowly expressed. So how does this actually work on a physical level? The physical thing which is happening is the fishing expedition that we're doing, because these little glass plates have probes attached to them: the probes sit in a spot, and they are little probes of DNA. "Do you also work in a lab?" I used to. I don't go to the lab that often anymore, but I was in the lab this week because our minus-20 freezer broke down, so I had to move stuff from one minus-20 to the other. I don't do a lot of pipetting anymore, but I am a trained molecular biologist; my master's was in
molecular biology, so I know my way around the lab, which is of course really good when you do bioinformatics, because you know what can go wrong in the lab. If you were never in the lab, you have no idea what people are doing there. So it's very important that you have not only computational experience but also wet-lab experience, to know what people are doing and what goes wrong. And I do help our PhD students with their experiments; I have a date in three weeks to help with a dissection of bats. I'm not doing the dissection myself, I just have to weigh different tissues and write those things down, but I do help out in the lab if it's needed. So, a microarray works because it's a fishing expedition: you have these little glass plates, and on these glass plates you have little spots of single-stranded DNA. "Pet bats?" Bats, Fledermaus, so the flying mice; we have an experiment planned. No, they're not pet bats, they're experimental bats, not pets. Cool. I like mice more, they're much more cuddly than bats, but that's about my future work. "Corona bats?" No, these are from Mexico, these are the ones with the weird nose. They also do echolocation, and we're really interested in them because they are very energy efficient. Our lab is interested in energy expenditure: energy in, energy out, where does it go? And bats are an amazing model because they are really special in research: they are completely different energy-wise. They're mammals, but they fly, and for flying you have to have a very high energy-output-to-weight ratio compared to a mouse, which just runs around. Mice can be really, really fat, but a bat can't be fat, because then it's unable to fly. And they are kind of cute, they have these little snouts and
stuff. So anyway, I guess you guys read the slide by now, at least that's what I hope. A microarray is a fishing expedition; I'm just going to say it again: it's a fishing expedition. You have single-stranded DNA probes, and you have the mRNA, which has been transformed into single-stranded, labeled cDNA. These probes will bind based on complementary sequences: you have DNA, you have ACTG, and the complementary sequence will bind to it. The more complementary base pairs between the sequence of the probe and the target, the stronger the binding will be. And that's why you have the washing step: you want to get rid of everything that does not bind very tightly, because you want more or less perfect binding. So that's how a microarray works. This is a picture of how the data looks. This is a two-color microarray; you can see that by the color of the spots, some spots are red and some are green. On this microarray, this gene here, this little DNA sequence, was highly expressed in the sample that we colored red, while this gene here was highly expressed in the healthy tissue, which we labeled green. They generally come in slides like this, so multiple microarrays are on a single glass plate, and that is because of miniaturization. Here you see that there are four microarrays in the X direction, and in the Y direction there are like 12 of them, so you have space for 48 samples. Actually, that's 96 samples, because it's a two-color microarray, so you put two samples on each of the wells. Very, very small, and very easy to work with when pipetting them on. So what can go wrong? Because when we deal with data, there are a lot of things which can go wrong. One thing which goes wrong often is that there's a little fiber or a scratch on the microarray. Of course you're trying to work as cleanly as possible, but there are always little things floating around in the air which hit the
microarray, and then you get this little scratch that you see here. There's just a little fiber on the microarray, meaning that we cannot use the probes underneath it, because we can't see the intensity, and even if we could, we wouldn't be sure that the intensity is actually correct. One of the other things which happens a lot, especially near the sides of microarrays, is that there are little air bubbles, and these air bubbles create gaps: of course we have no information in the gaps. Besides that, there's the edge effect of microarrays: probes which are near the edge tend to not function properly. And you have the background haze. If you look at real pictures, they look more like this than like these really clean microarrays: normally a microarray looks more like this, or like this. So there are a lot of issues that happen with microarrays, and when you are analyzing microarray data you have to be aware that the probes located on the extreme edge of the array generally tend to misbehave: these glass plates are etched, so there's a little bit of a ridge there, and fluid will run up to the ridge but won't normally run over it. Sometimes it does, though, and if it runs over the ridge you get this band of intensity color, so we can't use that, and of course it's not corrected for the background haze either. These things just mean that you lose data, because you can't fix a fiber and you can't fix an air bubble. But the background haze, like here, and the spatial bias, those can be more or less compensated for, because they're very common: you can look at the average intensity surrounding one of these spots, so you can do background correction or spatial normalization to get rid of these effects. If we do background correction, we want to adjust for non-specific hybridization: areas of the glass slide where DNA just
hybridized to the glass itself and not to the spot, or hybridization of sample transcripts whose sequence does not perfectly match that of the probes on the array. In the old days, say the end of the 1990s and the beginning of the 2000s, we estimated this non-specific hybridization by looking at the fluorescence level in the immediate vicinity of the probe: you would look at the probe, then look around the probe, and see what the intensity surrounding it was. But you can't do that anymore, because the spots are now so tiny and so close together that you cannot get an unbiased estimate of the effect. So what we nowadays do, because we now have 200,000 probes on a single one-by-one-centimeter plate, is use exogenous negative control spots. That means that when we make a microarray aimed at detecting human gene expression, we put some plant probes on there, which we know will not work for human transcripts. But these plant probes will still show the non-specific hybridization, so their intensity gives us an idea of what the background effect is. If you work with Affymetrix arrays, they actually have a very good system for dealing with this: they have mismatch probes. These are probes which target a sequence, but with little errors made in the sequence on purpose, so they are one or two base pairs different from the real probe. They put both probes on the array, and then you can compare the probe which fits the sequence exactly with the one that has a little mismatch; the difference between the two is the real intensity value. Spatial normalization can be done as well; it is applied when there is a strong dependence of the intensity level of a probe on its spatial location. You can see that more or less here: there is a bias, with low intensities at the bottom of the array,
and then you can use the physical position of the probe on the microarray and the average intensity of the complete row, and in the end you correct for the spatial bias. If you want to know more about this, then definitely look up microarrays and how you should do it. The only thing that I want to tell you guys is that many of these things are done in R. You have the Bioconductor repository, which has many different software tools for microarray data pre-processing, and they are all available at Bioconductor. So you can get packages for, for example, normalizing single-channel arrays, like MAS5 or GCRMA, which is GC Robust Multi-array Average, and when you have two-channel arrays, the standard way of dealing with all of these background effects and spatial effects is called loess. These are different packages that you can download and just install in R, and when you get your microarray data, which generally comes in a CEL file, you can take the CEL file, load it into R, apply the known best practice to it, and just run the algorithms to correct for these effects. So a common microarray workflow for a bioinformatician more or less looks like this. A bioinformatician is there to create these arrays; during my time here as a postdoc I've created three or four custom microarrays, because sometimes you're working with a species, like the bats that we are going to do in three weeks, that doesn't have a genomic sequence. So when we take DNA from them, we sequence the DNA and then we make special microarrays, because you can't just go to a company and say, I want a microarray for this specific species of bats; the company will say, well, we don't have those. For mice, of course, and for humans there are standard off-the-shelf microarrays that you can buy, but if you are working with some lizard which is living in the desert in a country where no one ever goes, there is not going to be a microarray for that one. So as a bioinformatician you
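To make that concrete, here is a minimal sketch of what loading and normalizing CEL files in R can look like. This is a setup template, not something we run in this lecture: it assumes the Bioconductor packages affy and limma, and the right package (affy vs. oligo, MAS5 vs. RMA) depends on your array platform.

```r
# Sketch only -- requires Bioconductor; adapt the package choice
# (affy vs. oligo) to your array platform.
install.packages("BiocManager")
BiocManager::install(c("affy", "limma"))

library(affy)
raw  <- ReadAffy()    # reads all CEL files found in the working directory
eset <- rma(raw)      # RMA: background correction, log2, quantile normalization
expr <- exprs(eset)   # probes x samples matrix, ready for further analysis
```

Once you have that expression matrix, everything downstream is the normalization and testing we discuss next.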
sometimes have to design these oligo arrays, and it's a very fun job to do. The file format that is used there is called the TDT format. With the TDT format, if you use ArrayExpress or eArray, you select which areas of the genome you want to target with probes, and it tells you which probes work well together. It's very similar to primer design; if you're interested in primer design, there's a whole lecture on my channel about how to design primers to amplify DNA. You acquire samples, you extract RNA, you do the reverse transcription, you do the PCR step, which is optional, you do the labeling, and then it starts again. This is done in a lab, so you can see that there is some lab work, but most of the work in microarrays comes after doing the labeling and the hybridization. After you do the hybridization on the microarray, you scan it, and then you get a TIFF file. A TIFF file is just an image file, like a JPEG or a PNG, but it is an image format which allows you to have very high resolution and see like 200,000 spots on the microarray. This TIFF file is then converted into a CEL file, which is a proprietary format, but this format is useful because it's much smaller. The TIFF images are generally in the order of 200 megabytes to a gigabyte, while the CEL file that this image is processed into is generally around one and a half megabytes, so that is roughly a 130- to 700-fold reduction in data size. It's a binary format, so you can't open it in Notepad++ to look at it, but it stores the intensity values, the probe sequences, the positions on the array and these kinds of things; it's just not storing the image itself. So after we get the CEL file, we extract the expression levels into a TXT file, then we do data normalization, and again we end up with just a basic TXT file, and then we do gene expression analysis
and data interpretation. I want to talk about data normalization and gene expression analysis, because those are the two things where I think we start using statistics, right? Normalization is a statistical technique, just like differential expression analysis is a statistical technique. So if you have your CEL file and you extract your expression levels, you get a TXT file which looks more or less like this. It's just a big, big matrix: in the columns you have the different samples, with the sample annotations saying this was cancer tissue, this was liver tissue, and so on, and in the rows you have your genes. Generally you don't literally have genes, right? You have probe sequences, but these probe sequences are of course targeting different genes, and you can sometimes have 5 probes targeting a gene and sometimes 20 probes targeting a gene, so then you would take the average of all the probes to get an idea of the expression of the whole gene. In each of these little cells there is one of these expression values, saying this has an expression value of 15, this has an expression value of 3, and of course 15 is much more intense than a value of 3. This is based on single-color arrays, where every probe has one intensity, but of course when you have two-color arrays you have two intensities: an intensity for the red channel and an intensity for the green channel. In that case we have the same format, but instead of storing the intensity value, what we store is the ratio; not just the ratio, we store the log 2 of the ratio. So imagine that we have three samples in two tissues and we take the ratio: we take the mean of tissue 1 and divide it by the mean of tissue 2, and then we express this as a log 2 ratio. We just take the log 2 of the mean of the first tissue divided by the mean of the second tissue. So why do we do
this? Well, we do this because the dyes have a very different dynamic range, they have very different intensity values. The one dye has intensity values which range from around 5,000 to around 20,000, while the other dye has an intensity range of around 2,000 or 2,500 all the way up to 3,000. By taking the ratio between the red intensity and the green intensity, a fully green spot gives us something like 1 to 8, and on the other hand, when we have a fully red spot, we get a ratio of 8 to 1, or 8 divided by 1. Then we take the log 2 of that ratio, because we want each step to be linear. On the raw ratio scale, if we step from 1 to 2, that step is plus 1, but if we go from 1 to a half, that is only minus 0.5, and the next halving is only minus 0.25. By taking the log 2 ratio we transform these values into a symmetric range: the log 2 of a half is minus 1, of one fourth is minus 2, the log 2 of 2 is 1 and the log 2 of 4 is 2. So now we have a linear range, because a doubling is an increase of 1 and a halving is a decrease of 1, and this means we can go either way: from yellow to a little bit green, or from yellow to a little bit red, is a similar distance. That is why we compute and store log 2 ratios for two-color arrays; of course this is a non-issue in one-color arrays. So why do we do this? Because of the different intensity ranges and because of the dye bias, the imbalance between the red and the green channel. It improves the characteristics of the data distribution: it provides us with linear step sizes, and it often makes the distribution more normal, more Gaussian. We want to have normal distributions because then we can use more powerful parametric statistics on them, so it allows us to use classical parametric statistics for the analysis, because
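You can see this linearization in a few lines of base R; the intensity numbers below are made up for the example.

```r
# Toy two-color intensities: red and green channel for four spots.
red   <- c(8000, 4000, 2000, 1000)
green <- c(1000, 2000, 4000, 8000)

ratio     <- red / green   # raw ratios: 8, 2, 0.5, 0.125 -- asymmetric
log2ratio <- log2(ratio)   # 3, 1, -1, -3 -- symmetric around zero

# A two-fold change up (+1) and a two-fold change down (-1)
# are now the same distance from zero.
log2ratio
```

Notice that on the raw ratio scale an 8-fold "red" spot sits at 8 while an 8-fold "green" spot sits at 0.125, but on the log 2 scale they are +3 and -3.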
we want to have normal distributions, instead of having these heavily skewed distributions where the intensity is 10,000 in the one probe and 100 in the other probe. It just gives us a better data distribution. So when we talk about normalization: the definition of normalization is the creation of shifted or scaled versions of a statistic or measurement. The intention is that the normalized values allow the comparison of these values across different samples or data sets in a way that eliminates the effect of certain gross influences, like environmental influences, or the influence of the dye, that the one dye is going from 0 to 100 and the other dye is only going from 0 to 2. That is why we do normalization: it is there to remove technical variation from noisy data. There are many steps in a microarray analysis: you have a PCR step, you are pipetting, and if you have two-color arrays you are merging two samples into one, and of course you can never have the exact same amount of DNA in both; there is always variance. Each of the steps in this whole microarray pipeline introduces a little bit of error. Acquiring samples and extracting RNA just has errors in there; you cannot extract the exact same amount of RNA for every sample, sometimes you have a million cells, other times you have a million and 500 cells. The differences are minor, but there are little errors, and that is why we do normalization. The assumption is that global changes across samples are due to unwanted technical variability, and so we remove these differences. But we also have to keep in the back of our minds that when we do normalization, we can lose real, biologically interesting effects which are in our data. So every time that you do normalization, keep in the back of your mind that you might also remove some real biological effects: you are normalizing to get rid of effects that you think are unwanted, but you might be throwing away
real biology. When we talk about normalization, I always have to mention that there are two types of normalization: normalization of ratings versus normalization of scores. Normalization of ratings means that you adjust values measured on different scales to a common scale; it is an adjustment where the intention is to bring a distribution of measured values into alignment with a distribution of other measured values, and I'm not talking here about a probability distribution, just about the distribution of the values themselves. Then you have normalization of scores, which is when you take your data and squeeze it into a normal distribution, and of course this has a completely different goal. Normalization of ratings: imagine measured values on one species, like tigers. If we look at the body weight of tigers, the smallest tiger that we could find is like 100 kilos and the heaviest tiger is, I don't know, 500 kilos. If you then want to compare that to cats, the smallest cat is around 3 kilos and the heaviest cat is around 9 kilos, so you can't compare these two things directly. You have to normalize your ratings; you have to normalize both to be in the same range. So you could say: well, I am not expressing it as the body weight in kilograms, but as a percentage of the maximum value that I have observed. Then you are normalizing ratings, because you are saying this tiger is 80% of the heaviest tiger and this cat is also 80% of the heaviest cat. So there is a difference between normalization of ratings and normalization of scores, and they have different effects on your data. Normalization of scores is relatively dangerous in a way, because you are taking a distribution which might not be normal and squeezing it into a normal distribution. If you imagine a histogram of a uniform distribution, if you normalize that by scores then you are kind of squeezing a
uniform distribution into a Gaussian distribution, which of course means that you are modifying your data in such a way that you might not be able to get your original data back. Normalization of ratings does not have that issue in a way, because if you know the maximum value, you can always convert 80% of the maximum back; but when you normalize scores you are really changing your underlying distribution, which might be a step you cannot reverse. So be aware of that. For example, some ways of normalization and standardization of microarrays which have been used in the past, and some of which are still used: the first is mean centering. Instead of keeping the raw value for every measurement, say we have a microarray with 200,000 measured probes, we calculate the average intensity across all of the probes, and then we say that the new expression value of a probe is the expression value that we measured minus the mean of the whole array. That is called mean centering. What this does is make sure that if we have two microarrays, both microarrays after mean centering will have the exact same mean: the mean will be 0 for array 1 and the mean will be 0 for array 2, because we just subtract the mean out of the data. A slightly better way of doing it is calculating standard scores, or Z-scores. Now we are mean centering, but then we also divide by the standard deviation that we observed in the group, so this makes every microarray have a mean of 0 and a standard deviation of 1. So we are squeezing our distribution again: instead of having values which range from 0 to 15, we now have values in a range of roughly minus 2 to plus 2, but every array has a mean of 0 and a standard deviation of 1. Then there is still rank normalization, which is a rank-based method. Here you
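Mean centering and Z-scores are easy to see in base R; the two toy "arrays" below are simulated on deliberately different scales.

```r
set.seed(1)
# Two toy "arrays", each with 1000 probe intensities on different scales.
array1 <- rnorm(1000, mean = 10, sd = 2)
array2 <- rnorm(1000, mean = 14, sd = 5)

# Mean centering: subtract each array's own mean -> both means become 0.
c1 <- array1 - mean(array1)
c2 <- array2 - mean(array2)

# Z-scores: also divide by the standard deviation -> mean 0, sd 1.
# (scale(array1) does the same thing in one call.)
z1 <- (array1 - mean(array1)) / sd(array1)
z2 <- (array2 - mean(array2)) / sd(array2)

round(c(mean(z1), sd(z1), mean(z2), sd(z2)), 10)
```

After Z-scoring, both arrays live on the same scale regardless of how different their raw means and spreads were.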
rank every probe in order: you are saying this probe is the lowest in intensity, this one is the second lowest, this one is the highest, this one is the second highest. So instead of using the real intensity value, you ignore the intensity and just rank the individual probes from lowest to highest intensity; the lowest intensity gets a value of 0 and the highest intensity gets a value of 200,000, or whatever the number of probes on the array is. And of course, because every array that you are using has the exact same number of probes, the mean of the arrays will also be the same, because the same values are in both data sets. For microarrays I generally prefer quantile normalization. Quantile normalization is very similar; it is also a rank-based method. It allows you to normalize across different samples, it is available in the preprocessCore library, and you can say normalize.quantiles on your expression matrix, where the expression matrix has samples in the columns and measured probes in the rows. Here we see some real data, which we will be working with in the assignments, and this is what we see when we look at the raw intensity values that we get from the microarrays: every array has a slightly different mean, and some arrays also have some outliers on them, because we see from the quantiles that this array probably had something going wrong, and here we see that as well. When we do quantile normalization, we are going to say: every array gets the exact same average value, every array gets more or less the same standard deviation, and it gets the same quantiles. We force the arrays into the same distribution, which is what we are seeing here. And that is why I wanted to point out that when you are normalizing, you might lose real biological effects. Here we have different microarrays, and you can see from the names that we have HT and GF. HT is hypothalamus, so
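The idea behind quantile normalization fits in a few lines of base R. This is a minimal hand-rolled version to show the mechanics; preprocessCore::normalize.quantiles does the same thing with more careful tie handling, so use that package for real data. The matrix values here are invented.

```r
# Samples in columns, probes in rows (toy values).
m <- cbind(s1 = c(5, 2, 3, 4),
           s2 = c(4, 1, 4, 2),
           s3 = c(3, 4, 6, 8))

ranks     <- apply(m, 2, rank, ties.method = "first")  # rank within each sample
row_means <- rowMeans(apply(m, 2, sort))               # mean of each quantile
qnorm_m   <- apply(ranks, 2, function(r) row_means[r]) # map ranks back to means

# Every column now contains the same set of values (in a different order),
# so all samples are forced into exactly the same distribution.
apply(qnorm_m, 2, sort)
```

The rank step preserves the ordering within each sample, while the shared quantile means are what make every array end up with identical mean, spread and quantiles.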
it is a section of the brain which we took out and put on a microarray, and GF is gonadal fat, which is fat tissue from around your gonads, which we also took out and put on a microarray. But if you look closely at this plot, you can see that the HT samples have a mean which is higher than the gonadal fat samples, and biologically this makes sense: in brain there are more genes active than in fat tissue, and because there are more genes active, more of the spots on the microarray will be on. By quantile normalizing we remove this whole effect. We just have to do this, otherwise we cannot compare hypothalamus to gonadal fat, but by doing so we are actually artificially turning off some of the hypothalamus genes. Genes which are actually expressed, we push them down, because we want to give every array the same mean value; but by pushing down the values of the hypothalamus arrays, we are artificially turning off some of the genes that are actually on in that tissue. So we are removing some real biological differences by normalizing, and this always happens. With every normalization step that you do, you have to be aware that you might lose real, significant, interesting biological effects, or other effects; it's not just biology, it's the same for sociology, it's the same for political science. You are measuring stuff and comparing across different conditions or across different countries, and as soon as you start normalizing your data, you might lose real interesting effects. So like I said, when we do gene expression profiling, the goal is often to identify genes that are differentially expressed between different groups, be it different tissues, be it healthy versus diseased. We can use many different statistical methods to analyze whether a gene is differentially expressed. If we have two groups we could just
say we are going to do a t-test. If we have many groups, we might use an ANOVA, an analysis of variance, where, if we have five groups, we look to see if any of these five groups differs from the other four. But we can also use things like rank product, which is a non-parametric test. The non-parametric tests have the advantage that they are not sensitive to outliers, but one of the drawbacks is that they don't have the same statistical power as a parametric test like an ANOVA or a t-test. So each test has different applications, and it has pros and cons, and generally you want to define already in the planning of your experiment which statistical test you are going to use, because the statistical test determines how much power you have and how many samples you are going to need. And if you know what kind of effect you are looking for before you do the experiment, you can even simulate your experiment. That is the reason why I always like to talk about microarrays: there are many of these different angles that come into doing microarray experiments. So what we are doing here, if we are looking for differentially expressed genes, is actually testing a hypothesis. We have a hypothesis, which is, for example, that a gene is expressed in brain and not expressed in your heart. Statistics are used to determine significant effects: a result is called statistically significant if it is unlikely to have happened by chance alone. What we are hypothesizing is that there is a probability distribution, and that we observe a data point somewhere in this probability distribution. The probability density is of course highest at the average value, and the further you get away from the average, the lower the probability that this will happen. So in the end, you have an observed data point, you
have this hypothetical probability density, and then you want to see how far away from the mean of the distribution this observed data point is; if it is far away from the mean, then we call it more likely to be significantly changed. So probability is a measurement of deviation from random, and there is a difference between probability and likelihood. Probability is a deviation from random effects, because we assume that if we just do a measurement, for example we take 100 people and let each of them weigh the same mouse, we get 100 slightly different values, but the value that we get most often is the real weight of the mouse. Sometimes it will be a tenth of a gram less, sometimes a tenth of a gram more, so if you take 100 people to measure something, you will get a probability distribution, and the real value is the middle of the distribution. What we are doing in statistics is that we have a null hypothesis. Our null hypothesis generally is: there is no difference; the difference that we observe from the one sample to the other sample is just based on random variance, measurement variance. And we reject this null hypothesis if the point at which our observed value lies is in the extreme tail of very unlikely observations; then we call it significantly different. This is judged against a pre-defined significance level, and in biology we agreed that our significance level is going to be 5%. This is different for each field: in biology it is 5%, in physics it is more like 1 times 10 to the minus 5, so 0.00001, because physicists want to be much more certain than biologists. Biologists don't really care too much about the significance; it's just about a tendency, and as long as we see that the difference is big enough, we are going to believe that it's true. But in particle physics this is not how it works; there you want to have a million observations telling you that it is significant,
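You can make the "how far into the tail is my observation" idea concrete with a toy simulation; the mouse weights below are invented.

```r
set.seed(42)
# Null hypothesis: measurements are just noise around a true weight of 20 g.
null_weights <- rnorm(100000, mean = 20, sd = 1)

observed <- 23  # one observed measurement, 3 standard deviations away

# Empirical two-sided p-value: how often does random variation alone
# produce a value at least this far from the mean?
p_emp <- mean(abs(null_weights - 20) >= abs(observed - 20))

# Theoretical normal tail area for comparison.
p_theory <- 2 * pnorm(-abs(observed - 20) / 1)
c(p_emp, p_theory)
```

Both numbers land well below the biologists' 5% threshold, so under the null hypothesis an observation this extreme is very unlikely, and we would call it significant.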
that you want a million observations that X is higher than Y before you are going to say that X and Y are different. In biology that's not the case: if we have 20 measurements and 19 of them show that it's higher, then we are going to believe it. So, 5%: the smaller the p-value, the larger the significance, or the more significant an effect is, a little bit weird English, right? So statistics are used to determine significant effects, and likelihood is about whether the differences that you observed are real, so unlikely to have occurred by chance alone. There is a difference between p-value and likelihood: the probability is the measurement of the deviation from random, and a p-value is a fixed number, but the likelihood that something is true always comes with an error margin, always comes with a confidence interval. I hope I explained that well, but there are many other sources that you can use to find the exact difference between what probability is and what likelihood is. So the likelihood relates to the probability of something happening, and if you observe it, then you can attach a p-value to it. Good, so let's start with the first statistical test; I've been talking for 2 hours and 40 minutes and we haven't done a single statistical test, right? The most basic statistical test that you can do was developed by William Sealy Gosset, who you see here. Not a lot of people know his name, because he published all of his papers under the pseudonym Student: when he wrote a scientific paper, the author list would just mention Student. The t-test is there to find a difference between two groups, and it can only handle two groups, it cannot do three. In R it's called the t.test function,
and there are two ways to do a t-test, two hypotheses that you can test. You can have a one-sided test, where you ask if the A group is smaller than the B group, or if the A group is larger than the B group, and this is a more powerful test. But before we can do a one-sided t-test, we have to have a hypothesis; we have to have data collected beforehand that makes us expect group A to be smaller than group B. For example, in our group we study the Berlin Fat Mouse, and the name of the mouse already gives you the implicit hypothesis that the Berlin Fat Mouse is heavier than a standard laboratory mouse. So when we test whether the Berlin Fat Mouse is heavier than the standard mouse, we use a one-sided t-test. And why are we allowed to use a one-sided t-test? Because of all our observations in the past, which we did using two-sided t-tests: when we first got the mouse, we didn't know if it was heavier or lighter than the standard laboratory mouse, so the first time we did the test two-sided and figured out: yes, it is different from the standard mouse. After figuring out that it is different, we looked at the direction of the effect and saw that it is consistently heavier than the standard laboratory mouse. So now, when we t-test the Berlin Fat Mouse against the standard laboratory mouse, we use a one-sided t-test; but we are only allowed to do this because we have prior data giving us a reason for the hypothesis that the fat mouse is heavier than the standard mouse. If you do not have this prior data, you have to use a two-sided t-test, where the hypothesis is just that group A is different from group B. Now, t-tests actually don't come in just one flavor. The one-sample t-test is when we have one group of observations: I told you that a t-test tests a difference between two groups, but that's not entirely true. If you have one group of measurements, you can also test if its
mean is different from a specified mean, and we see the formula to test this here. So we can do a one-sample t-test, which can be one-sided or two-sided as well. With this one-sample t-test, if we have measurements of this mouse, we can ask the question: is the weight of this mouse 5 grams, or is the weight of this mouse 40 grams? We are comparing a single group with a value that we want to test against. So a t-test can be done on a single group or on two groups: you can compare one group to another group, or you can compare a group to a mean that you expect it to be. How do we do that? We calculate the mean, we subtract the mean that we expect it to be, and then we divide by the standard deviation divided by the square root of n. That gives us a t-value, and then we look into the t-distribution to see where our t-value falls, and this t-value then gives us the p-value, which is the area under the curve: the fraction of observations in the t-distribution which are more extreme than our observed data point. The two-sample t-test is the standard Student's t-test. That is when we have, for example, the effect of a medical treatment: we have 100 people, 50 people are in the control group and 50 people get the treatment, so 50 placebos, 50 treatments, and then, if we do it unpaired, we want to see if the controls are different from the treatment group. In this case we preferably want the two groups to be equal in size and to have the same variance. We can also have the paired-sample t-test; this is not called a Student's t-test, this is called a paired t-test, and this is when we have repeated measurements. For example, we have 50 people in our experiment; we measure them before we give the treatment, then we give them the treatment, and then we measure them after. That means that for every independent sample we have
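In R, the one-sample version is just t.test with a mu argument; the mouse weights below are simulated for illustration.

```r
set.seed(7)
# Ten weight measurements of one mouse (toy data, true mean about 25 g).
weights <- rnorm(10, mean = 25, sd = 1)

# One-sample, two-sided: is the mean different from 20 g?
tt <- t.test(weights, mu = 20)
tt$p.value

# One-sided version -- only allowed with a prior hypothesis of "heavier":
t.test(weights, mu = 20, alternative = "greater")$p.value
```

Because the data average around 25 g, the test against 20 g gives a very small p-value, and the one-sided p-value is half the two-sided one when the effect is in the hypothesized direction.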
two observations, while in the unpaired case for every independent sample we only have one observation: we only have the after-treatment or after-placebo measurement. But in a paired sample we have the before and after treatment measurements, and it is generally better to do a paired-sample t-test, because pairing reduces or eliminates the effect of confounding factors between individuals. If your baseline measurement is 5, then after treatment you might go to 7; if someone else has a baseline of 7 before treatment, after treatment it will be 9-something. So a paired-sample t-test is generally more powerful than an independent, unpaired t-test. Now, because t-tests are parametric tests, they assume that your input data follows a normal distribution. A parametric test is defined as a statistical test that assumes that the input data follows a known distribution; in the case of a t-test this distribution is a Gaussian distribution. You also have statistical tests which assume that your data follows a Poisson distribution, or a uniform distribution; these are still parametric tests, there is just the assumption of a different known distribution from which your data is drawn. So how do we validate these assumptions? These are the assumptions for a t-test. When I do a t-test, each of the two groups that I'm measuring should follow a normal distribution, and you can test normality using a Shapiro-Wilk test: in R you can run shapiro.test on the measurements of a single group, and this will tell you if the group deviates from a normal distribution or not. A t-test also generally assumes that both groups have the same variance. The default t-test in R actually does not assume that: the Welch t-test, which is the default, does not assume equal variances. But you can check whether both groups have the same variance by doing a Levene test, which is
available in the car package. If you do a Levene test on your groups, it will tell you whether both groups have equal variance, yes or no. If it's not the case, then you have to do a Welch t-test, which is the default in R; and if both groups do have the same variance, then you can use a more powerful t-test, because you can specify in t.test itself that var.equal is TRUE. Of course you may only specify that the variances are equal when you have done a Levene test; if you have not done a Levene test, you should do a Welch t-test. Another assumption of the t-test, which people generally don't really check, is that samples should be random. Is there also a way to test for other distributions, like Poisson? Yes, of course: for many of the standard distributions, like Poisson, uniform and normal distributions, there are corresponding tests. So if you know that your data follows a Poisson distribution, just look up what the Poisson equivalent of a t-test is for two groups. Also, when we start talking about ANOVA and linear regression, you can specify what the underlying distribution of your data is, but a t-test always assumes a normal distribution. And the variances: a Student's t-test assumes that the variances are equal; if not, you are doing a Welch t-test. But for Poisson there is also an equivalent, and likewise for uniform and other distributions. One of the other assumptions, which is an assumption for more or less all statistical tests that you do, is that samples are random and independent. That means that if you design your experiment, for example: I developed a new drug and I want to test this drug, so I go to a hospital and I say to the hospital, people coming in with this affliction, for example a skin rash, when they come in you give them
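Here is what that workflow looks like in R on simulated data: check normality per group, then run the default Welch test and, if a Levene test justifies it, the classical equal-variance Student's test. The group values are invented; car::leveneTest is only mentioned in a comment so the sketch stays base-R.

```r
set.seed(3)
control   <- rnorm(30, mean = 30, sd = 3)   # toy body weights, group 1
treatment <- rnorm(30, mean = 34, sd = 3)   # toy body weights, group 2

# Normality per group (Shapiro-Wilk); p > 0.05 means no evidence
# against normality, so a t-test is defensible.
shapiro.test(control)$p.value
shapiro.test(treatment)$p.value

# Welch t-test: the default in R, no equal-variance assumption.
t.test(treatment, control)$p.value

# If a Levene test (car::leveneTest on the two groups) says the variances
# are equal, you may use the more powerful classical Student's t-test:
t.test(treatment, control, var.equal = TRUE)$p.value
```

With a true difference of about 4 grams and 30 animals per group, both versions of the test come out clearly significant here.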
this drug or you give them a placebo, and then you send the data to me, and after you have done 100 or 200 patients, I am going to do the test and say whether my drug works or not, because it reduces the rash; that's the goal. If you do your experiment like this, then you are breaking the randomness assumption, because people going to a hospital are not a random subset of the population: people going to a hospital are sick, and sick people are not representative of the whole population. So you should always make sure that you take your samples randomly, and this is what often goes wrong in research, or not so much goes wrong, but what people cheat on. If you have to review scientific papers in the future, because you start doing a PhD or you continue in science, always be suspicious that people are cheating with this assumption, because in science people want to find significant effects: those are the things that you can publish. Doing an experiment which cost a million euros and finding nothing does not allow you to publish, and people want to publish their results. In many cases, and especially in human research, these assumptions are broken because they do not randomly sample people from the population. They say: this thing that we found is reducing your body weight, so it helps you lose weight, and then it turns out they only tested it on people who had a BMI above 30. That is not a random subset; it just means that the disclaimer on your diet pills should say: this diet pill works when you have a BMI above 30, because they only sampled people with a BMI above 30. And the samples should be independent: people that are in your study are not allowed to be related to each other. If you have 100 people coming into the hospital, and 5 of them happen to be related to each other, then of course you are breaking the independence rule. So when you read scientific papers in the future, be aware that people cheat, especially on this assumption: their samples are not random, they are
selected for something, and they don't mention that; they make it look like their results are universally true for all humans. "Our pill helps you to reduce weight", and then you look at how they designed their experiment, and only males with a BMI above 35 participated. That means you can't say anything about how this pill is going to perform in people who are female, or who have a different BMI, say under 30. The independence assumption is the same story: if you are genetically related to someone else, then your phenotypes are also very similar to each other, so it's easier to find significant effects. To come back to the first assumption, normality: in a t-test your data should follow a normal distribution, not just in the one group but also in the other group. If your normality assumption does not hold, then you should switch to a non-parametric test, because a non-parametric test does not assume that your data follows a certain distribution; often these methods rely on ranks. So instead of a normal distribution you have some kind of weird paranormal distribution, which is not normal but close to normal. If your Shapiro-Wilk test fails and tells you that your data is not normally distributed, you have to switch to non-parametric statistics. This is a pain, because non-parametric statistics are very underpowered, statistically speaking, meaning that you have to recruit many more people into your study before you can get significant effects. And that is where the ghost comes back: the ghost comes back with the paranormal distribution. So, one example: the non-parametric equivalent of a t-test is the Mann-Whitney U test. It is the same idea as a t-test, you can test two groups against each other, and you do it in R like this: you say wilcox.test. I don't know why, because the test is officially called Mann-Whitney U, but in R it's called wilcox.test, and I think that has
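A quick sketch of that normality check in R, using the Shapiro-Wilk test mentioned above (the data here is invented, deliberately skewed to show the non-normal case):

```r
# Deliberately non-normal (exponential) data
set.seed(42)
x <- rexp(30)

# Shapiro-Wilk test: a small p-value means normality is rejected,
# which for skewed data like this it typically is
shapiro.test(x)

# If normality is rejected, switch to the rank-based route,
# e.g. wilcox.test() instead of t.test()
```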
to do with the person who invented it, or the people who came up with the test and the one who wrote the code for it; but in R, the Mann-Whitney U test is implemented as wilcox.test. You can write it down in the formula way: you have your numeric measurements, say 15, 17, 13, 106, and then A is your Boolean factor, TRUE or FALSE, where FALSE means that you're in group A and TRUE means that you're in group B. You can also just split your numeric measurements into two groups: x holds the individuals that were in group A and y the individuals in group B. So you can use the formula specification, but you can also just throw in two vectors of values, x being the A group and y the B group. And because it is the equivalent of a t-test, you can also say that things are paired. If you say they are paired, then it is assumed that the first measurement in x is measured on the same individual as the first measurement in y, the second measurement in x on the same individual as the second measurement in y, and so on. So there is also this signed rank test, which means that you have paired samples, repeated measurements, just like in a t-test: before the drug and after the drug. Then the non-parametric equivalent of an ANOVA; I didn't explain the ANOVA to you yet, because there will be two lectures about regression, but there is non-parametric regression as well. If you have more than two levels, so more than two groups, then you can do a Kruskal-Wallis test: kruskal.test on your measurements, where A is just a factor with different groups, group A, group B, group C, group D. Here the underlying hypothesis is: are all groups coming from the same population? So the question is: is there any group which is different from the other groups? So, just as the
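Here is a minimal sketch of the three ways of calling wilcox.test described above (the measurement and before/after numbers are invented placeholders):

```r
# Invented measurements for two groups of four individuals
measurements <- c(15, 17, 13, 16, 14, 19, 22, 20)
group        <- factor(rep(c("A", "B"), each = 4))

# 1. Formula interface:
wilcox.test(measurements ~ group)

# 2. Two-vector interface (x = group A, y = group B):
wilcox.test(x = measurements[group == "A"],
            y = measurements[group == "B"])

# 3. Paired (signed-rank) version: the first value of 'before' and
#    the first value of 'after' come from the same individual
before <- c(80, 75, 90, 85)
after  <- c(78, 70, 88, 80)
wilcox.test(before, after, paired = TRUE)
```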
Mann-Whitney U test is the equivalent of the t-test, the Kruskal-Wallis test is the equivalent of the ANOVA. Of course there are other non-parametric tests. For example, the Friedman test is for when you have a randomized block design, which is when you have a grouping factor and a blocking factor. What is a blocking factor? A blocking factor is when you run an experiment across, for example, different fields. Imagine that I have potatoes: I measure the yield of each potato plant, and these plants belong to different types, say red potatoes, yellow potatoes, and potatoes with these little knobbles on the side. If I measure them across different fields, then of course I have repeated my experiment multiple times, and then I can use a Friedman test. A Friedman test is specified like this: I have my numeric data, the yield of the potato plants, so the amount of potatoes that I got; A is the group that I'm interested in, the type of potato, because I might want to know whether red potatoes give more yield than yellow potatoes; and B is my blocking factor: I did the experiment on five different fields, so I can do a Friedman test using the different fields and compensate for the fact that the mean could be different at each field. The Friedman test is again similar to an ANOVA; ANOVA also allows you to do blocking, but this is the non-parametric equivalent. So all three tests, the Mann-Whitney U, the Kruskal-Wallis, and the Friedman test, are non-parametric tests: they do not assume normality, which the t-test does. Good. I just wanted to have a couple of slides about correlation. I know it's already five, but I didn't prepare a third break, so I think I'm just going to continue; there are only 11 slides left, so we are going to run through them quickly. I think many people know what correlation is: it's a
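The potato example above can be sketched in R like this (all yields, potato types and field names are invented; note that friedman.test expects exactly one observation per group/block combination):

```r
# Invented yields for three potato types measured on three fields
yield  <- c(12, 14, 11, 20, 22, 19, 15, 16, 14)
potato <- factor(rep(c("red", "yellow", "knobbly"), each = 3))
field  <- factor(rep(c("field1", "field2", "field3"), times = 3))

# Kruskal-Wallis: is any potato type drawn from a different population?
kruskal.test(yield ~ potato)

# Friedman test for the randomized block design: potato type is the
# grouping factor, field is the blocking factor
friedman.test(yield ~ potato | field)
```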
measure of dependence between two variables, and in R you use the cor function. Why do we deal with correlations? You have the famous saying that correlation is not causation, which is true, but I always like to flip it around: correlation does not mean that there is causality, but causality cannot occur without correlation. Things which are correlated are not necessarily causally connected, but things which are causally connected are correlated with each other; so the opposite is also kind of true, and I think that's interesting. But since "correlation is not causation" is true, why do we deal with correlations? Well, correlations are very, very useful, because a correlation indicates that if you measure a, you can predict what b is going to be. It's a predictive relationship, and you can exploit it in practice, and some correlations are very strong and can be exploited very significantly. There are two different types of correlation. The Pearson correlation is the parametric correlation: it doesn't do any transformation, it is fast, but it's sensitive to outliers and it has a normality assumption. Not only that, it also has the same-variance assumption, like the t-test. So if you want to do a Pearson correlation test, check whether your two variables are normally distributed: if they are, you can do Pearson; if they're not, you have to use Spearman, which applies a rank-based transformation. It's slower, but it's more robust. And be aware that even when there is no correlation, there might definitely still be some dependence between the two variables. This is one of those pictures that I took from Wikipedia: all of these are distributions which have no correlation, the correlation coefficient is zero, but you can see that there is a very defined structure when you look at x versus y. Here, x versus y forms something like two smileys on top of each other,
so there is definitely structure; it's just that this structure cannot be assessed by correlation. Let me do the example first: ice cream sales and temperature. There is a massive correlation between ice cream sales and temperature. The question is: what is the causality here? Is there causality at all? Does more ice cream being sold lead to a higher temperature, or does a higher temperature lead to more ice cream being sold, or is there another hidden factor which is making these two things correlated with each other? But very basically, if you look at temperature and you look at sales, then you see an almost perfect correlation between ice cream sales and temperature. So how do we compute correlation? This is the formula: it's the sum over all measurements, where you take the measurement minus the mean in the one group, times the measurement in the other group minus the mean of the second group, and then you divide by the square root of the one sum of squares times the other sum of squares. That won't mean a lot to most people, so how do you do this if you want to compute a correlation on paper? You have your temperatures and you have your sales. You calculate the mean of the temperatures and the mean of the sales. Then you define this a term, x_i minus x-bar, which is just the temperature minus the mean temperature, and you define your b term, the sale minus the mean sale. This is mean centering, because the new value is the old value minus the mean; so these are the mean-centered values of the temperature and the mean-centered values of the sales. You multiply these two together, and you also take each of them to the power of two, so you get the squared distances, the same thing for the b values. You sum these up and you use these sums in the formula: you say 5.325 divided by the square root of this times this, and then
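The by-hand computation above can be written out in R and checked against the built-in cor function (the temperature and sales numbers here are invented, not the ones from the slide):

```r
# Invented temperature / ice cream sales data
temperature <- c(14, 16, 20, 23, 27)
sales       <- c(210, 250, 310, 390, 420)

# Mean centering: new value = old value minus the mean
a <- temperature - mean(temperature)
b <- sales - mean(sales)

# Pearson's r: sum of the products divided by the square root of
# the product of the two sums of squares
r_by_hand <- sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# The built-in function gives the same answer
r_builtin <- cor(temperature, sales)
all.equal(r_by_hand, r_builtin)
```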
you get your correlation coefficient. Relatively simple to compute; anyone can do that by hand. So, the Pearson correlation, like I told you, doesn't do any transformation on your data, but the variables have to be linearly related to get a perfect correlation coefficient; it is sensitive to outliers, it has a normality assumption, and it's only reliable when you have something like 30 measurements or more. Next to that you have the Spearman correlation, which applies a rank-based transformation, and in that case you can have a perfect correlation, so a correlation of 1 or minus 1, when the variables are monotonically related. What does that mean? Monotonic means that with an increase of the one there is an increase of the other. Linearly related means that an increase of one unit of x causes an increase of a certain fixed amount of y units, always the same. Here you can see data that is monotonically related: an increase in x causes an increase in y, but how much depends on the current value of x; stepping from 0.4 to 0.6 increases y, but not as much as stepping from 0.0 to 0.2. That means that on this plot the Spearman correlation is going to be 1 and the Pearson correlation is going to be 0.8, because Pearson draws a straight line and asks whether the dots are on the straight line. All right, we looked at testing; you can also do a cor.test, so you can test whether a correlation is significant or not. But when we test microarrays, when we test gene expression data for significant differences, so: is gene A significantly different between conditions? We agreed on this p-value: we agreed that an observed difference is significant if it would have happened at most one in 20 times by chance. In this case you can already see why this is not good enough for physicists, because this means that we would agree that gravity is real if we would release an iron ball
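A small sketch of that Pearson versus Spearman difference on monotonic but non-linear data (the curve here is an invented example, not the one from the slide, so the Pearson value will not be exactly 0.8):

```r
# Monotonically increasing, but not linear
x <- seq(0, 1, by = 0.1)
y <- x^4

# Spearman only looks at ranks, and the ranks agree perfectly:
cor(x, y, method = "spearman")   # exactly 1

# Pearson asks whether the points lie on a straight line, so it
# comes out below 1 here:
cor(x, y, method = "pearson")

# cor.test() additionally gives a p-value for the correlation:
cor.test(x, y, method = "pearson")
```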
and it would fall to the ground 19 times, and one time it would fly up into space, a biologist would still say gravity is real. Of course that doesn't really make much sense: for a physicist, this ball can only fly into space one in a million times, but for a biologist, every 20th drop flying into space is good enough to accept that gravity is real. Now, multiple testing means that while for a single test we agree on this p-value, we agree that one in 20 is by chance, here we perform many of these tests. When we do a microarray with 20,000 genes on it, we do 20,000 tests, and as the number of comparisons increases, it becomes more likely that the groups being compared will differ just because of some randomness. If we want to preserve this one-in-20 threshold, we need to compensate for the number of tests that we perform. The simplest correction that you can do is the Bonferroni correction, and that just means: well, we did 20,000 tests, so now our p-value needs to be smaller than 0.05 divided by 20,000. That means that only when a p-value is below 2.5 times 10 to the minus 6 are we going to believe that it is significant. There are two ways of thinking about multiple testing, because we do thousands of statistical tests. We have to talk about the type 1 error: a type 1 error is saying that a gene is significantly changed even though it's just by chance, and we can avoid this by using the Bonferroni correction. Bonferroni protects us from making this error of saying that two things are different when they actually are not. We can also optimize for not missing a significantly changed gene, because Bonferroni is relatively restrictive: it might be that among the genes that we say are not different, there are actually genes which are different. So we can also optimize and say: we want to miss at most 5% of the genes which are differentially expressed, and
that is called the type 2 error: missing a significantly changed gene, so a gene is significantly different but we say that it's not. This can be controlled with the Benjamini-Hochberg false discovery rate. In R, we can simply use the p.adjust function. The p.adjust function takes as input the p-values that we observed; we then tell it which procedure we want to use, like Bonferroni, and we tell it how many tests we did. This will correct the p-values for the fact that we performed, say, 10 tests in total, where 10 is the number of tests that could be performed on the genes we measured. And the number of statistical tests is always the maximum number of tests that you could have done. If we measure 20,000 genes, do one test and then say: oh, the p-value is 0.01, so this is significant, that is not valid, because we could have done 19,999 other tests. So n is not the number of tests that you actually did, but the number of tests that you could have done. Adjusted p-values smaller than 0.05 are considered significant. So how do you write down these p-values in papers? The p in "p =" is always italicized, so make sure that you do that. And you never use a 0 before the decimal point when you write down your p-value: you write p = .05, not p = 0.05, because a p-value cannot be 1 and it cannot be 0, it is always a value strictly between 0 and 1. The p-value cannot be 0, because that would mean that you have an infinite number of observations and all observations show a difference; the same holds for 1, which would mean an infinite number of observations that all show exactly the same thing. That is something to keep in the back of your mind. If you write down a p-value which you have adjusted, show the adjustment method: you write p_BF for Bonferroni, p_BH for Benjamini-Hochberg, and p_FDR when you used the FDR method to correct your p-value. So always, when
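A quick sketch of p.adjust with both corrections (the raw p-values below are invented, standing in for 10 tests):

```r
# Invented raw p-values from n = 10 tests
p_raw <- c(0.001, 0.004, 0.01, 0.02, 0.03, 0.04, 0.05, 0.2, 0.5, 0.9)

# Bonferroni: strict, protects against type 1 errors.
# It is simply p * n, capped at 1, so 0.001 becomes 0.01.
p.adjust(p_raw, method = "bonferroni", n = 10)

# Benjamini-Hochberg false discovery rate: less restrictive,
# aimed at not missing truly changed genes (type 2 errors)
p.adjust(p_raw, method = "BH", n = 10)
```

Remember: n is the number of tests you could have done, not just the ones you happened to run.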
you write down a p-value, italicize the p and keep it in lower case, and when you have adjusted your value, tell the reader what adjustment method you used. You also want to keep the number of digits consistent in your paper: if you wrote your first p-value as 1.0 times 10 to the minus 10, using two digits, you do the same thing for the next value, and the next, and the next. It is really annoying when you look at tables in papers and they sometimes use three digits behind the decimal point, sometimes two, sometimes five; that's wrong, always use the exact same number of digits throughout your paper. Of course, this only holds when you use scientific notation, with a times 10 to the minus something in there; if you write 0.05 or 0.001, like in this case, without scientific notation, then of course the number of digits varies. When you write down p-values in plots, it is generally preferred to use symbols: we use the dot for suggestive, when the p-value is between 0.05 and 0.1; we say that something is significant with a single star when it's between 0.01 and 0.05; two stars is when it's between 1 in 1,000 and 1 in 100; and we give three stars when it's between 1 in 10,000 and 1 in 1,000. Then, microarray data: I'm going to give you some microarray data today, so the assignments come with a small microarray data set. If you think: I want to do more R coding, I want more microarray data sets; or: I'm interested in cancer, lung cancer tissue versus healthy tissue; or: I'm interested in plant development, so I want to compare Arabidopsis Landsberg erecta versus Arabidopsis Columbia plants and I want to know which genes differ between them; then here are two sources where you can get free microarray data. Just as a ballpark estimate, a microarray costs somewhere between 40 and 80
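Those plot symbols can be produced in R with the base function symnum, using the same cutpoints described above (the p-values below are invented examples):

```r
# Invented p-values to annotate
p <- c(0.2, 0.08, 0.03, 0.005, 0.0005)

# Map p-values onto the conventional symbols:
#   (0.05, 0.1]  -> "."  suggestive
#   (0.01, 0.05] -> "*"
#   (0.001,0.01] -> "**"
#   (0, 0.001]   -> "***"
stars <- symnum(p, corr = FALSE, na = FALSE,
                cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                symbols   = c("***", "**", "*", ".", " "))
stars
```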
dollars. If you go to the Gene Expression Omnibus, which is run by NCBI, just type "gene expression omnibus" into Google and press enter, they have 25,000 experiments stored, with something like 600,000 microarrays which you can retrieve for free. That means that, at say 50 dollars apiece, just a single website is giving you something like 30 million dollars worth of data, without you having to spend any money. Then you have ArrayExpress, which is similar to the Gene Expression Omnibus: they have slightly fewer experiments but more arrays. The nice thing about ArrayExpress is that they have the Gene Expression Atlas, a subset of their data which is manually curated and re-annotated: they looked at the data and made sure that each microarray was really done on the tissue that the scientist said it was. Anyone can upload data into the Gene Expression Omnibus and anyone can upload data into ArrayExpress, but the Gene Expression Atlas, this small subset of experiments in ArrayExpress, is manually curated and re-annotated, so it is really, really high-quality data. And the nice thing is that ArrayExpress provides not only storage and retrieval but also analysis, so you can already click around on the website, compare different experiments, compare different arrays, and get a feeling for whether there is really something there. I would invite anyone: if you are a little bit interested in biology and you want to do your own research, and not like the people with all of these conspiracy theories who say "do your own research" and mean "go on Facebook and read this", go to the Gene Expression Omnibus, download the raw data, and do your own analysis; or go to ArrayExpress, download microarrays, and do your own analysis. These are really good resources, so definitely use them when you have the chance. Good, that is it for today. Are there any questions? I have to wait a little bit, of course, because we are a little bit behind on YouTube; I can see more or less what
you guys are seeing, so shoot. Perfect, all right, no questions from the video. Who is still here? I see seven viewers still here, so that includes me, my moderator, and Misha, who just hangs around. "What about when you have negative and positive data in the same file?" What do you mean by negative and positive data, Shas? What does negative data mean? Are you talking about negative and positive values? Because that's not an issue, as long as the data follows a normal distribution; a normal distribution has negative values and positive values as well. "I love it when you call me moderator." Well, you are; it's just that you didn't have to do anything. "Like a cyclic experiment with data on the minus y axis and minus x axis, so the CDF is less than zero in the middle." Then you mean: what kind of statistics should you use? In that case, if you do not know the probability density function, if you don't know that your data is normally distributed, for example, then of course you can't use parametric statistics, so you have to use the Friedman test or the Kruskal-Wallis test, because in the end the values that you measure are just values that you measure, in a way, even if the measurements tend to be negative. "Pizza doesn't go into a microwave." Yeah. Well, in theory you don't have to switch to non-parametric, as long as you know what kind of distribution your data follows and the test that you use supports that distribution. There are many different distributions; think about linear models, ANOVA and GLMs. For example, if you have a binary outcome, zero or one, think about an animal dying or not within a certain time period: this outcome is not a quantitative variable, it's a binary variable, but also for binary variables there are statistical tests which deal with that. You have tests which can deal with Poisson distributions, beta distributions, gamma distributions. So as
long as you can get an estimate of the distribution that your data follows, you can usually shove it into a generalized linear model, more or less; but we will have a whole topic on generalized linear models, so that is fine. "Is there a way to determine significant values in one data set which differ so much from the other values that they cannot be random?" No, because those are called outliers, Leonardo. Generally, if your distribution of values is a normal distribution, we consider values which are three standard deviations away from the mean to be outliers. We generally deal with those by either leaving them out, setting them to missing, or winsorizing them into the distribution, like we talked about last time. Because, well, it depends on what you call a data set: if you have done a single microarray, and you go back to this little microarray example that I showed, you see that it shows these outliers, these values which are very, very different from the distribution that we see on this one microarray. These are gene expression values which are too high relative to the distribution on the rest of the microarray. In this case you would either have to do a normalization of the microarray relative to other ones, or you would have to blank out these values: you put your finger on them and say they are NA, or you winsorize them into the distribution by giving them the highest non-outlying value instead. But generally, in statistical testing you are dealing with two groups or three groups or four groups, and you want to know what the differences are between the different groups. You might ask the question: is there a gene which is so highly expressed that it is different from all of the other genes in a sample? But that is generally considered to be an
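As a sketch, the three-standard-deviations rule and the two ways of dealing with outliers mentioned above (NA or winsorizing) look like this in R, on invented data with one planted outlier:

```r
# 50 well-behaved values plus one obvious outlier at 12
set.seed(7)
x <- c(rnorm(50), 12)

# Flag values more than three standard deviations from the mean
z <- (x - mean(x)) / sd(x)
is_outlier <- abs(z) > 3

# Option 1: set outliers to missing
x_na <- x
x_na[is_outlier] <- NA

# Option 2: winsorize, pulling outliers in to the most extreme
# non-outlying values
hi <- max(x[!is_outlier])
lo <- min(x[!is_outlier])
x_wins <- pmin(pmax(x, lo), hi)
```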
outlier and a highly suspect case. But I can think about that, and I will also think about your question. "Cyclic experiment with data where the CDF is less than zero in the middle": yeah, now I have to think about that, but in that case, if you do not know your probability density function, you should switch to non-parametric statistics. And in a way, I would always say that you want to run both. Generally, when I do analysis, and one of the main analyses that I do is quantitative trait mapping, where we associate a phenotype with a region of the genome, I use a linear model or a generalized linear model, which assumes a certain probability density function, and in parallel I just run a non-parametric model. If I see a highly significant effect using the linear model but it is not significant using the non-parametric model, then I'm going to look and investigate what's going on: are there outliers, is there heteroscedasticity, is there something else going on, why am I getting such a big difference? Because in theory, if something is significant with the parametric test, it should be significant in both of these tests, and of course you want both of them to be significant before you really trust an effect. I would say that being highly significant in one method but not in the other is highly suspect: something may be wrong at the marker that you're asking a question about, or something is wrong with how you defined your groups, or one group has a weird outlier, or the sample size in one group is very small compared to the sample size in the other group. That does happen a lot. So always use two tests, the parametric test and then a non-parametric test, and if both of them say this is highly significant, there is no issue; but if one of the two says this is highly significant and the other test comes back saying that no, this is not significant at all, then you really have to do a post hoc
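The "run both" strategy described above can be sketched very simply (the two groups below are invented data; in real quantitative trait mapping the models would be more involved):

```r
# Two invented groups with a true difference in means
set.seed(3)
g1 <- rnorm(25, mean = 5)
g2 <- rnorm(25, mean = 6)

# Parametric and non-parametric test on the same data
p_param    <- t.test(g1, g2)$p.value
p_nonparam <- wilcox.test(g1, g2)$p.value

# If one is highly significant and the other is not, investigate:
# outliers, heteroscedasticity, group coding, unbalanced group sizes
c(parametric = p_param, nonparametric = p_nonparam)
```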
investigation and really start looking into why there is such a big difference between these two tests. Generally, when I do generalized linear models, I use non-parametric lasso regression as my backup method, to make sure that both methods are giving me the same answer. Perfect, good. So then, for me, I would like to thank you guys so, so, so much. I never expected such a big reception for the course, and ending up with a thousand subscribers, I'm still blown away by it, and it wouldn't have been possible without you guys, and without the pandemic, so at least something positive came from the pandemic. I wish you guys a very, very good evening, I hope to see you guys next week, and thanks for being here. See you guys!