 the recording. So welcome everyone on YouTube and on Moodle if you're watching this not live. First off, exam dates. The exam dates will be on the 21st of July and the re-exam will be on the 9th of the 9th. Make sure that you register two weeks before. Currently seven people are registered and since we have like between 12 and 16 viewers I think we should have at least a couple more people registering for the exam. So do that. All right so the assignments. So let's go through them. Today's like the student data presentation will be on the 15th of July. Yes, yes as far as I can see it will be on my birthday. So I will probably be wearing my birthday hat and eating cake during the entire lecture. I'm sorry that we can't do it in person because then you could have all be eating cake but not a birthday suit. No, we're not going that far. No, no, no, no. All right so let's switch to Notepad++. Let me scroll up a little bit. Is it readable or should I make it a little bit bigger? I think a little bit bigger might be better, right? I hope that everyone can read it. So lecture nine starts off with using the data from lecture number six. So the first things is like the first things that we did we already did in lecture six, right? So I'm just going to very quickly run through it and explain why we are doing what we are doing but I think that if you would have done the assignment six assignments then you should be able to load in the data and you should be able to take things like the log two transformation. But first off just what are we going to do? So the first thing is that well we are loading in some data, right? So the first thing that I always do when I load in data is do some preliminary plots like do a box plot show me what is in each of the columns. But because of the fact that you want to look at the data you want to see it and that's the nice thing about R because it really quickly allows you to do some explanatory data analysis. So let's get me an R window and just copy paste in the first couple of lines to load in the data. So of course I'm loading in the data. I'm setting check names to false and that is because when you open up the data set in the text editor you see that the columns in the data set array data so the data which has the micro array data in there the column names start with a numeric value and that is not allowed because column names have to be proper variable names and a variable cannot be called zero one two, right? Because variables are names so names always start with a character and not with a number. So just quickly for you guys how does this look? Well we have our array data. I will show you the first 10 lines that look somewhat like this. So you see that we have a description. So we have things like the bright corner and the dark corner so the bright corner is the positive control the dark corner is the negative control and then we have all kinds of probes which we have measured and of course probes which have been measured have a sequence so that we can find where in the genome this sequence occurs and then we have all kinds of measurements so just the intensity values that we get off our micro array. So as you can see we have names like ht 2010 and if we want to know what ht 2010 means then we have this other we have this other matrix called arrays which is kind of a description of the different names, right? So we have a file name so this is the name which kind of came from the machine then we have something which is called the atlas id which is mapping to the data file here and then we have things which is strain so that means what type of an animal is it is it an inbred mouse or is it a f1 so a cross between two inbred mice and you can see that we have three types of strains so we have bfm mice which are the berlin fat mice and we have a b6n which is the standard laboratory mouse and besides that we have f1 individuals which are a cross between the two. In the tissue column we see two different tissues right so we see ht for hypothalamus which is a little part of your brain and we see gf which is the gonadal fat which is the fat around your gonads and furthermore we have then the individual id and you can see that the atlas id is just a combination of the two but the individual id is the id that we use in our mouse house so that's the reason why we have the individual id is there. Alright so the first thing that we want to do of course is make a couple of box plots but of course we can't just do a box plot of the whole array data because the whole array data has this sequence column in there so there's two ways of doing this and there's a way which is easy and that's just saying like oh take array data right and then drop column number one so just say minus one and this will drop the first column of course this is not really scalable right it might be that in the future we get a new data file or we add some individuals and the new data file might have like two or three columns before the data starts so the better way of doing it is to say well we have this atlas id column in arrays right so we want to just use this column as the index to the array data so how can we do that so we can say something like well take from arrays the column which contains the atlas id right so instead of throwing away the first column what we are doing is we're very explicitly selecting the individuals which are in this column and then using this column as the selection criteria for the big data file which has all the measurements in there so if we look at this and let me look at a very small one to ten right now we see that we have a matrix which only contains numeric values and we lost our first column so two ways of doing it but I don't really like the minus one system because that's not really scalable while this is really scalable because you're just selecting the individuals from the arrays and then using these individuals as an index to your data set which I think is a little bit better all right so of course like I told you guys first we want to do a box plot so I'm using again the same system this is kind of big right so we could have just said something like this and say well I do want to define a new variable called samples right and then instead of using a raise atlas id every time I could now just use samples and that's of course a little bit better because now we have a new variable which holds this singular column but since it's not too bad I'm just gonna leave it as it is so the first things first we are going to make a box plot and this won't surprise you guys because we already did that so let's switch back to the R window it will take a little bit of time and the time is taken because there are so many outliers so we can see that we have measurements which are around zero and the maximum measurements are around four times ten to the power of five which is like 400 000 so you can see that the range of the data is very big right we can't see the box even for the box plot so that means that the data is definitely not a normal distribution right so the first thing that we want to do is kind of draw this distribution because we have a lot of the probes on the array which are kind of off because most of the genes in the genome are not expressed at a given point in time well some of the genes are expressed at a very high level so the reason or the quickest way to solve this is to take the log two so just do a transformation so what I'm doing is I'm just saying well apply and then would it be suitable to use Windsor eyes in the case of these outliers no no because let me show you guys why probably Windsor rising is not going to work because if you do a single plot right so if we take instead of the all of the animals we just take the first animal right and we just do a basic plot or now let's do a histogram right so if we look at a histogram right then we see that there's like a massive amount of data here let's add a couple more breaks like do a hundred breaks on the on the histogram and so you can see that there's a lot of genes which are not expressed and then we see that there's some which are expressed and then we we almost don't see anything here why something like log two versus log 10 well in this case it's more or less a given thing but Windsor rising won't work because then we're throwing more or less all the data away and this distribution is never going to be a normal distribution so let me show you the difference between a log two transformation so if I do a log two transformation on the data for the first individual and I show you a histogram right then it looks like this and this is still not a normal distribution right you can still see that there are it's it's kind of a two-part system right so you see that there's a lot of genes which are not expressed and then you see genes which are expressed but now you start seeing that this starts becoming more of a normal distribution a log 10 will do more or less the same thing but the reason why we generally take a log two is because the distribution of a log 10 and a log two are similar right you just take a different base number so the only thing that it does is change the scale on the bottom and for micro arrays doing a log two is kind of the the default so that's just it's kind of a historical thing if people who did micro array analysis would have started with log 10 transformations then we probably would have all used log 10 but the thing to remember is that it doesn't matter if you take a log two or a log 10 but the thing is is that it just changes the scale because instead of saying y equals two to the power of x or saying y equals 10 to the power of x the scale is just like a scale a scale of five right so here the maximum is five well if we would do a log two then we expect the maximum value to be around 15 so that's just because the two and ten there's like a five times difference in there but of course we can see that it's still not a normal distribution not a perfect normal distribution but this is the thing that we kind of have to deal with right and we we know this we know that a lot of genes are not expressed and that that's not bad it's just when we start modeling it we have to keep it in the back of our minds that what we are working with is not a normal distribution all right so let's go back to the code and here in the code we can see that because I don't want to do just a single individual right like I did before I'm just going to say well I'm going to take my data right which is just this incantation saying array data take the samples and then I'm going to apply to the columns a log two transformation and then the thing which I'm going to do is I'm going to directly put it back in the array that I had so I'm going to override the array with the override the data that I had so of course if I would do this two times or three times then that would not be correct I can only do this one time because I'm destroying the original data by just copying the the log two transform data over the untransformed data but in this case it should be fine all right so let's do this and then make our box plot again right so to see if the box plots look a little bit better and of course they should right because we already looked at the histogram but if we do a box plot then we see now that yes indeed we see more or less what we saw before but now we see a nice kind of normal distribution it's that you still have the whole bunch of values on the bottom so all of these means if we pull it out a little bit or let me just do las equals two to flip the axis around and now we can also read the names of the individual arrays so las equals two just means go from a horizontal horizontal text on the x-axis to a vertical and so we see that the mean expression across the arrays is around five and we can see that the maximum is around 16 almost 20 and we see that on the bottom there are not really outliers but that's just because these this this massive amount of non-express genes is kind of pulling the whole distribution down because normally if you would have a real normal distribution you would expect the the waxes so these these things which stick out of the box plot to be the same length so if you would just look at a box plot and you see that okay so ahead this one has a very good normal distribution more or less from five to around like two and a half and seven and a half but then you see that the distribution has a very long tail on the top and it has a very short tail on the bottom so and then normally would if the waxes would be the same then it would be a real normal distribution what we can furthermore see is that we still have some outliers and of course these outliers we could windsurize away right because now we're left with only a couple but we're not going to windsurize we're just going to take the standard approach for microarray data and the standard approach for microarray data when you see that the means of all of the microarrays are a little bit different and i think last time we already talked about it that you can see that in the hypothalamus the average expression is slightly higher than the average expression in gonadal fat and this has just a biological cause so by doing a normalization yeah because we want to have a normalization because we want every array to have the same average expression and kind of the same range in the expression so by doing the normalization we are losing some real biological information because from this from the biology standpoint if we would see a picture like this we would conclude that well in brain more genes are active than are active in gonadal fat but it's the way that it is we have to normalize because we have to get rid of this more or less effect where every array has a slightly different mean because if we would do modeling on this then we would find like thousands and thousands of genes being different between the different samples and that's kind of that's that's not really true because that has just has to do with the amount of DNA that has been brought on the microarray so let's do a normalized quantiles which is the standard way of normalizing it so again the same thing i'm i'm just gonna copy paste it in so here we see the normalized quantiles function which comes from the preprocessed core package and this will normalize our data to make sure that every array has the same mean and that the quantiles so the the first quantiles and the second quantiles are more or less smooth it out across all of our arrays and again i'm going to do this destructively so i'm just going to copy it back over the original data directly all right so let's do that and then do our box plot again to make sure that the data really looks good and then after that we see that indeed now it has normalized the way all of the differences because every micro array now has the exact same average value and it has the exact same variance of course we now lost some biological evidence because here we now see that the gonadal fat samples get pulled up and the hypothalamus samples get pulled down a little bit but we can't really do anything about that besides splitting the data set in two and keeping the gonadal fat samples separate from the hypothalamus samples which if this would have been a real analysis you probably would do hey you would probably have to do the gonadal fat and the hypothalamus sample separately all right so the next step is then to do your linear modeling right because have we the lecture was about linear modeling so the question was really long right so use linear modeling to detect differentially expressed genes and then make sure to compensate but i want to show you guys how i would normally build this right so the first thing that i would do if i want to do a linear model i would say well i need to have one set of measurements right so the one set of measurements that i'm going to take is just the the third one on the array right because the first one is the is the positive control the second one is the negative control so i'm just going to take the first one which has some real data in there so that is number three so if i just look at number three right then i have a vector of data right so you can see that it just has measurement values and of course it because i'm taking one row of a matrix it kind of automatically becomes a vector so i can call this why right so i can just say why is this thing right so this is the thing that i want to predict now what do i want to use to predict well the thing that i want to use to predict is i want to know if the strain or the tissue has an effect right because we have strains so we have bfmi so fat mice we have standard laboratory mice and we have different tissues because we have gonadal fat and hypothalamus so had the first thing that i'm going to do is say well i'm going to define my strain right so i'm just going to take the strain column here from arrays and i'm just going to give this a name so i'm just going to call this strain right and then i'm also interested in not the strain but the but the the tissue so let's take the tissue column out and put this in its own variable called tissue all right so now i have everything i need to do linear modeling i have a response variable and i have two different predictors one which is the strain one which is the tissue so the first thing that i'm going to do is see if everything works i'm just going to type lm and i'm going to say well take my why as my predictor and now use for example strain so oh and then i get a invalid variable list for why and that is because it is still a matrix it's still a matrix which has one row and many columns so let's solve that and just say as numeric right so i'm just going to force it to be a numeric value and now when i look at why it's now really a vector so it lost the the column structure and the row structure it now just values um and now when i do my linear model um it will tell me well this is interesting strain here is now interpreted as a factor right because it has three different levels um so it has b6n bfmi and f1 and when i do my model i see that well it gives me back what what i gave it right so the formula is the prediction of the micro array intensities based on the strain um and i get an intercept back i get a strain bfmi effect so this is the bfmi effect relative to the default group and fortunately for us the default group that it chose is actually the b6n right because i have three groups but i only get two effects back so i get an effect of the bfmi relative to the b6n and i get an effect of the f1 relative to the b6n so these effects when you use a a factor variable or a variable which is type factor um then it will always give you it give you the values relative to the base value but what the base value is is something that you can determine yourself um because if we wanted to get the effects relative to the bfmi we have to do something like this right so we could would say that now when we look at the strain i want to just say well this is a factor right so i'm now going to force it to be a factor and i'm going to say that levels is c and then the first level that i specify is going to be my default level so i have bfmi b6n and i have f1's and then i'm going to do it like this and now when i do my model it will now give the effects relative to the bfmi so i can determine what is going to be the default if i don't specify a default it's going to be the first thing that it sees in the column is that clear i hope so right so you can see that the effects just get flipped around but of course the f1 effect is now completely different than before and that is because it's it's now relative to the bfmi but of course i want to have it relative to the b6n because that's our more or less laboratory mouse so the default mouse so let's flip that around so i want to have bfmi as well so let's put in the bfmi as well right so now i can do my model again and now i see that everything is nicely relative and now the f1 is mentioned first but that doesn't really matter right you can see that the effects are still exactly the same it's just that the order has changed all right so let's extend our model right because strain is not the only thing that might influence our phenotype it might also be that the tissue is influencing it so we can just add tissue as an effect so we can say linear model where we say that the expression of this one single probe on the microwave is determined by the strain of the animal and the tissue that we're currently looking at and then we see that when we do that we see that the f1 value doesn't really change to be if my value doesn't change and the tissue doesn't really and the tissue is the new effect that we see but because these two estimated parameters haven't changed at all we know that these things are not collinear so when we designed our experiment we did a good job and we didn't measure only hypothalamus in f1s and only brain and only gonadal fat in for example bfmis and so when effects start changing between two models right so from a model where you have one effect compared to a model where you have two different variables predicting if you see a big change like we saw in the last lecture then there is something which is called collinearity but in this case because the effects are the same we can assume that there's no real collinearity in our data which is good because that's kind of what we want to see all right so we have the coefficient you see that the intercept changed a little bit and then we have our effects of course we're not that interested in the effects we want to know if there's anything which is significant right so if we want to look at significance then I'm just going to save my model called m1 or model or my model but m1 is because it's the first model that we're making and now I can use the ANOVA function on m1 and then we see that the strain although it has an effect right because there's a difference between the strain compared to the bfmi or compared to the b6n so compared to the reference mouse we see that there's no real evidence for this to be significantly different what we do see is that based on this first probe on the array that we're looking at we see that there is evidence significant evidence that the tissue is playing a role in the expression of this gene so that's good to know right so in this case the the mice strains are not different so all three mice have more or less a very similar expression but what we learn is that the tissue is very different from one tissue to the other tissue all right so now I've set everything up to start doing it en masse right because I don't have one probe on my micro array I have a bunch of probes like 20 30 000 probes so that's the first thing that we need to do so the thing that I'm going to show you now is the the code that I wrote before and you can see that it follows the exact same structure as what we had before right so instead of so we're taking why we're doing the strain we're doing the tissue then we're doing our model and then we just do the ANOVA on our model and I'm taking out the PR column right so the probability column because I'm I want to know the probabilities and I'm not so much really interested in the effects because first I want to know which of these 20 000 genes are different and once I know which of these 20 000 genes are different I can then drill down in these genes and see what the difference exactly is so I'm just going to remember the probabilities so the way that I'm going to do that and like I first wrote this piece of code what I'm going to do is I'm just going to go put a big for loop around it I'm going to say for x in one to the number of rows of array data so just go through this whole big matrix one by one and of course I need a variable to store my results in and I call that results and of course every time that I do this model right because if we go back to r quickly right and I would say do this probability f thing right because what I want to get is I want to get this column out so if I look at this column oh it's called m1 right then now I get the three different p values why do I get three different values one which is missing right this is because we also have the residuals here so the residuals they don't have a probability because they're just there so the probability will always be an a but there you can see that I can just use these two values so the thing that I'm going to do is I am just going to for every row in my big matrix so my array data I'm just going to take these values so this part and I'm just going to row bind it right so I'm just going to make one big matrix which contains all of the results because for every every test that I do I get three values back so I'm going to have a matrix in the end which has three columns and of course the number of rows of this result matrix will be equal to the number of rows of the original matrix so when I look at notepad plus plus this is what I kind of did here so you see that I have my results variable and every time I do my model and then I take this this column from this matrix and I just row bind it to the results that I already had all right and then of course I can add the column names so I can add well the first the first column contains the p values for the strain the second column contains the p values for the tissues and the third column contains the p values for the residuals which will always be an a but still we have to give it a name and then of course the row names of the result will be the row names of the array data right because originally because when I'm just row binding it won't give me any row or column names it will just make a big matrix but I have to know in which row which probe I did the test for all right so let's run this this is going to take a little while um so um let's go to the r window and let's just run it and this will take some time so are there any questions so far or is everything clear right so you see that when I when I build up something like this and the question was pretty long yeah because it it was like I can read the whole question but that's a little bit nonsensical as well um but here you can see that I do a step one by one so first things first I'm going to just take one of the things put it in y and then model it using the strain and the tissue and once I have working code in r I generally copy paste the working code from r to notepad plus plus and then put a for loop around it just to do it for all of them yeah but originally I always look at one if it works for one it's probably gonna work for all of them all right so this will take a little bit of time so because it's a big data set right like there's there's a lot of genes in the in the genome and for each gene we need to do our test no question so far that's good was everyone able to do this um I know that Testosaurus my new moderator that I just promoted to moderator actually did it right were you able to finish the whole thing with the two tips that I gave you I hope so what why the smiley face you were you were either able to or you were not able to solve it right in the beginning you just not finished but almost okay but like in the beginning you just try to do too much and I think that's very common when people start programming is that they want to do too much in one go right so they are they they instead of thinking themselves and breaking the problem down into very small sub problems and then trying to solve the individual sub problems what people generally do is when they're faced with a big question like this they go to google they google they end up on stack overflow stack overflow gives them like 30 lines of code they copy paste these 30 lines from stack overflow and then start modifying it to try and have it work for their analysis but it's generally not the way that you want to work the way that you want to work is really start from nothing and then do it step by step so take a single marker or take a single probe on the array and model that and once you're able to model a single probe on the array then just write a for loop around it and do it for all the probes on the array taking a while though and that that's on a like six core machine although I'm not using six cores I could have brought in in parallel as well but that's gonna be a little bit more complicated. Hi mom yes twitch life lover you're hi well I'm not your mom but someone has to say hi back right otherwise it's just not not fair all right so we're waiting and we're waiting um so actually while we're waiting I got a new toy and I wanted to show you guys the toy um and since we have some time now um I'm just gonna play with it so I bought myself one of these drawing boards because I can I can draw stuff for you guys so I'm also doing the the like new layout so this will be the layout next year so next year is going to probably very likely oh let me I have to switch it here then as well so the layout next year's probably going to be something like this and of course then we will have like pandemic edition um like three show the picture and let the chat guess the age of the artist which picture you mean you okay chat what is the age of the the person who drew this why did it move I don't like it moving like I know that like I like it a lot like it's it's clear and it works really well um let me let me switch back to this one this is the new coffee break gift for next year 25 30 oh my god how old do you guys think that I am actually 26 that's a that's an interesting like you guys know that on the 15th is my birthday right I told you that it's from a child guys no it's not from a child 31 then he is 31 years old you're 35 oh my god you guys are so so so generous that's that's so generous I I like the fact that you guys all think that I'm like early 30s I'm not early 30s not not not by a long shot I will be 38 but yeah the new drawing board so next year we'll probably use it a lot because I like it a lot right because I can do things like just use a pen and then I wasn't that far off no no no no but but next year like instead of using the board behind me to do all kinds of scribblings that you guys almost cannot see I can just draw like a little graph for you guys and then like the the difficulty here is that with the new like tool I or the toy it's not really a tool it's more toy um but I can do like dots and then I can make like a nice graph and I can show you guys like oh look there's a like linear connection between two things um but that's just for next year and um it actually finished so very good very good very good so let's go to the our window all right so we see that it finished that's really good so let's put away the toy um and um if we look at the results right and this is the first thing that I do so I just say give me the first 10 rows of the result yeah so like like we predicted we can see that the residuals are na um we see that there is p values in here and we see that the tissue is is looking like it's very very significantly causing differences and we already knew that right because just by looking at the first box plot I will come back next year well I hope not if you pass the exam there's no reason to come back next year um you're more than welcome to of course but you're more than welcome to but um if you pass the exam there and there's no real need unless you want to learn the same stuff again or kind of again um but we'll have to see um I there will be more new stuff of course because r does change all right so let's let's continue with the assignments because otherwise we'll be drawing and talking the whole time which is fine but um I do want to finish the stuff here because then we can switch to the nice stuff right I will for sure be around again next year why do you say that general gulag don't you don't you trust yourself that you'll be able to pass the exam like it's not going to be a very difficult exam um and actually this year your guys are probably very lucky because it will be an online exam meaning that it's slightly easier than doing it in person because you don't have me to kind of distract you guys and walk through the rows and watch while you're writing stuff um oh just to freshen up knowledge okay yeah no that's always a good idea um of course like the videos are on youtube as well so you can just freshen up on youtube but um I I I would love you guys to be here next year as well like the more people watching the more fun it is for me as well all right so now we can start answering the questions that were there right so the first question that we had was how many genes are differentially expressed between bfmi and b6n mice um that is actually a question that we cannot answer with the analysis that we just did because we only know if there's a strain effect so let's see if that's something that we can answer later on um why did I ask that question because that's a question that you can't really answer by using linear modeling because you have to do postdoc tests then anyway it doesn't matter so the first thing's first right because we did 20 000 tests or somewhere around that we have to adjust for multiple testing so adjusting for multiple testing can be done using the p adjust function so i'm just going to use the p adjust function to adjust um based on the amount of multiple tests that we did so i'm just going to say well take the results take the strain column take the tissue column and then just use the p adjust function to adjust for the number of tests that we did and the p adjust function will use the most um restrictive um adjustment so it will do pomfroni correction by default um but you can tell it to do different methods as well so it has a parameter where you can tell it well i want to use benjamin hoogberg right so it depends on which error you want to optimize but in this case we want to optimize for the type one error because if we say that this probe or this gene is differentially expressed then we want to make sure that that's really the case right and we don't want to optimize for kind of a 5 false discovery rate no we want to be very stringent because we have a lot of genes so it's better to lose a couple that are really different but make sure that the ones that we have are super significantly different compared to the other way around because we don't want to end up with like thousands and thousands of genes that we have to look or further investigate so i'm just going to use the the adjustment function so i'm going to say adjust my strain results and adjust my tissue results and then overwrite the results that i had um so i'm just going to do this and when we then go to r so in r i will just keep this part here so adjust the p values so now when we look at the first 10 results you will see that it has changed a lot right now all of these which had non-significant p values before now end up having a p value of exactly one and you can see that some of the other ones which were not that low now also got adjusted back to one but you can still see that there are some probes on the array which are still different right so how many are still different well if we want to know so if we want to know how many there are we can now just use our standard threshold of five percent right so i'm going to say which of the strain which of the strain p values are lower than 0.05 and this will tell me how many genes are different between the different strains that we are looking at um and of course when i do this it will tell me a big vector of true false true false but i want to know how many there are so i'm going to say which ones right because this will give me the indexes back so if i do it like this it will tell me which ones are differentially expressed but i'm interested in how many so i'm just going to ask for the length right so then i have a single number and it tells me that there are 418 which are differentially expressed between the different strains so if i want to do tissues right i can do the exact same thing but instead of using the strain column i'm going to use the tissue column and that tells me that there are 17 000 genes which are different from gonadal fat to hypothalamus and that is a lot right we know that a mouse has on average like 25 000 genes um but of course in one case we have brain tissue where a lot of genes are active and in the other case we have fat tissue in which only a limited amount of genes are active um so it might be true that there's really 17 000 genes or probes which are targeting genes which are different in this case all right so the question then becomes um what is the major pitfall when looking at genes differentially expressed between the two species so anyone in chat what what is the major pitfall here and what what can we do wrong and we have some solid quiet air time for you guys to think all right so let's go back to the drawing board i'm i'm just gonna draw this for you guys right so that you uh oh all right so let me just wipe this away no don't do it like that give me the pen again i don't know why it changed i had i had practiced it a lot but now all of a sudden the entire layout in in in office it has changed so let's just get rid of all of this stuff all right so we're looking at a gene in three tissues right so first off we can have something like this right so imagine that when i look at the bfmi uh no we have two tissues right so if i look at the bfmi right and if i look at the gene then it might be that the gene is in gonadal fat and we have hypothalamus i'm not that good at this so it could be that here in gonadal fat in the bfmi we have very low expression and in the hypothalamus we have very high expression of this gene but now when we look at the b6's right we could have again we have gonadal fat and we have hypothalamus we could have the opposite direction of effect right so in the bfmi it's low in gonadal fat high in hypothalamus while in the b6 it's high in gonadal fat and low in hypothalamus of course when i combine this data right then one of the major pitfalls that we can run into is that kind of in the model this gets averaged out and now how does it look well if we look at the tissue uh if we look at the effect right then the effect does consider the different tissue but in the in the end what it will do it is it has like two values here and it has two values here now when i compare this then it will kind of notice not the difference anymore because in one case we have a positive beta estimated and in the other case we have a negative beta estimate and now these two can cancel each other out and now we don't find an effect so in this case this might happen in our data right because that that might be the case um so we might have an effect where in the bfmi we see a positive relationship where hypothalamus has higher expression than gonadal fat and the other way around um we can see that in b6 we have a lower head but now by combining these two data sets and giving the model both the data of the bfmi and the b6 when it does a test to see if gonadal fat is different from hypothalamus it has measurements for gonadal fat in both bfmi and b6 and it has the same thing for hypothalamus of course but the problem now is is that these two average out because the positive beta in the bfmi is canceled out by the negative beta in the b6 so this this is something that can happen and this is one of the major pitfalls when you do more complicated models not a model which is dependent on a single thing but on multiple things and this is something that is really really hard and really really difficult to to kind of compensate for so there are other ways to compensate for this um and one of the ways that I generally try to compensate for this is that to is calculating a model for the tissues and then correct the data first for the tissue effect or do it for the different strains right depending on what we want to do right so first things first I'm doing the same thing but now I'm using the residuals and then taking the intercept and adding that back in right because we know that once hey when we do a model a linear model where we only have one effect right the tissue effect then we have a single straight line through the data and then we can have our residuals and then we can use these residuals to make an adjusted phenotype so we we pull out one of the effects and by pulling out one of the effects first and then looking at for example doing a t-test between the three different strains we are much better able to figure out exactly what is going on so and this is just the way that you can do that so I'm just going to show you hey because here I'm just wanting first to adjust my measurements so I'm taking again the measurements calling them why then I'm making a model where I say that y is determined by the tissue and then I have my model and now I take from my model my residuals so this is the phenotype after correcting for tissue and I'm adding back the coefficients the intercept because that's just the mean of the probe that I want to add back in because I don't want the expression to be centered around zero but if the expression used to be seven then I want the average expression to still be seven so I call this phi core which is the corrected phenotype and then I just make I just bind this to my corrected data so what this does is just it runs through the whole matrix again one by one and then what it does it it adjusts the phenotype based on which tissue we are looking at and then I can just use basic t-testing because now I only have one factor left and this factor has two levels so then what I'm going to say is well which of the strains are BFMI which of the strains are B6 and which ones of the strains are an F1 and now I'm going to go through the corrected data and then I'm going to take the corrected values for the BFMI and then do a t-test between the BFMI and the B6M and then I'm going to do the same for the F1 to see if the F1 is different from the BFMI and I can also test of course the F1 versus the B6N and here I'm going to combine them first so because now I want to have two p-values here so I'm going to say do the t-test remember the p-value combine this with the p-value of the other test and then I'm going to say this is a new variable called one role so this is the adjusted data t-tested BFMI versus B6F1 versus BFMI and then I'm going to add it together and then I'm just going to give the column names at the end after I made my matrix and this is a slightly better way of doing it because now we can't have this pitfall where two of these things cancel each other out because in one tissue or in one strain you see a positive tissue effect and in the other strain you see a negative effect between the tissues so this is another way of of doing it but that we are first taking a correction so we take our phenotype we remove one of the effects and then we do t-testing we could have also done this using linear regression again but in this case since we only have one factor left we can then just do a t-test between the different groups of this factor and then of course we have to adjust for multiple testing again and now I can actually answer the question that was in the assignment and say how many genes are different from BFMI to B6 and how many genes are different between BFMI and the F1 individuals all right so let's run this code and then we can see how the corrected data and stuff looks like as well right so then we can make some plots again this will take some time I see that my original moderator is back as well so welcome to the lecture as well I hope your meeting went okay I actually modded Daniel as well I should have modded Skrita but I don't think that Skrita is here today so that's a shame I want to be really evil and but all right so and so we're correcting so again this will take some time oh let me switch to our window so we're just correcting the data so just copy pasted the data in so and head this will correct our data and oh she is good good then I can unmod you and mod Skrita right but then so correcting data is a valid way of getting rid of an effect and head like I showed you guys in the simple drawing is that if two effects are opposite to each other you cannot find this effect because they cancel each other out there you are very good all right so it will take a while do we have any more assignments I think these were all of the assignments right I think those were all um I had some free questions there oh yeah we are still at question 4b is it valid to use linear modeling on this data set that should be something that you could answer right because we looked at the data um so the question is is it um is it okay to use linear modeling here so this is just the yes no question so you can just throw yes or no in chat and like you don't even have to have done the assignments for this I will start I will start I will say this is my answer a lot of people sleeping today yes oh testes ours I'm so sorry I am so sorry but no of course it's not valid to use linear modeling here earlier one of your award-winning moderators for dumbness time me out and I don't know why explain yourself mod yeah explain why did why did you why did you time out one of the viewers because there was no reason I think for a time out um but of course it's not valid to use linear modeling here um when we looked originally at the data um what we what we saw is that there is no normal distribution right we see that there's a big hump of of of things which are zero and then we see that there's a very kind of broad um uh normal distribution this is what happens when you give your students too much power yeah yeah I know I know I shouldn't give uh see how insulting he is he's not insulting anyone and how do you know he's a he or even like assuming his his gender yeah anyway no timeouts for for twitch life lover twitch life lover is good you can you can watch and you can stay and that's so minor yeah all right it's still correcting the phenotype data but um so based on the original data that we showed you guys right so if we look at the histogram across um we can see that the histogram is is is is problematic it's not a normal distribution so linear modeling in this case will probably be not be the correct thing to do um because if a gene is not expressed in one tissue but it is a little bit expressed in the other tissue then we already get a significant p-value which is why we ended up with 17 000 genes being different between the two tissues which is just too much um it can't be that between brain tissue and fat tissue there are 17 000 genes which are different um so testes out as 3k shouldn't be a mod to be honest you're not even considering giving him like a second chance like he he he timed you out i'm agreeing with you he shouldn't have timed you out at all but of course it's his first day as a mod so he he he he probably just didn't know how to do how the controls work so i demand an explanation and he called me evil that's actually true that's actually true um he's now reporting from mobbing what why why i'm i'm i'm going to uh um do uh something like uh time out all right let's let's put testes out as in time out let's focus on the lecture i i i agree right we adjusted data is finished right so the data is now adjusted so now if we look at the corrected data 1 to 10 and we look at the original raw data which is stored in let me see um array data arrays right so we're just going to look at the array data we're going to say 1 to 10 comma 1 to 10 comma throw away the first column right we can see that they're they're we adjusted so there is an adjustment and you see that some hypothalamus samples actually got adjusted more or less up right because it was or down it was 1.4 and now they're adjusted to like 1.3 um and you see in this case that the gonadal fat samples haven't changed at all um and that is of course because one of the groups is going to be the um is going to be the uh base group all right no one's going to go to goodbye so don't don't fight in chat like everyone can make mistakes it happened um all right so let's let's look at some of the adjustments so let's take the third column and do a plot right so these were the original measurements oh that's not what i wanted let's do it as numeric i'm having a hard time typing today so and and let's me do x-axis is normed axis right so i'm just trying to make a nice plot for you guys so that you can show the samples and then the adjustment um so the first thing is is that i want to why is it not disabling the x-axis x-lap x-axis this should disable the x-axis i don't know why but um we can put axis one at is one to the number of columns of uh array data and what do we want to put there we want to put the call names of array data there and now it's over plotting them which is a little bit annoying and actually i have to do minus one to throw away the first because we have the sequence column in there so i don't want to have the sequence in there so let's do it like this and then say la s is two to flip them around all right so then it looks like this so of course because it's the it's the artist palette model right so i'm just doing one by one adding things to the plot um and it just over plots them but i can just close the plot and then say well give me a new plot why is it not disabling my axis actually oh yes it's in the ass numeric that's the reason all right and i want to disable the x-axis okay so what i'm saying is take the original data throw away the sequence column and then don't plot an x-axis and then i'm going to say make a custom x-axis so axis one at the x-axis add one to the number of calls of array data minus one yeah syntax highlighting yeah that's why i try to do most of the things in notepad plus plus notepad plus plus has syntax highlighting also for r so it just looks a little bit better right so here you can say that it has kind of minor but it it's okay ish right so here you can say that length and which are there but normally a lot of people and i think also a lot of students they use r studio um so you don't have that issue because i think r studio has syntax highlighting within the text editor there um but all right so but let me show you how the adjustment worked right so we just add the axis i'm just going to make the window a little bit smaller so i'm going to add the axis right so here we see the individual things and i'm still not liking that this thing is here so i'm just going to say don't give me an x label either and then just add the axis right so here we see the original measurements for hypothalamus gonadal fat more hypothalamus more gonadal fat and now i'm going to use points and i'm now going to take the same gene right number three i have a question on plotting i heard that it's useful to use source on save when plotting and adding many things and start the plot from zero yeah yeah so that's that's kind of what i'm trying to always do i'm always trying to not use the basic r plot i i generally make an empty plot and then start adding things to the plot um i don't know what you mean by source on save when plotting because i i don't think that base r has a source on save um anyway so i have my corrected data now which i want to add to the plot so from the corrected data i'm also going to take the third one in this case there's no sequence in the corrected data so i can just do it like that and then i'm just going to say color is um green uh that's a little bit of a shame so use the plot to the axis and then make the green dots but do pch equals 18 so make them filled right um and green is probably not the best color so make them red all right so now here in black we have the original values and here in um red we have the adjusted values so hey you can see what happened so when i did the adjustment for this gene it chose gonadofat as the kind of default group right so gonadofat didn't get adjusted and you can see that by the fact that none of the had that the green dots and the black dots exactly overlap but the adjustment here because i did a model and i take and one of the two groups gets a beta right you can see that the beta for hypothalamus just moved all of the points from the hypothalamus up to the mean of the of the gonadofat group and that is how this kind of adjustment works right because i have two groups so what it does it calculates the mean of both groups and then one groups doesn't get an adjustment in this case gonadofat didn't get an adjustment because it's the first group and then the second group so hypothalamus got the adjustment so this just removed the effect of the tissue and by removing the effect of the tissue we're better able to see if there is a difference between the different samples so between the different strains um i won't google your source on save because i'm i'm not too sure what source on save is or source on save ah source on save is something which is um um unique to our studio i read it online and it was in german so oh okay i can read it in german so it's a function in our studio oh that sounds very very smart actually to do that that sounds very smart that sounds very smart to do you can't do that with the standard r so you can't make a um um but yeah our studio because it is aware of the different data sets that you have loaded in if you make changes to your data set it can more or less um figure that out right so if it looks if you plot like the first row of the matrix right and then you change one number in the first row of the matrix our studio picks that up and then the source on save thing means that it will just automatically update the picture um but basic r can't do that all right i've been talking for one hour and 13 minutes now so um i'm just going to stop my recording for youtube so youtube will be right back um and for you guys