Let's do a sound check. Sound check one, two, three. Should be okay, I think so. Still getting these weird error messages from YouTube, but it should be fine. I think everyone can hear me, so perfect. All right then, welcome everyone. Let me switch so that you can see me as well. I'm very excited: today we will have linear mixed effect models. That fits in really well with what we did last week. Last week we did linear models: the case where you have data gathered on a couple of variables and you've done everything perfectly, so there's no collinearity and you have measured every individual only once. Hey, welcome moderator, welcome. So yeah, today linear mixed models. For the drawing I made a sun that's really, really warm. I've been sitting in a little room the last couple of days with seven other people, and in my office it's the same thing; it's just not doable. It's doable when the window is open, but unfortunately it can't be, because then we have all of the noise. So let's get started, right? I'm excited, I hope you're excited, and we're already one minute over time. So I created the sun, which is very warm, and then we have of course the little mixer for mixed models, and a little "answers lecture seven". I think that's wrong, right? This is lecture nine, so it should be "answers lecture eight", but I hope you can forgive that. I'm excited because linear mixed models are my specialty, like I mentioned last time. I hope that everyone was able to do the assignments; they were hard. You can open the window, I don't think it will be that bad with the noise; that depends a little bit on how noisy the neighbors are. I think it's more noisy inside the building than outside, but I'm just going to rough it for now, and if it really becomes too bad I will get a one and a half liter bottle of water. I already finished my smoothie for today, so I already drank way too much. We'll see how it goes. Good. So, the answers to last lecture, lecture number eight. Let's do like we always do: I need an R window and a Notepad++ window. The R window can go for now, because we want Notepad++ first. So again, we start as always. The nice thing is that the assignments start off by reusing the data we used in assignment six: load in the provided data sets using the read.table or read.csv function. Fortunately, I don't have to explain this anymore because we already did it before, so let me just copy-paste the text from the last lecture. So that is here, and then we have my window here. Of course, we need the preprocessCore library again to normalize our data, and then we just load the array data and the arrays. The array data is the data captured by the microarray, and the arrays file is the description file where, for each array, it is written down what type of animal was put on there, what tissue was put on there, and that kind of information. So that was the first part. Do we need to normalize it? Yes, because the first assignment for lecture number eight is the same as for lecture number six: create a box plot showing the expression levels. So we'll do that. Again, like two lectures ago, we just call the boxplot function; a sketch of this loading step is below.
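A minimal sketch of the loading step, assuming hypothetical file names (the real ones ship with assignments 6 and 8) and assuming the metadata file carries the array names as row names:

```r
# Quantile normalization comes from Bioconductor's preprocessCore package.
library(preprocessCore)

# Hypothetical file names; row.names = 1 is an assumption so that
# rownames(arrays) later gives us the array identifiers.
arraydata <- read.table("arraydata.txt", sep = "\t", header = TRUE)
arrays    <- read.table("arrays.txt", sep = "\t", header = TRUE, row.names = 1)

# Metadata per array, and the probe measurements (first column: sequences).
head(arrays)
arraydata[1:5, 1:4]
```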
We take the array data, but when we looked at it we saw that the first column contains the sequences, so the probes that are on the array. So I'm just going to use the row names of arrays to select the real array columns, because that's where the real measurements are; the sequences and any other metadata we leave out. So we take out those columns and then make a box plot. Let's quickly go to R and show you how this looks. I have to find my R window, there we are. It takes a little bit of time, right? Because it's unnormalized data, which means that you don't even see the box plot: the boxes are all the way down here and there are so many outliers, because the raw intensity data is not normally distributed. So let me quickly show you the arrays. If we look at the first five rows of arrays, we see the metadata. And if we look at the array data, first five rows, we see that the first column contains the sequence of the probe, and of course this can't be plotted, so we have to get rid of it. That's why we use the row names of the information file to select the proper columns from the array data: array data, all rows, comma, only the columns named in the row names of arrays. Then we need to normalize it. So let me close the R window. The normalization first takes a log2 to get the data into a more normal distribution; this is a normalization aimed at making the distribution of the data more normal. Afterwards we will use the normalize.quantiles function to normalize away the remaining differences between the arrays. So I'm just going to take the expression data, take the log2, store it, and make a box plot again; see the sketch below. And here I don't need to drop the sequence column anymore, because I'm already doing the subset here: I take the subset and take the log2 of that subset, so the sequence column is already gone from my data. All right, let's go back to R and do a box plot. Now we can see that our data indeed looks a lot more normal. There are still some distributions with outliers, but those will probably be gone after normalization. We can also see that the HT samples, hypothalamus, have a much higher expression than the gonadal fat ones, which is a real biological phenomenon, but it's something we can't really deal with at this point, because we haven't learned how to fit different types of tissue in a single model. All right, the next step is normalize.quantiles; I think that's also in the assignment. Yes, question 1c. We already did this before, so I'm just going to copy-paste in the code and not type it, because why waste time when the weather is beautiful. We take the expression data, make it into a matrix because it's still a data frame, then use the normalize.quantiles function and save the result. I'm not going to overwrite the original; I'm just going to add .norm to the name, because in the previous assignments, when we did the live edits, we also had .norm.
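The subset-and-log2 step, as a sketch under the same assumptions as above:

```r
# The first column of arraydata holds probe sequences, so select the
# measurement columns by the array names from the metadata instead.
expr.data <- arraydata[, rownames(arrays)]

# log2 pulls the heavily skewed raw intensities toward a normal shape.
expr.data <- log2(expr.data)

# One box per array; medians should now be visible on a comparable scale.
boxplot(expr.data, las = 2, main = "log2 intensities per array")
```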
So in the previous assignments we had two different normalization methods, because someone was unable to load the normalize.quantiles function. All right, hello, hello. Last time we only had a little interference from the wind. Yeah, but I think the audio quality matters, so I'm just going to rough it. Good. So we take the expression data, normalize the quantiles, and then do a box plot at the end again so that we can see that it really changed; see the sketch below. Right, let's take the R window and run normalize.quantiles. Fortunately, this goes really fast. If it takes you a long time to normalize data, or for example to make the box plots, then it's always useful to save the plot or the data after normalization. There are a couple of normalization techniques which are computationally really heavy, so you don't want to redo them every time: you would write out your normalized values to a file, and then start a new script which begins by reading in the normalized values. But in this case normalize.quantiles is pretty quick, and afterwards we can see that the data has the same median everywhere: the quantiles are the same, and the whole range of the data is now similar from the first array all the way to the last one. All right, the next step is the box plot, and we did that, so we're at question two already. That was all question 1, so let me go to Notepad again and mark questions 1a through 1d; it's exactly the same as what we did previously. So, question two. Now we want to use a linear model to detect differentially expressed genes. Make sure to compensate for the correct covariate, strain and/or tissue, and use multiple testing correction. Additionally, try to use quadratic linear modeling, which might cause some problems on such a small data set. Go nuts, try what you want, and try to get an answer to the following questions. Good, so let's first go back to R and look at our arrays file, because now we're really interested in what's going on: what tissues are we looking at, what strains are we looking at, and we want to compensate for that. The first thing we see is that we have the same individuals in hypothalamus and gonadal fat, which is relatively annoying, because one of the assumptions in linear modeling is that samples are independent. And taking the brain from a mouse and taking the gonadal fat from the same mouse is of course not independent: we are looking at the same mouse. For this problem we would need linear mixed effect models, so right now we can't solve this issue. At the end of the lecture we will be able to, because we will have tools to deal with the fact that not all measurements are independent of each other. But for now we can't. So in this case, since I'm only allowed to use linear models, what I'm going to do is just split my data set into two, because I can't easily compare the gonadal fat samples to the hypothalamus samples. I could try, but then we would run into an issue: our p-values would be overinflated, because we have the same mouse in the analysis twice. We have mouse 2010 in both tissues, and that would cause a big issue. Good.
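The quantile normalization step, as a sketch; note that normalize.quantiles() wants a matrix and drops the dimnames, so restoring them is worth doing:

```r
# Make every array's distribution identical (same medians, same quantiles).
expr.data.norm <- normalize.quantiles(as.matrix(expr.data))
rownames(expr.data.norm) <- rownames(expr.data)
colnames(expr.data.norm) <- colnames(expr.data)

# After normalization the boxes should line up from first to last array.
boxplot(expr.data.norm, las = 2, main = "after quantile normalization")
```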
So the first question we want to answer is: how many genes are differentially expressed between BFMI and B6? Of course this is conditional on which tissue we are looking at, because it might be that in brain there are 1,000 genes which are different and in gonadal fat only 10. But first let's look at the hypothalamus. Let me switch you back to Notepad. I'm going to take the Tissue column out of arrays, that's Tissue with a capital, and ask which arrays have tissue equal to hypothalamus, because I'm interested in that one first. Then I take the row names of arrays for those, so the row names of the arrays where the Tissue column says HT, and these are my ht.samples, my hypothalamus samples; see the sketch below. All right, the next thing I want to do is make a linear model. I can't model all of the probes on the array in one go, so I need a for loop: one model per probe. So I'm going to say for x in 1 to the number of rows in my normalized expression data, build up a linear model, and this linear model should take the expression of the probe. Let's first make a variable for that; let's call it expr. So expr is the expression of the probe I'm currently looking at, which is row number x of the normalized expression data: the first row, second row, third row, all the way down to the bottom. Then I make a linear model saying that the expression I observe is explained by the strain I'm currently looking at. Of course, I have to define the strain as well. So what am I going to do? I do need to make a subset here, because I only want my hypothalamus samples, not all of the samples. And I need to know which strain these hypothalamus samples are. Fortunately, I have my arrays, so I can take arrays at the ht.samples rows, the column called strain, and put that in the variable strain. All right, now I can pose my linear model: expression explained by strain. But if we look in R quickly, we can see that there are actually three different strains: we have the B6s, the BFMIs, and the F1s. So if we just do this test, the question being asked is: is any of these three groups different from the others? I might end up with genes which are similarly expressed in the Berlin Fat Mouse compared to the reference B6 mouse, with the F1 individuals having a much higher or a much lower expression. So we could subset further and throw out the F1 individuals, but I'm going to keep them in for now.
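The sample and strain selection, as a sketch; the column names "Tissue" and "Strain" and the strain labels are assumptions based on what is said on stream:

```r
# Arrays measured on hypothalamus tissue ("HT" in the Tissue column).
ht.samples <- rownames(arrays)[which(arrays[, "Tissue"] == "HT")]

# Strain label (B6N / BFMI / F1) for exactly those arrays, in the same order.
strain <- arrays[ht.samples, "Strain"]
```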
I'm just going to say I don't really care if the F1s are higher, because in the end I just want to know whether the BFMI and the B6 are different. So why not keep the F1s in first? It will cause some probes to show differences, but in the end we can always do a t-test as a post hoc test to figure out whether it's really the BFMI versus the B6. All right, so I'm going to do an ANOVA, and first test the little piece of code I wrote inside the for loop. So I go to R and set x; well, not to 1, let's look at probe number 500 on the array, why always use the first one? Then I take my little piece of code from Notepad++ and test only the body of the for loop, so one differential expression test. And indeed, I did not copy-paste in the hypothalamus samples, so R doesn't know which ones those are. So let's do that: set x to 500, take the hypothalamus samples, take row x of the expression data for the hypothalamus samples, put it in a little variable called expr, get the strains, and do the ANOVA. All right, here we see that strain takes two degrees of freedom, which is correct: we need one degree of freedom to estimate whether the second group is different from the first, and a second degree of freedom to test whether the third group is different from the first. If we would look at the linear model itself, it would tell us the beta effects: it would say that the reference group, the B6N in this case, has a mean of 1.7, the BFMI is 0.3 units higher, and the F1 is slightly lower, but very minor in how much lower it is. Good, so the code within the for loop seems to work, so we go back to Notepad, because now we need to remember our p-values. Let's quickly look at R first. When I do the ANOVA, I get this ANOVA table back. I want to know the strain effect, and what do I want to know about it? The probability from the F-test. So I take that row from this little matrix and that column, and then we see that we indeed get the proper p-value. So I take this code back to Notepad and say: this here is my p-value; pvals starts as an empty vector, and each iteration I add the newly computed p-value to my pvals vector; the loop is sketched below. All right, let's go back to R and run the whole for loop. This will go through all of the probes on the array, fit our linear model, and remember the p-value. It will run for a little while, but in the end we get a number, and while we wait we can just continue the code. So I say p.adjust on pvals using the false discovery rate. I'm going to be very relaxed today because it's really warm, so I'm not going to do a Bonferroni correction, just a false discovery rate correction.
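The per-probe loop, as a sketch using the objects from the sketches above:

```r
# One linear model per probe; keep the F-test p-value for the strain term.
pvals <- c()
for (x in 1:nrow(expr.data.norm)) {
  expr  <- expr.data.norm[x, ht.samples]       # expression of probe x
  res   <- anova(lm(expr ~ strain))            # is any strain group different?
  pvals <- c(pvals, res["strain", "Pr(>F)"])   # probability from the F test
}
```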
So this puts my false discovery rate at 5%, which means that out of the hits I find, 5% might be false positives, but that's okay; it doesn't always have to be super strict. All right, I store this in pvals.adjusted, and then I want to know how many probes are different. So I ask which of these are below 0.05 and then take the length, just to count them; see the sketch below. So this is our linear model for hypothalamus tissue, and we test whether the B6N, BFMI, or F1 group is different; here we do the adjustment for multiple testing. Why are you guys looking at the R window? You should tell me. Yeah, we are still seeing your terminal. Thank you, Leonardo, that's indeed true. I'm wondering what's going on. That is strange. Now you can see it. Stupid delay sometimes, but thanks for reminding me. All right, so we adjust: p.adjust on our p-values, then take the adjusted p-values, ask which ones are smaller than 0.05, and take the length to get the count. Yeah, now you see Notepad. I was looking at my little machine here, but when I checked the YouTube preview I also saw my R window. Good, the loop is actually done, which is good. So let's go back to R; are you seeing R now? You should be. If I look at my own program I can't see the YouTube chat, which is a little bit annoying. This used to be a lot better on Twitch: Twitch integrates with OBS, so I had one window with the chat and the preview together. With YouTube it's not that easy. All right, so let's adjust the p-values; fdr is written in lowercase. We do the adjustment and see that there are 1198 hits, which are 1198 probes rather than genes, because a single gene can be targeted by multiple probes. So there are 1198 probes for which one of the groups is different. No, I am using OBS; the problem with YouTube is that I cannot get the YouTube chat to show in OBS, while with Twitch I have a plugin that shows my preview in OBS next to the Twitch chat window. YouTube is not as good for streaming as Twitch, but YouTube has better discoverability. All right, so almost 1200 probes are different between the groups, but we have to be careful here, because these 1200 are not the answer to the question. The original question was: how many genes are differentially expressed between BFMI and B6N? That is different, because the answer here is that one of the three groups is different from the others. So now I know which probes, according to my linear model, are differentially expressed. And you can already see that this number is much lower than what we got last time, when we just did a t-test between hypothalamus and gonadal fat and didn't care about the different strains. By using a linear model on a subset, we only look at one tissue.
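The adjustment and the count, as a sketch:

```r
# Benjamini-Hochberg FDR correction, then count the significant probes.
pvals.adjusted <- p.adjust(pvals, method = "fdr")
length(which(pvals.adjusted < 0.05))   # 1198 in the run shown on stream
```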
And within this tissue, of course, we are not bothered by the fact that there are more genes expressed in hypothalamus compared to gonadal fat. So we now know for which genes there is a difference between the groups, meaning a difference in at least one of the three groups. And this is nice, because we also have the second question: how many of the genes are different between the BFMIs and the F1s? We can do this in one go. Let's go back to Notepad, which you see now. I'm going to say for i in this set, because I now need to do a post hoc test: I adjust my p-values, ask which of the adjusted p-values are significant, and loop through those. So I take the expression of probe i for the hypothalamus samples, and I take the strains again. Now I have my expression and my strains, and I want to do two t-tests: is the BFMI different from the B6, and is the F1 different from the BFMI? So I ask which samples have strain BFMI, take those from expr, and call them bfmi. Then I do the same for the B6Ns and for the F1s, and I update the variable names so they reflect what is in them. All right, so now we have our three groups, and we can use a basic t-test to figure out which one is different from which. And I'm not doing this across all of the probes, only for the probes which showed a difference in the linear model. So: no linear model here, just a t-test of the BFMI against the B6, and a t-test of the BFMI versus the F1s; call these t1 and t2 for t-test one and t-test two. Then, if t1$p.value is smaller than 0.05... well, here's a good question: since I'm only doing a single test per probe now, do I still need multiple testing correction? I do have a prior, because I know from my linear model that there is a difference. But yes, I'm still doing many tests, namely as many as there are significant probes. So I remember that count: n.test is the length of this set. I don't know if this is completely statistically sound, but I'm going to do it like this: p.adjust on the p-value of the first test, again using the false discovery rate, where the number of tests is n.test. I had to look up the parameter name, and it is simply n, so I say n = n.test. That should be okay. Then I check directly whether it's significant, and if so, I increment a counter: n.bfmi.b6 becomes n.bfmi.b6 plus one. And then the second test: if the p-value of the second test is significant, I increment n.bfmi.f1. So I'm just adding one to a counter and counting through all of the probes. All right, here I forgot to open a bracket: when I typed this closing bracket, I saw that it had no matching bracket, you can see that it doesn't highlight red, so I knew immediately something was missing.
So I looked here and saw that this bracket should be closed by that one; when I go behind it, it highlights the matching one red, so I could figure out directly that in the if statement one bracket was missing. And the same thing here: if this one is missing, then this one should be missing one as well. And now when I am behind this one, it is red, which means it closes the for loop. That's just to show you how I find a missing bracket: directly when I typed the closing bracket, I knew something was wrong because it didn't highlight anything, so it didn't close anything. Good, so we need to initialize these two counters: initially there are zero genes different between the BFMI and the B6, and zero genes different between the BFMI and the F1. Good, let's go to R and run the code. In the end we can look at n.bfmi.b6: there are 96 genes which are different between BFMI and B6. And for the F1s, there are 40 genes which are different between BFMI and F1. Now you might be asking: wait a second, there were almost 1200 probes differentially expressed, 1198, so where did all the other ones go? This is because a t-test doesn't have the same statistical power as a linear model, and because I'm asking a slightly different question. The original question was: is one of the three groups different? Now I'm asking: is the BFMI group different from the B6 group, or is the BFMI group different from the F1 group? Slightly different questions. But in the end you can see that by first eliminating all of the genes which the linear model doesn't consider differentially expressed, and then using that as a prior for the post hoc test, we end up with a much more manageable list: 96 genes to investigate further for the difference between the BFMIs and the B6s, and only 40 genes for BFMI versus F1. Of course there's still the third comparison, where the F1 is different from the B6, so let's also add that case. It wasn't in the question, but it's something people might ask. Back to Notepad, where we add the last possible case: t3 is the t-test between the B6 and the F1. I duplicate the adjustment part I already wrote: adjust t3 for the number of tests, and if it is significant, n.b6.f1 goes up by one. I could also remember which probe it was: I initialize an empty vector for each comparison, and besides adding one to the counter, I store the probe index, which is of course i, because we're using i as the iterator. We're not going through the whole matrix, only through the rows we already know are different. I do the same for the other two comparisons and quickly update the variable names so everything gets stored in the proper place. The full post hoc step is sketched below. All right, so let's run this again. Save it.
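Putting the whole post hoc step together, a sketch under the same assumptions as above; the exact counter and vector names are my own, and the strain labels are assumed:

```r
# Probes the linear model flagged as significant after FDR correction.
sig.probes <- which(pvals.adjusted < 0.05)
n.test <- length(sig.probes)

# Counters and probe-index vectors for the three pairwise comparisons.
n.bfmi.b6 <- 0; n.bfmi.f1 <- 0; n.b6.f1 <- 0
i.bfmi.b6 <- c(); i.bfmi.f1 <- c(); i.b6.f1 <- c()

for (i in sig.probes) {
  expr <- expr.data.norm[i, ht.samples]
  bfmi <- expr[which(strain == "BFMI")]
  b6   <- expr[which(strain == "B6N")]
  f1   <- expr[which(strain == "F1")]

  t1 <- t.test(bfmi, b6)   # BFMI vs B6N
  t2 <- t.test(bfmi, f1)   # BFMI vs F1
  t3 <- t.test(b6,   f1)   # B6N vs F1

  # Adjust each p-value for the number of post hoc tests being run.
  if (p.adjust(t1$p.value, method = "fdr", n = n.test) < 0.05) {
    n.bfmi.b6 <- n.bfmi.b6 + 1; i.bfmi.b6 <- c(i.bfmi.b6, i)
  }
  if (p.adjust(t2$p.value, method = "fdr", n = n.test) < 0.05) {
    n.bfmi.f1 <- n.bfmi.f1 + 1; i.bfmi.f1 <- c(i.bfmi.f1, i)
  }
  if (p.adjust(t3$p.value, method = "fdr", n = n.test) < 0.05) {
    n.b6.f1 <- n.b6.f1 + 1; i.b6.f1 <- c(i.b6.f1, i)
  }
}
```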
Copy, go to R, and paste it in. Object B6 not found; that is true, because b6 is written with a small b. Let's try again. Now when we look at n.bfmi.b6: still 96. BFMI versus F1 should still be 40. And if we look at B6 versus the F1, we now see that there's only one gene with sufficient evidence to be differentially expressed between those two groups. So in total we started off with around 1200 probes for which there was evidence. An extra tablet for the stream? I already have enough screens in front of me. It might be an option, Misha, but I try to focus on this window, and I have my recording software there and the chat there. I don't think an extra window would help; I'd rather have a plugin for OBS which shows the YouTube chat, like with Twitch. Good, so, the questions. How many genes are differentially expressed between BFMI and B6? We have that answer now: 96. How many genes are differentially expressed between the F1 and the BFMI? That's 40. As an additional one: how many genes are differentially expressed between B6 and the F1s? Only one gene. And then the last question: how many genes are differentially expressed between gonadal fat and hypothalamus tissue? This is again something we can answer, but so far we only looked at hypothalamus, so we only know half of the whole equation, because we never looked at the gonadal fat. The nice thing is that we built up all of this code, and here I add a comment: a post hoc test for figuring out which are really significant. Normally when I write code like this, I look and see that I've already written 68 lines of code; well, not really 68, because we start at line 26, but it's a big piece of code, and I now want to do the whole thing again for gonadal fat. There are two ways to do this. One is to duplicate everything and change all of the HTs to GFs, but then I'm duplicating a whole bunch of code. So what I would normally do is split the code into functions. I have a piece here which does my linear model, so I'm going to make that a function: I put the keyword function around it and fix the indentation. This is lin.model. What does lin.model need? It needs the normalized expression data, because that's what it uses, so that's one input parameter. The next input parameter would be the ht.samples, but we want to generalize our code, so I delete that: we're not looking specifically at the HT samples anymore, just at a list of samples that we pass in. And in the end, what do we want back? We want to return the pvals from this function. Then the whole thing becomes: pvals is lin.model of the normalized expression data and the samples. And which samples do I want to give it? The hypothalamus samples. So now, if I want to do the exact same thing for gonadal fat, I just duplicate this line.
I use the gonadal fat samples, and I duplicate my selection line and say: which tissue is GF, and these are the gf.samples. So I add two lines here, and two lines for the function, one for the function definition and one for the closing bracket. I added four lines of code, but I saved duplicating the whole thing. This is going to be pvals.ht, and this is going to be pvals.gf. Then I do the same thing with the adjustment, because we can do that in one go as well: why not adjust the p-values directly from the linear model? So we say p.adjust, using the false discovery rate, and this separate line can go. And the comment here becomes: compute and adjust our p-values. All right, so the number of tests is now the adjusted HT p-values lower than 0.05; I'm not going to put .adjusted behind the name. So this is n.tests for hypothalamus, and I do the same for gonadal fat. Then I turn the post hoc part into a function as well, with the exact same strategy: I write the function keyword, put the opening bracket, go all the way down, fix the indentation, and close my function. And of course I now need to return something. What do I want to return? The number of BFMI-B6 differences, the number of BFMI-F1 differences, and the number of B6N-F1 differences; and in the second part, for all three, the indexes of the probes as well. Since you can only return one thing in R, I use the list keyword: the first element is the number of probes, let's use probes rather than genes because that's slightly more accurate, and then BFMI versus B6, BFMI versus F1, and B6 versus F1. So this returns a list: the counts of probes that differ for each comparison, plus the indexes of the probes that differ between BFMI and B6, between BFMI and F1, and between B6 and F1. All right, so this should work out; we just need to figure out what needs to go into the function. The counters here are all declarations. We have the adjusted p-values, which I'm just going to call p-values, removing the adjustment from inside the function. It takes, again, the normalized expression data, and it takes the hypothalamus samples, which become just samples, because I'm going to do it not just for hypothalamus, but also for the gonadal fat. So it takes the normalized expression data, the p-values, and the samples. Then we have expr and strain; the statements here depend on expr and strain, and we already have the input parameters to compute those. For the rest, I don't see anything depending on a variable that is not already passed into the function. So I'm just going to call it post.hoc; the whole refactor is sketched below.
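The refactor put together, as a sketch under the same assumptions as before; the function and argument names are the ones used in the sketches above, not necessarily the lecture's exact code:

```r
# Fit one linear model per probe and return the raw strain p-values.
lin.model <- function(expr.data.norm, samples) {
  strain <- arrays[samples, "Strain"]
  pvals <- c()
  for (x in 1:nrow(expr.data.norm)) {
    expr  <- expr.data.norm[x, samples]
    pvals <- c(pvals, anova(lm(expr ~ strain))["strain", "Pr(>F)"])
  }
  return(pvals)
}

# Post hoc t-tests on the flagged probes; returns counts and probe
# indexes per pairwise comparison in a single list.
post.hoc <- function(expr.data.norm, pvals, samples) {
  strain <- arrays[samples, "Strain"]
  sig <- which(pvals < 0.05)
  n.test <- length(sig)
  cnt <- c(bfmi.b6 = 0, bfmi.f1 = 0, b6.f1 = 0)
  idx <- list(bfmi.b6 = c(), bfmi.f1 = c(), b6.f1 = c())
  for (i in sig) {
    expr <- expr.data.norm[i, samples]
    bfmi <- expr[strain == "BFMI"]
    b6   <- expr[strain == "B6N"]
    f1   <- expr[strain == "F1"]
    ps <- c(bfmi.b6 = t.test(bfmi, b6)$p.value,
            bfmi.f1 = t.test(bfmi, f1)$p.value,
            b6.f1   = t.test(b6, f1)$p.value)
    for (cmp in names(ps)) {
      if (p.adjust(ps[cmp], method = "fdr", n = n.test) < 0.05) {
        cnt[cmp] <- cnt[cmp] + 1
        idx[[cmp]] <- c(idx[[cmp]], i)
      }
    }
  }
  return(list(probes = cnt, indexes = idx))
}

# Both tissues in four extra lines instead of a full duplicate.
gf.samples <- rownames(arrays)[which(arrays[, "Tissue"] == "GF")]
pvals.ht <- p.adjust(lin.model(expr.data.norm, ht.samples), method = "fdr")
pvals.gf <- p.adjust(lin.model(expr.data.norm, gf.samples), method = "fdr")
res.ht <- post.hoc(expr.data.norm, pvals.ht, ht.samples)
res.gf <- post.hoc(expr.data.norm, pvals.gf, gf.samples)
```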
That's how I'm going to call my function, and I'm going to call this post.hoc function twice: once for hypothalamus and once for gonadal fat, gf. We need the normalized expression data in both cases. What more do we need? The p-values, which is pvals.ht for hypothalamus and pvals.gf for gonadal fat. And then the samples, which of course should match: the ht.samples and the gf.samples. All right, so now we do the exact same thing for both tissues. We used to have 84 or 86 lines of code, and now we have 88: I added four lines of code and it now does both tissues. Making functions really saves a lot of code duplication and is really useful for these kinds of computations that you want to run repeatedly. All right, so let's rerun question number two in R. This now takes a long time, because the lin.model function goes through all of the rows of the array. Really warm. All right, question 3a, since we can look at the next question while it runs: what is the major pitfall when looking at genes differentially expressed between the two strains? Think of the direction of effect in the different tissues. OK, so the idea actually was to combine the two tissues, but that's not good statistical modeling: combining two completely different tissues, where you know there's a massive difference in mean, is going to cause trouble. The major pitfall when you look at genes differentially expressed between the two strains is this: if you combine gonadal fat and hypothalamus and you look at a particular gene, you might have a gene which goes from lowly expressed in BFMI to highly expressed in B6N in the gonadal fat, while in the other tissue the effect is the complete opposite. And because that is the case, the two effects cancel out. Let me make a small drawing for that while we wait for the linear models to finish. I need my drawing window; that's in streaming, documents, PowerPoint, there it is. Let me add it to OBS so I can start drawing. So, imagine we have two tissues, and on the axis we have BFMI and B6. Imagine that my BFMI measurements in tissue number one are low and my B6 measurements in tissue number one are high. The linear model, when it looks at the first tissue, will find a beta coefficient which is positive. But in the other tissue, the gonadal fat, it could be that this probe is highly expressed in the BFMI and lowly expressed in the B6, so there the beta is going to be negative. So what happens if I combine these two tissues? The model will first remove the effect of the tissue difference: it will calculate the mean of the BFMI across both tissues and end up with a value which is around here. Let me use a slightly different color; let's make this green.
So, if I would do a linear model where I say my expression depends on the tissue I'm looking at, plus the strain, plus a certain error term, then what happens if in one tissue there's a negative effect and in the other tissue a positive effect? First it compensates for the tissue: the average of this tissue is adjusted to here, because it looks at the BFMI samples and says, well, the average of the BFMI samples over both tissues is around here, where the green line is. So the sample measurements I have get adjusted, and the same thing happens for the B6. And now, when I ask whether there is still a strain effect after correcting for tissue, it will say no. The beta is positive in one tissue and negative in the other, so in this combined model, correct for the tissue effect and then look at what is left for the strain, the two effects are opposite and cancel each other out; the little simulation below shows this numerically. That is generally why I prefer not to merge different tissues. Statistically speaking it's also something you don't really want to do, because tissues tend to be very different from each other. You might want to ask which genes are different between brain and heart, but you will end up with literally thousands of genes being different simply because you're looking at different tissues. All right, so here we had a small error about a closing bracket. If I look at the code in Notepad++, something went wrong in the post hoc function: I wrote return with a list inside, and the list is closed but the return is not, so I have to close the return as well. Let's try again. All right, now the function parses. Argument five is empty: that's because I have a comma here with nothing behind it. You can see that. Here I just call the function; the function itself is fine at this point, it complains about the return, because it says you're making a list, you promise another element by writing a comma, and then nothing comes after that comma. So back in Notepad++ we remove the last comma, and then it should be fine. All right, so let's do our post hoc test; I think this should be the last error. Now, for HT we know the answers: for hypothalamus there are 96 probes different between BFMI and B6, 40 probes different between BFMI and the F1, and one probe different between B6 and the F1s. And we also have the probe indexes now, so we can look at, for example, probe 155 in the hypothalamus, and it should show a big difference between B6 and BFMI; the same holds for the others. But now we also have the answer for gonadal fat, relatively easily.
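To see the cancellation from the drawing numerically, here is a tiny simulated example, entirely my own construction and not the assignment data, where the strain effect is +2 in one tissue and -2 in the other:

```r
set.seed(1)
tissue <- rep(c("HT", "GF"), each = 10)
strain <- rep(rep(c("B6N", "BFMI"), each = 5), 2)

# Strain effect: +2 in HT, -2 in GF, on top of a large tissue difference.
expr <- ifelse(tissue == "HT", 10, 4) +
        ifelse(strain == "BFMI", ifelse(tissue == "HT", 2, -2), 0) +
        rnorm(20, sd = 0.3)

# The naive combined model sees (almost) no strain effect at all:
anova(lm(expr ~ tissue + strain))

# Fitting per tissue recovers the two opposite betas:
coef(lm(expr[tissue == "HT"] ~ strain[tissue == "HT"]))
coef(lm(expr[tissue == "GF"] ~ strain[tissue == "GF"]))
```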
So there are 128 probes which in gonadal fat are different between BFMI and B6, 31 which are different between BFMI and the F1, and one which is different between the B6 and the F1s. It is actually a little bit strange that it's only one again, and if we look at the probe indexes, it's not the same one. So it might mean something biologically, or it might not. But we can now see that these are the probes we should focus on if we are interested in the difference between the Berlin Fat Mouse and the standard B6 mouse. All right, so with the little drawing I just showed you, I tried to explain that it's not a good idea to merge the tissues together. If you merge the tissues, you might throw away the effect you want to see: in one tissue you have a positive effect, in the other tissue the opposite effect, and when you do one big linear model where you just throw in everything you're trying to model, the betas cancel each other out. So don't do that. That was the answer to question 3a: if you throw everything into one big linear model, you might end up missing genes which are upregulated in one tissue and downregulated in the other. The next question is more difficult: is it valid to use linear modeling on this data set? This really depends on how strict you want to be. One of the things we didn't do, and which I told you you should do, is make sure that the samples we are looking at are a random subset of the population. Of course, from the data you have, you can't see whether the three or four or five mice we took actually represent the whole B6N or the whole BFMI population, so we can't really be sure of that. But even if you assume the random sampling was done well, there are other major assumptions underneath linear models: one of them is that the residuals are normally distributed, and we did not look at the residuals for any of the models. We fitted 33,000 or 35,000 models, but for none of them did we actually check whether the residuals are normally distributed. So the question is valid, but the answer is that we just don't know. We should write code which, for every linear model we fit, stores a histogram of the residuals or runs a proper normality test on them; a minimal version of that check is sketched below. And then question four is an additional question: perform the same analysis as before, now using the unnormalized data. That question is there to force you to write a function, to take the code that you have and put it into functions like we did. My analysis consists of more or less two steps: the first step is the linear model, and the second step is the post hoc test. After I know which genes should be different according to the linear model, I do a post hoc test, and that post hoc test is just a function. So in the end I could take the functions all the way to the top.
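A minimal way to add that residual check inside the per-probe loop, as a sketch; the 0.05 cutoff and the collection vector are my own choices:

```r
# Inside the per-probe loop: does this model's residual look normal?
fit <- lm(expr ~ strain)
res.normal <- shapiro.test(residuals(fit))$p.value > 0.05   # my 0.05 cutoff

# One could collect this per probe and flag the models that fail:
# residuals.ok <- c(residuals.ok, res.normal)
```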
Because in theory, I could put the functions in a separate file: I move the real statements back, the function definitions go into a separate file, and I would save that file as, say, functions_a8.R. Then at the top of my analysis script I say: source, so load, this file before you start using it, source the functions of assignment 8, and what remains would be the whole code for the analysis. That's generally how you keep your code clean and your scripts manageable: put things in functions, put the functions into a separate file, and source that file at the beginning. All right, so those were the assignments. Of course we could have gone much more nuts: you could use the quadratic linear modeling I suggested, or whatever. The big issue here is that the data set is relatively small, there are not a lot of samples, but I think it's a nice first introduction to linear models, because we had the lecture about linear models and we had already looked at the data set from assignment six: you had already loaded it in, and we had already done the normalization. And now you can see that it matters which strategy you use. If you just use a basic t-test from the beginning, with no pre-test, we identified, I think, almost 13,000 genes as differentially expressed in the first analysis. Now, with the linear model test followed by a post hoc t-test, we find a much more manageable number of genes. We can go to the lab, or we can do a literature study: when there are around 90 genes different between BFMI and B6, those 90 genes are something we can investigate, but a list of 1,000 genes being different is not something you can easily test. A hundred is doable. It would be nice if it were slightly fewer, but then we could just be stricter with our p-value: instead of a 5% false discovery rate, use a 1% false discovery rate. Good, so those were the assignments, and then we will go back to the lecture after the break. Oh, just as a reminder: the exam is coming up, so please sign up in Agnes for the exam if you have an HU account. If you do not have an HU account, or if you're an external student, get in contact with our examination office and have them put you on the list. It is very important that you are registered for the exam. The first exam will be on the 28th of July, the re-exam on the 23rd of September, and the best grade will count. So if you do the first exam and you get a three, you can redo the exam to get a better grade; if on the second exam you don't feel well and get a worse grade, that doesn't matter, the best grade of the two counts. So sign up via Agnes, very important if you want to join the exam. The exam will probably be via Zoom and Moodle: you have to take a photo of your answers, and I'm going to supervise you by webcam or something like that.
Unless there are only a few people registering for the exam. It says here via Zoom and Moodle, but it might be an in-person exam if only five or ten people sign up, because then it's easier: we can distribute you around the building and do it in person. Good, so: linear mixed models. We'll have a short break of around 10 minutes and then we will be back. On special request, and remember you can always request animated GIFs for the breaks, if you tell me your favorite animal I can find GIFs of it, this week's first break is going to be Tasmanian devils. So with that I'm going to start the music; we will be back in around 10 minutes, so enjoy yourselves, get a coffee or something fresh, and I will see you in five to ten minutes. Back, welcome back. So: linear mixed effect models, also called linear mixed models, or LMMs. I showed you this slide before: last week we did general linear models, and you see here that when there is correlation or there are random effects, you can use repeated measurement models or mixed models. So we will be focusing on that part of the scheme, and next week we will do generalized linear models, for when there is no normal distribution. Today's lecture is an adaptation of one of the introductory tutorials by Bodo Winter. He has a very, very good tutorial; it's just a PDF, and you can download it from the link, which I think is also in the description of this video. He also has a very good tutorial on standard linear models, so not linear mixed models, which is what we did last week; definitely check that one out. After we do his introduction, I will also show you an example from my current research. I think we will be finished at around 4:20 or 4:25, so that should be okay; we're not going to use the whole three hours this week, because it's just too warm to sit at home and burn. It is going to be a short lecture: his PDF is 22 pages, and I compressed those 22 pages into 29 slides, and after that we of course have the example. But still, I think we will be done more or less half an hour earlier than normal. Do read the PDF, because I'm going to ask questions about it: it is part of the lecture material, so it is going to be part of the exam. And of course, ask any questions in the chat if you don't understand things. Linear mixed effect models are one of the more complex things in statistics, and I think it is really good to ask questions. But if you understand everything, that's fine as well; then we will do the lecture without any questions, which is also perfectly fine. So for today we will be talking about linear mixed effect analysis: random effects, random intercepts, random slopes. I will show you how to do it in R, how to get the significance of the fixed effects in a linear mixed effect model, and I will hopefully teach you the difference between a random intercept model and a random slope model. Afterwards we will have a linear mixed model example on the Berlin Fat Mouse, from one of the studies that we did. It is not published yet; I sent it to Genetics around the end of last year.
We got it back with a bunch of questions from reviewers, because it is a very difficult topic, and I'm still rewriting the paper. So I just wanted to show you one of the presentations that I gave within our group, so that you have, on the one side, an introduction, and on the other side an example of how these linear mixed effect models are used in research on a day-to-day basis. So the first question becomes: why? Why do we need linear mixed effect models; why are linear models not good enough? Well, the main assumption of linear models is that the measurements are independent. So what do you do when measurements are not independent, and what does that even mean? It means, for example, that you measured the same individual multiple times, say over time: you did a time series analysis, you looked at body weight and measured the body weight of a single individual at 10 days, 20 days, 30 days, 40 days. At that point you have a big data set, but a lot of these measurements are on the same individual, so they are not independent of each other. They provide additional statistical power, but they are not independent measurements. You see the same thing with related individuals: if we are looking at a population and some of these individuals have the same father, then these individuals are not completely independent samples, because they share more or less half of their genome, while two unrelated individuals, two independent measurements, would not. So when you are dealing with a more complex data structure, where there is correlation between the measurements, either because measurements are done on the same individuals or because measurements are done on individuals from the same family, you have to account for this, and take into account that the effective number of measurements is not equal to n, our sample size. Last time, when we looked at linear models, we said: we have a number of measurements, called n, and a number of parameters, called k, and the power of linear modeling comes from the fact that we have more measurements than parameters to estimate; that's the degrees-of-freedom game that you play. But if I measure the same individual ten times over time, I do not have ten times the statistical power; I only have a little bit of extra statistical power, because it's the same individual measured over and over again. So if we don't compensate for the fact that the effective number of measurements is not equal to big n, then we overestimate the statistical power, and this is really bad, because it can mean that something which is borderline significant shows up with a significance of one times ten to the power of minus fifty: it looks very, very significant, while actually it's not that significant at all. Even worse is when we get significant results which are not true at all, and this happens often due to relatedness.
So if we have big families participating in a study, for example a thousand people enrolled who are part of only 50 different families, well, that would be 20 people per family, which would be big families, but you can imagine that groups of related individuals can produce what look like really significant statistical effects while they are actually not significant effects at all. These are called spurious relationships: a relationship which is found to be significant, so we say, ah, there's this gene on this chromosome, and when you have a certain version of this gene it causes a disease, but it actually doesn't; it just looks like that because the disease runs in certain families, and of course when you are part of a family, the chance of carrying the same genes goes up tremendously, because you share 50% of your genome with a sibling from the same father, and in human populations, with all kinds of tangled relationships, it's even more complicated. So spurious relationships can cause false positives and can inflate your false positive rate tremendously, making all of your results really unreliable. To avoid these spurious relationships, and to not overestimate the statistical power, we use linear mixed models instead of standard linear models, to account for the fact that measurements are not independent: that they are related to each other, or even done on the same individual. So when we talk about linear models, we always try to model a relationship: on one side we have the response, the thing we are trying to predict, and on the other side we have the predictors, the things we use to make the prediction. In the tutorial of Bodo Winter, he looks at pitch. Pitch is the highness of your voice: women in general have a much higher voice than men. We can have a low pitch, meaning the average voice is very low, and a high pitch when the voice is really high. If you want the data, you can go to the address he gives for the data set he uses in his whole analysis and tutorial; a loading sketch is below. When you look at the data set that he provides, there is subject, which identifies a person; gender, the sex of the person; scenario, which in this case is a certain question; attitude, which is polite versus informal, coded pol versus inf; and then frequency, which is the thing that was measured. They want to know whether there is a relationship between the frequency of your voice and being in a formal versus an informal setting. For example, when you're talking to your boss, do you speak differently than when talking to your friends? That is what they are trying to figure out, and for that they have different questions that people are supposed to ask, and the people are put into different situations: in one case they have to be polite, and in the other case it is a very informal setting. So the most elementary linear model they could come up with, the original hypothesis of Winter and Grawunder, published in 2012, is that the frequency of your voice depends on the attitude, so the setting that you are in.
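Loading the data, as a sketch; the URL below is the one given in Bodo Winter's tutorial:

```r
# Politeness data from Bodo Winter's LMM tutorial.
politeness <- read.csv("http://www.bodowinter.com/tutorial/politeness_data.csv")
head(politeness)   # columns: subject, gender, scenario, attitude, frequency
```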
So attitude is a two-level categorical variable: it can be formal or it can be informal. Of course we also need to include the sex of each of the participants, because we know without a doubt that if we have males and females in our study, females on average have a much, much higher voice than males. This is just a biological phenomenon, and we need to account for it. So this linear model already has two predictors and the response variable frequency, but now things get a little more complicated, because by design they have multiple measurements per subject. If you just plugged this into your linear model and ran it, the model would not know that you measured the same subject ten times. It would think, oh, you have a thousand measurements, so you have a lot of statistical power, while they actually only interviewed somewhere around 15 to 20 different people. So they don't have that much statistical power. By having the same person ask the same questions over and over again, they complicate their data set a lot.

And this is where random effects come in. Every subject, every person in the world, has a slightly different voice pitch, and this is a factor that affects all the responses from the same subject, thus rendering the different responses interdependent rather than independent. Just thinking it through: person number one might produce a 233 Hz utterance, while subject number two is a male with, say, a 20 Hz lower voice on average. Of course we need to account for this in our analysis. So I took the data and made a simple box plot per subject; F1 is female one, M1 is male one. We can clearly see a massive difference between males and females: males on average have a voice pitch ranging from about 100 Hz to about 150 Hz, while the females range from about 200 Hz all the way up to 270 Hz. But we can also see that within the females there is already a big difference in the pitch of their voices: some females have a much lower voice and some a much higher one. So we definitely have to correct for this effect, because if we just model it and assign some of the variance to sex and some of the variance to attitude, we will find very, very significant results, but part of what drives them is simply that every person is slightly different. So we need to extend our model, and the most basic way to do this is to model individual differences by allowing each individual to have its own mean.
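A quick sketch of that per-subject box plot, assuming the politeness data frame from before:

    # pitch per subject: the M* subjects sit much lower than the F* subjects
    boxplot(frequency ~ subject, data = politeness, ylab = "pitch (Hz)")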
Normally, in a linear model, we have an intercept, and the intercept is the mean of the whole population. Now we are going to model it slightly differently: the frequency of someone's voice depends on their gender and on the attitude, so the setting they are in, formal or informal, but I am also going to allow each individual, each subject, to have its own intercept. So every intercept that I calculate is going to depend on subject. Instead of having a global mean, with all the effects we look at being relative to that global mean, we say: question one for this individual is 250 Hz relative to their personal mean. We simply allow every individual to have their own individual intercept, and this is called a random intercept model. This is also how you write it down in R, without the plus error term, because we never write the error term in R: frequency, tilde, attitude plus gender plus, between brackets, one vertical bar subject, so (1 | subject).

Not only do they have different individuals who get asked over and over again in their experimental setup, they also have different questions that each individual needs to ask in different settings. And similar to the by-subject variation, the by-individual variation, we also expect by-item variation: we expect every question to have a different mean, because every question is slightly different. There are different vowels in there, different letters, and every sound has its own pitch; saying "aaw" is different from saying "a". There might be something special about "excuse me for coming late" leading to an overall higher pitch compared to asking for a favor, and this is regardless of the influence of the politeness setting. And indeed, if we look at all the questions across all the individuals, we see that question number one has a pitch of around 220 Hz on average, while question number seven, for example, has a pitch of only around 70 Hz. So not only is there per-individual variation, where every individual should be allowed its own mean, every question they ask should also be allowed its own mean. So we extend the model again: the frequency of your voice depends on the attitude, the setting you are currently in, plus your gender; every individual is allowed its own intercept, and every item is allowed its own intercept as well. That way we take into account both the variance that exists between individuals and the variance that exists between different questions, because no two questions are the same.

In R this is relatively easy to model, but the catch is that there is no support for linear mixed models in base R by default. They are provided by packages, and because linear mixed models are an active field of research, there are multiple packages in R with different implementations. During this lecture we will be talking about the lme4 package.
The lme4 package provides a function called lmer, for linear mixed effect regression. It is very similar to the lm function for linear models, but it fits linear mixed models. So how do we do this? In R we just load the lme4 library (of course we need to install it first with install.packages("lme4")), here we have the URL where you can find the data, and then we just use read.csv on the URL directly and store the result in politeness. If we make a box plot of frequency against attitude and gender, we can see the different settings and the different individuals and compare them against each other.

Now, if we try to run this as a plain model, we get an error: if I say lmer, the frequency of the voice depends on the setting you are in and the gender you have, using the data from the politeness data set, it will just throw an error saying that there are no random terms specified in the formula. The lmer function expects a linear mixed effect model, so you need to include at least one random effect, either a random intercept or a random slope. So let's do that. We define our politeness model with lmer, saying more or less that the frequency of the voice depends on the attitude and the gender, that every subject is allowed its own intercept, and that every scenario, so every question, is allowed its own intercept as well, because we just saw that every scenario has a different mean, and the same holds for every individual. And we already saw there is a big effect of gender, because that is just biological.

We can then ask for the summary of the model, just to see what is going on and what the output of a linear mixed effect model looks like. The summary of the politeness model tells us the formula that we used, the restricted maximum likelihood (REML) criterion, and all of these things, but what we are actually interested in is the fixed effects, because the question was whether the attitude, the scenario you are in, makes a difference for your voice: whether you talk differently to your boss than to your friends. Under the fixed effects we see attitudepol, the polite attitude, and the estimate is that in a polite setting, relative to an informal setting, your voice goes down by around 19 Hz.

What you also need to take good care of, when you build your own linear mixed models, is the line about the groups. Here it says that the number of observations is 83, so we have 83 rows in our data set, and that there are two grouping factors in our data: 7 different scenarios and 6 different subjects. These need to match the number of independent observations you expect, so it is very important to check that this lines up with your own data, because this is what the p-value calculation is based on. Six subjects and seven scenarios mean that in total we have a certain limited number of degrees of freedom, and not 83 independent measurements.
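A minimal sketch of that fit, using the lme4 calls just described:

    library(lme4)   # install.packages("lme4") the first time

    # random intercept model: fixed effects for attitude and gender,
    # plus an intercept per subject and per scenario
    politeness.model <- lmer(frequency ~ attitude + gender +
                               (1 | subject) + (1 | scenario),
                             data = politeness)
    summary(politeness.model)
    # bottom of the summary: 83 observations; groups: scenario, 7; subject, 6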
So if we want to get the significance of a model, we can't just run a single ANOVA; we have to compare models to each other. Only by doing a differential model test, comparing one model against another, can we find the significance of a certain effect. If we want the p-value for the attitude variable, we build a politeness null model: a linear mixed effect model saying that the frequency of your voice depends on gender, plus a random intercept for subject and a random intercept for scenario. Then we compare the model which includes attitude against the model which does not include attitude. With linear mixed effect models, whenever you want a probability value, you need to compare one model to another model; you can't just throw the model into the anova function the way you do with standard linear models and read off the p-value.

To compare the models and calculate a p-value, you also have to add the parameter REML = FALSE, so you have to say explicitly: no, I do not want to use restricted maximum likelihood, I want standard maximum likelihood estimates for my model. So remember: when you compare two linear mixed effect models, make sure that both models are fitted on the exact same data set, and that both are fitted using maximum likelihood and not restricted maximum likelihood.

Good. When we do that, and run these two models in R, we can then do an ANOVA test comparing the model which includes attitude to the model which does not. What we see is the two models, the degrees of freedom taken by each model, and the AIC. Again, a drop of around 10 points is generally what we want, and here there is almost a drop of 10 points, so we are going to prefer the model which includes attitude. It also tells us the probability of a true effect of attitude on the frequency, and this is 0.0065.

Good, so that is everything you need to know about working with linear mixed effect models: make sure you check that the groups are assigned correctly, make sure to fit the models using maximum likelihood and not restricted maximum likelihood, and when you want to calculate a p-value, don't just throw the model into the anova function, but compare the model with your factor of interest against a model which does not contain it. Here we can look at the AIC and say: the AIC improved a lot; the model takes some more degrees of freedom, so it is a more complex model, but in the end the probability that attitude influences the frequency of your voice is very significant, at 0.0065.

So how do you now write this down in a publication? Politeness affected pitch: you report the chi-square value, which you can find in the ANOVA output, with its one degree of freedom, so chi-square(1); you report the test statistic and the p-value associated with it; and then you say: lowering it by about 19.7 Hz plus or minus 5.6 standard errors.
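Sketched out, the comparison looks like this; both fits use REML = FALSE so the likelihoods are comparable:

    politeness.null <- lmer(frequency ~ gender +
                              (1 | subject) + (1 | scenario),
                            data = politeness, REML = FALSE)
    politeness.full <- lmer(frequency ~ attitude + gender +
                              (1 | subject) + (1 | scenario),
                            data = politeness, REML = FALSE)
    anova(politeness.null, politeness.full)   # chi-square test for the attitude effect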
We get these numbers from the model we fitted before: under attitudepol in the output we see the estimate and its standard error.

So far we looked at random intercepts. We looked at our data and concluded that, yes, every individual has a slightly unique voice, a personal mean, and the same goes for every scenario. When you look at the random effects, at the per-group coefficients, you see for each scenario and each subject its own intercept, which is the mean of that question or that person, while the effect of attitudepol is the same drop of 19.7 Hz for all of them, because that is what we asked: how does politeness affect pitch across all individuals and all scenarios. And we also see the effect of genderM: males have a much lower voice than females, around 108 Hz lower, which is exactly what we would expect. We told the model to take into account that individuals are different and that questions are also different, so for every individual we get an estimate of their own mean voice, and for every scenario we get an estimate of the mean frequency at which it is spoken, which of course depends on the sounds in it, because a p is a different sound than an o or a w. Note that the fixed effects, attitude and gender, are the same for all subjects and all items. That is because our model is a random intercept model: we are only allowing the intercept, the average value, to differ from individual to individual and from question to question.
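You can see this directly with coef(), which prints the per-group coefficients:

    # each subject and each scenario gets its own intercept, while the
    # attitudepol and genderM columns are identical for all of them
    coef(politeness.model)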
So in this model we account for the baseline differences in pitch, but we assume that whatever the effect of politeness is, it is going to be the same for every person and every item. This is of course a very strong assumption. It might not be that every individual reacts the same to polite situations or informal situations, and it might be that not every individual shows the same pattern over the questions. When we look at the box plot for the questions, we see that question one is about average, question seven is very low in frequency, and question two is very high; but if we plotted the measurements of a single individual, they would not have to follow this pattern exactly. Some individuals might have a relatively low mean pitch for question number two, while others have a relatively high one. So we assume that the effects are the same for all items and subjects, but this might not be true. We might expect that some people are more polite than others: some people in a formal setting lower their voice by 30 Hz, while other people don't really care about the setting and lower their voice by only 5 Hz. If that is the case, we need not only a random intercept for every individual, we also need to allow the effect we observe to be individual-specific. Then we need a random slope model, where subjects and items are not only allowed to differ in their intercept, their mean, but are also allowed to have a different slope for the effect of politeness. Some individuals might lower their voice a lot, others much less, and it might even differ by gender: it might be that females lower their voice much less than males, or it could be the other way around. So there is a whole bunch of assumptions that go into a linear mixed effect model, and you have to be very aware that what you are modeling, and how you are modeling it, has a massive effect on your results.

If we want to build a random slope model, we say: the frequency of your voice depends on the attitude as a fixed effect and the gender as a fixed effect, but now we also allow the effect of attitude to vary. How do we write that down in R? We write 1 plus attitude, dependent on subject, and 1 plus attitude, dependent on scenario: (1 + attitude | subject) and (1 + attitude | scenario). In this case we fit a random intercept for subject and a random intercept for scenario, plus a random slope for attitude within subject and within scenario. That means that, in the output table, every scenario will get its own estimate for the politeness effect, instead of the standard 19.7 estimated across all individuals and all scenarios, and every subject will get its own politeness estimate as well. So we can build a model where per individual there is a different mean voice and per individual there is a different effect: some individuals might drop their voice a lot, some only a little, and the same across questions, where some questions might make you drop your voice much more than other questions. The notation (1 + attitude | subject) tells the model to expect different baseline levels of frequency, the intercept represented by the one, as well as a different response to the main factor in question, attitude.
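A sketch of that random slope fit:

    politeness.model2 <- lmer(frequency ~ attitude + gender +
                                (1 + attitude | subject) +
                                (1 + attitude | scenario),
                              data = politeness, REML = FALSE)
    coef(politeness.model2)   # the attitude estimate now varies per subject and scenario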
Good, and I want to leave it there. I told you about mixed models and I told you about random effects. So what is a random intercept model? A random intercept model means that you allow every individual to have its own intercept, its own average value for something. A random slope model means that you not only allow the individuals to differ in their mean value, but you also allow individuals to differ in their response; in the end you still get an average effect, an average slope across all the different slopes you find for the different individuals, and that is what you write down. For the assignment for today, I want you to read the tutorial, because the tutorial is 29 pages and is much more extensive and comprehensive than me trying to explain it to you, and then of course there are the practice, practice, practice exercises that you can always continue with online. I also wanted to give you an example from my own research, which for me is easier to explain, though for you it is probably harder, because it is specific to bioinformatics and it is about QTL mapping. That is why I did not use my own example to explain what a random intercept and a random slope are. So here I want to give you an example of linear mixed model multiple QTL time series mapping, and I will go through it step by step, but first we are going to take a very short break, because this is the second topic for today.

The second break is going to have honey badgers. So I will start the music; sorry, that is really loud in my ears, so I am going to turn it down a bit so I don't blow out my own ears. I am going to switch you to the break; the music is going to be a little shorter, about six minutes, so we will continue at 3:50-ish, and then we will run through an example of how you can use linear mixed models to improve the discovery of genes in the genome which influence fat mass, by doing a time series analysis instead of single body weight measurements.

Alright, so remember: this is difficult, this is state-of-the-art statistics, something that has not been published before, no one has done this exact thing. And linear mixed effect models themselves are difficult. They are used in research nowadays a lot more than plain linear models, but a lot of the papers you will read online, even from just a couple of years ago, use standard linear models, and only when the experimental design forces them do people start reaching for linear mixed models. So here is the overview of what we are going to talk about in this second part. I want to give you a short introduction to linear mixed models, a short introduction to multiple QTL mapping, and I will talk a little bit about the Berlin Fat Mouse advanced intercross line, so that you know the different topics we are using. Then I will talk mostly about model selection, so how you find the best-fitting model using the AIC, the information criterion. Besides that I will talk about litter size, litter number and litter types, and about growth curves and how you can model them, and then I will quickly show you the results that we found, why we think this is really interesting, and why we think the methodology we came up with is much better than how people do it nowadays.

Of course we just had a whole hour on linear mixed models, but here is the very short introduction; I think I originally took it from Wikipedia. (If you want to learn more about QTL mapping, there is a QTL mapping lecture in the bioinformatics lecture series that goes into great detail on how you do QTL mapping; multiple QTL mapping is slightly different and slightly more complex, and I hope I can explain to you how that works.) So, like we said, linear mixed models are an extension of linear models: they allow combining fixed and random effects. Fixed effects are model parameters that are fixed, that do not vary across your subjects or across your questions. The nice advantage of a fixed effect is that it is an unbiased estimate of a parameter, because it does not depend on anything: not on the individual you measured, not on who its father or mother is. A random effect, on the other hand, is a model parameter that is treated as a random variable; random effects are things like random intercepts or random slopes.
This allows us to have a hierarchy between variables: we can say that we want every individual to be allowed its own individual mean. The advantage is that it is efficient, and mixed effect models are good at dealing with things like repeated measurements: if you have measured the same individual over and over and over again across a time period, you need random effects to capture the fact that these measurements come from the same individual.

As an example (I originally pulled this from Wikipedia when I was explaining it in our group): we have m large elementary schools from a single country, we have n pupils randomly chosen at each school, and y_ij is the score of pupil j at school i. So we have a hierarchy: pupils go to schools, the schools are located in a country, and the pupils have scores, for example from their final exam. Wikipedia explains the random effects like this: there is an average test score for the entire population, called mu, but each school is a kind of group, because a school also has its own average test score, just like in the Bodo Winter example every individual has their own individual voice. Every school has more or less its own education and its own teachers, so school is a grouping factor: all the pupils collectively going to one school are more similar to each other than two pupils who go to two different schools, because being at school one versus school two means a different type of education, different teachers, different equipment, these kinds of things. And then they also allow for an individual-specific random effect, because every student is different: you can be a poor student at a very good school, and the worst student at a very good school can still be better than the best student at a very poor school, or the other way around. So there is a lot of hierarchy in just this very basic setup, where you have different schools, you take a couple of pupils from each school, and you try to estimate how good a school is relative to another school.

Fixed effects can capture differences in score among groups that cut across schools. For example, males and females can have different scores depending on whether a subject is more male- or female-associated; the ethnicity of a pupil can have an influence on test performance; and you could also think that the education level of a child's parents influences how the child performs on the test. So in the end we have a score for each individual, a big vector of all the individuals we measured; we record whether they are male or female and what their parents' education was; and then we also have the random effects: the school-specific random effect, and on top of that the individual-specific random effect, because every individual is just different. That is how you would build up a model like this: three fixed effects which we are interested in, school as a grouping factor, because all pupils going to the same school are allowed a school-dependent average, and furthermore every individual is different, so every individual is allowed its own score on the test.
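In lmer notation this could be sketched as follows; the data frame pupils and its columns score, sex, ethnicity, parent_edu and school are hypothetical names for this example:

    # school is the grouping factor; with a single test score per pupil,
    # the pupil-specific effect is absorbed by the residual term
    school.model <- lmer(score ~ sex + ethnicity + parent_edu + (1 | school),
                         data = pupils)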
Good. Now, normally when we do QTL mapping (and definitely look at the QTL mapping lecture I did in the bioinformatics course if you want to understand this part better, because if you don't understand what QTL mapping is, you are not going to understand what multiple QTL mapping is), QTL mapping is basically this: we have a phenotype, a vector of values, and we have measured a marker in the genome, which has just three possible states; at a SNP you can be homozygous reference, heterozygous, or homozygous alternative, and we test whether the marker has an effect on the phenotype.

In multiple QTL mapping we extend this. We say: no, we should not look at a single marker in the genome and ask whether it has an effect on the phenotype we are interested in, we should look at multiple markers at the same time. We should account for the fact that, for example, there is a marker that we already know from previous research has a big effect. So what it does is account for known genetic effects. Think of obesity: it has not been studied just once, we have 50 years of literature looking for genetic determinants, pieces of the genome involved in obesity. So when we do a scan across the genome, we should not consider each marker independently; we should consider the marker we are currently looking at while also considering the other markers we already know have an influence. Because we account for these known genetic effects, multiple QTL mapping has more power to detect other effects than standard QTL mapping. It also allows us to disentangle QTLs which are in close proximity to each other, and to deal with QTLs with opposite directions of effect.

Multiple QTL mapping is mostly based on model selection: we first figure out the best genetic model, so which markers in the genome we need to account for to predict our phenotype, and then, using this best model, we add in every marker to see if it still has an effect after accounting for all the known genetic effects.

How does this work? This picture, made by my former professor in Groningen, shows the different models. At the bottom we see a chromosome, with markers on it, the little r's at the bottom: for example a chromosome which is 155 centimorgan long (a centimorgan is a unit of genetic distance), and there are two regions on this chromosome where we know there is something controlling the phenotype we are currently looking at. Now, when we are mapping a marker at the beginning of the chromosome, we are in something called region 1, because region 1 is not near any of the markers we know have an effect, so the model includes not only the marker we are looking at but also the two markers with known genetic effects: we first compensate for the two known genetic effects and then check whether the marker still has an effect on our phenotype. The same holds for the other parts of region 1, far from the known markers. However, when we enter region 2, close to the first known marker, we have to drop that marker from the model: there we ask whether a marker close to the known QTL still has a genetic effect when we compensate only for the other known marker. And in region 3 we do the opposite: to test a marker there, we drop the known marker that is nearby and include only the other one, and ask whether, after compensating for the remaining known marker, our test marker still shows an association. This gives us a much more fine-grained view of the genome, and that, in essence, is what multiple QTL mapping is.
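As a rough sketch of this idea, not the actual MQM implementation: phenotype, known1, known2, marker and geno are hypothetical names, with known1 and known2 the two markers with known effects and marker the position currently being tested:

    fit.r1 <- lm(phenotype ~ known1 + known2 + marker, data = geno)  # region 1: far from both
    fit.r2 <- lm(phenotype ~ known2 + marker, data = geno)           # region 2: drop nearby known1
    fit.r3 <- lm(phenotype ~ known1 + marker, data = geno)           # region 3: drop nearby known2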
Good. The mouse model we are going to investigate is a model organism for polygenic obesity: the Berlin Fat Mouse, which you see here, next to the more or less standard B6 laboratory mouse strain. We already know a lot about these mice. We know, for example, that a big part of the obesity is explained by the subfamily, so which family you are born into; we know that litter number and litter size have a big effect; we know that the season in which they are born has a minor effect on their obesity; and we know that there is one locus in the genome, the jObes1 (juvenile obesity 1) locus on chromosome 3, which explains around 40% of the variation that we see in the obesity. So it is not just genetics, it is also environment: which mother did you have. Our Berlin Fat Mouse has been selected for high fat percentage for generations, almost 100 generations, and not only that, it shows several comorbid features: some Berlin Fat Mice are type 2 diabetic, others have high levels of triglycerides, these kinds of things. So they are a very good model organism for the obesity that occurs in people who eat the wrong diet and become fat; they then look more or less like a Berlin Fat Mouse, in a way. So we always say the Berlin Fat Mouse is a good model organism, and not just for obesity but also for the type 2 diabetes which occurs in tandem with the obesity.

So what did we do? We have 344 of these mice. They are AIL mice, advanced intercross line mice, which means they are descended from a Berlin Fat Mouse crossed with a B6: we take one BFMI, we take one B6, we put them in a cage, they get children; we take these children and mate them brother to sister, and then we get the second generation; and from the second generation we again randomly mate brothers with sisters, and so on, for 28 generations. In generation 28 we had 344 individuals, and these individuals have part of their genome from the original BFMI and part from the original B6.
So what we need to do is measure every section of the genome for each mouse. For each of these 344 mice we measured markers, and for each measurement we know whether this individual carries, at this position, two chromosomes from the Berlin Fat Mouse, two chromosomes from the B6, or one from each. So at each marker in the genome there are three possible states. After quality control we had measured 17,971 of these genetic markers in total. On top of that, for each of these 344 individuals we have time series data on their body weight: we know how much each individual weighs at day 21, day 28, day 35 and so on. So it is a very, very large data set, with seven time points at which we weighed the mice and around 18,000 positions measured in the genome of each individual.

What we are going to do is model selection. Model selection is the task of selecting a statistical model from a set of candidate models, and we can use, for example, the AIC criterion to estimate the relative quality of the statistical models. As I already told you when we talked about the AIC: the lower the AIC, the better the model fits. So what we are doing is making models comparable: we have two different statistical models and we want to ask which one is better, and for the AIC, lower is better.

As an example of model selection, let's get into some biology. A mouse has a litter size: the number of pups born in its litter. But of course a mouse does not become pregnant just once during its life, so we also have a litter number: the nth litter of the female. A female has her first litter with, say, eight pups, then a second litter with seven, then a third litter with five, and so on. We can encode this in different ways, because biologically speaking we know that when a mouse is pregnant for the first time, the litter that comes out is relatively light. There is no real difference in the number of pups born between the first and the second litter, but the first litter just has less mass, because it is the first time the mouse became pregnant. The second and third litters are not really different from each other, but the first litter is very different from every litter the mouse will have later on. So we can encode litter number as litter A, litter B, litter C, and so on; in our data set that gives five levels, because we have mice born from the first litter up to the fifth. But we know, biologically speaking, that it is probably better to encode the litter number as a factor with two levels: F for the first litter, and N for not-the-first litter, so any other litter, be it the second, third, fourth or fifth. And then we can combine these two things and create something called a litter type, which is a combination of the litter size and the litter number. Encoded with five levels, LT5, we say for example A8, meaning this was the first litter and eight pups were born, or B10, meaning this was the second litter and ten pups were born.
C12 means this was the third litter and twelve pups were born. But we can also encode it more compactly, first versus not-first, as LT2: A8, the first litter with eight pups, translates to F8; B10, the second litter with ten pups, becomes N10, not the first litter; and C12 becomes N12, not the first litter, with twelve pups. (I don't know why that stray F10 is on the slide; it should not be there.) If we do this, LT5 ends up with many more levels than LT2, because with LT5 we are dealing with five different levels of litter, while with LT2 we are dealing with only two. So LT2 is statistically more powerful than having five different levels, and the biology still holds true: the first litter is different from every litter after it.

So we modeled this. We made our null model: the phenotype, the body weight, depends on the father you were born to, because your father has a big effect on your body weight, and then every individual gets a random effect, because we measured every individual not once but at seven different time points. Then we compare different models. Here you see all the different candidate models, and the number of degrees of freedom taken up by each. There were 30 fathers in our population, so there are always around 30 degrees of freedom that we need to spend on the betas for the father effect, because every father gets its own average. Then we have the random effect (1 | individual), and then we can add the other effects: LN2, LN5, LN2 plus litter size, LN5 plus litter size, litter type 2, or litter type 5. You can see that each variant uses a different number of degrees of freedom. That is because with LN2 we just say: it is the first litter or it is not, while with LN5 we say: first, second, third, fourth or fifth litter. It is the same covariate, the same factor, coded here with two levels and there with five; then we add litter size as a numeric variable to each; and then there are the two litter types that we defined. When we compare all of these models to each other (you can only compare models to each other, not to some absolute standard), we find in the end that we get the best AIC improvement across all the different models when we encode it as litter type 2. So that is the first step in our analysis: get a minimal model which captures the biological phenomenon that individuals born from the same father are more similar, that every individual has its own birth weight, its own random intercept, that the number of brothers and sisters you had, the litter size, has an effect, and that it matters whether your mother was pregnant for the first time or for the fourth or fifth time. So we continued, having figured out that litter type 2 is the minimal, best way to encode
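Sketched with hypothetical names (mice, weight, father, individual, littertype2, littertype5), the comparison could look like this:

    m.null <- lmer(weight ~ father + (1 | individual), data = mice, REML = FALSE)
    m.lt2  <- lmer(weight ~ father + littertype2 + (1 | individual),
                   data = mice, REML = FALSE)
    m.lt5  <- lmer(weight ~ father + littertype5 + (1 | individual),
                   data = mice, REML = FALSE)
    AIC(m.null, m.lt2, m.lt5)   # lower is better; LT2 gave the best improvement here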
this biological phenomenon for the early body weight of a mouse. But then of course we need to add other things to the model and see if they have an effect. So first we said: it might be that season has an effect; if you are born in the summer, you might have a different body weight, a different growth curve, compared to individuals not born in the summer. But the AIC did not drop enough, so we say that season should not be a fixed effect: season has no effect on the body weight of our mice. Then we added time as a fixed effect, and we checked whether time should also be a random slope, and we figured out that yes, it should: every individual has its own growth trajectory, so there is not only an overall linear effect of time, every individual also has its own individual time slope. Then we added time to the power of 2 and time to the power of 3 and saw that the model continued to improve, but when we started adding time to the power of 4, the model did not improve anymore, so time to the power of 4 should not be a fixed effect. After that we started the multiple QTL mapping: we put in the known effect, the jObes1 marker on chromosome 3, where we know there is a massive effect on the body weight of our mice, because this is the thing that makes you a fat mouse or a lean mouse, and we also allowed it to have an interaction with time. This again improved our model very significantly, which told us that the jObes1 marker and its interaction with time should both be included in the model.

So how do you visualize this? We fitted a bunch of models, and all of these models are difficult to understand when you cannot see them. Here you see all of our data: every cloud of dots has 344 dots, 344 measurements, one per mouse, at each time point. One thing I did is shift the time axis: the first time we weighed a mouse was at 21 days of age, so I put time since the start of the experiment on the axis; when we first weighed a mouse the clock started running, so the first measurements are at 0, which is actually 21 days of age, but shifted to 0 because it is easier. The most basic model we can come up with is a mean-only model: we don't care about time, we don't care about the growth curve, the best model we can find is just the average over all time points. So we just say that the only thing influencing your body weight is the average body weight we saw during the experiment. This is of course a horrible model, just a single straight line. We do not show any of the random effects here: every individual is allowed its own intercept, so besides this global mean line there are in theory 344 additional lines, which I am not showing because then the figure would be incomprehensible. But this is the most basic model you can fit. Now let's step through the modeling procedure we did, step by step, adding in all of these factors, and see how the fit of the model changes.
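Putting the selected pieces together, the minimal model could be sketched like this, again with hypothetical column names; time is days since the first weighing and jobes1 the genotype at the known chromosome 3 locus:

    m.growth <- lmer(weight ~ father + littertype2 +
                       time + I(time^2) + I(time^3) +   # cubic growth curve
                       jobes1 * time +                  # known QTL and its time interaction
                       (1 + time | individual),         # per-mouse intercept and slope
                     data = mice, REML = FALSE)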
You can see that the residuals of the mean-only model are horrible: an individual far from the line adds something like 25 squared units of misfit, and the model should follow the data as well as possible, which this one does not. So the first thing we did is include time. When we include time, the model starts fitting the data better; it does not fit perfectly yet, but a lot better than the original model with its massive residuals. Then, since we also selected time to the power of 2, we see that the line starts fitting the average of the distribution quite well; this is already a relatively good model describing our growth curve. From our modeling approach we also learned that we should include time to the power of 3. The difference is not that big, so I will wiggle back and forth between the two a couple of times: you can see that with time to the power of 3 the mice seem to grow slightly quicker at the beginning than at the end of the experiment, with a more stationary phase at the end where they are not growing as fast, so this model does seem to fit slightly better. Of course we also added the father: if you add father as a fixed effect, you no longer have one line, you have 30 lines, because we have 30 different fathers. With father, time, time squared and time cubed, the model fits the growth curve very well at the beginning; at the end the prediction is not as good as we expected, but of course we had not included all of our factors yet. The next thing we included is the litter type 2 that we found. Now the model starts fitting quite well. You see it is a bit wonky at the beginning, because it tries to fit the whole curve, and the beginning is harder to define: at three weeks of age the number of siblings you had has a massive effect on your body weight, and this effect becomes smaller and smaller over time. By including litter type 2, the fan of lines widens a little at the beginning, and we start seeing that the model fits pretty well. Still, there are a lot of individuals that are much bigger at the end of the experiment than the model expects, but that is because we had not included any genetic effects yet: at this point we are only modeling the body weight with fixed effects like family and litter type, plus the fact that individuals grow over time. So what happens when we start adding in the BFMI locus that we found on chromosome 3, the one with the massive effect on body weight?
Now we start seeing lines in different colors, because at this locus you can carry two chromosomes from the B6, the lean mouse, two chromosomes from the fat mouse, or a mix, which we call heterozygous. When we include this genetic effect, we do indeed explain much more variance at the end of the curve. Our predictions are not exactly spot-on, but the residuals from the points to the lines are much smaller than in the previous model. And we see that individuals carrying the BFMI allele end up much bigger on average than individuals that inherited a B6 allele or are heterozygous at this locus. This is our minimal model, the model that we used, and now we start adding in other genetic markers: we have this big model which explains a large part of the body weight, and now we want to know whether there are other markers in the genome which, when included, improve the fit of the model, improve the prediction so to speak.

So we scanned the genome. Every dot here is a marker on chromosome 1, and on the vertical axis we see the minus log 10 p-value, so the association: a score of 2 corresponds to a p-value of 1 times 10 to the minus 2, so 0.01, for this marker's association, and you also see the 5% and 10% significance lines. At each position we add the marker X we are looking at, scanning from proximal to distal along the chromosome, and we do this for each of the chromosomes. What we observe is that, yes, we find new loci which we have never seen before. There seem to be additional loci besides the marker on chromosome 3 which influence the body weight, but they do not influence the body weight at individual time points significantly: they influence the growth curve of the mouse across the whole time period that we look at. And this is the power of these linear mixed effect models: we can now look at the whole growth curve. We don't have to limit ourselves to a single time point and ask whether the body weight at that time point is controlled by a marker in the genome. No, we account for the massive jObes1 marker, and after correcting for it, we ask: can we improve the fit of our model to the growth curves? And indeed there are still markers in the genome which improve the growth curve fit of our model, so somewhere under such a peak we expect a gene which is increasing or decreasing the body weight.
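Conceptually, the scan just described could be sketched like this; markers is a hypothetical individuals-by-markers genotype matrix, and mice$id is assumed to index its rows:

    pvals <- apply(markers, 2, function(mk) {
      mice$marker <- mk[mice$id]   # expand the per-mouse genotype to all time points
      m.qtl <- lmer(weight ~ father + littertype2 +
                      time + I(time^2) + I(time^3) + jobes1 * time + marker +
                      (1 + time | individual),
                    data = mice, REML = FALSE)
      anova(m.growth, m.qtl)[2, "Pr(>Chisq)"]   # likelihood ratio test per marker
    })
    plot(-log10(pvals))   # one point per marker, as in the chromosome plots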
We did this for all of the chromosomes, and in total we detected five new QTLs. Here you see where they are located: one on chromosome 1, two additional ones on chromosome 3, one on chromosome 9 and one on chromosome 19. You also see the effects relative to the B6N, so relative to carrying two alleles from the standard lean mouse, and what we can see is that the BFMI effect at jObes1 is much, much bigger than all of the effects we detect at these other regions. But still, some of these other regions have about half the effect of the BFMI locus, and we never detected them before, because we never modeled the whole growth curve: we always looked at individual time points and asked whether, at this time point, there is a marker in the genome that can explain the body weight, but we never asked whether a marker influences the shape of the growth curve of the individual mouse. On chromosome 3 we found two additional QTLs next to the massive jObes1 locus, which were hidden in the original mapping due to linkage; this is an advantage of multiple QTL mapping, but it is not that important for this lecture.

One of the interesting things we found is a little drop at what we call novel region 3: a massive dip in the LOD scores. When we zoom into that region (the blue line here is the same as the blue line in the overview), we see this dip: at this point there does not seem to be any association, but that is because one of the genotype groups is completely missing. At this marker we only find individuals which are either heterozygous or homozygous BFMI; we find no individuals in our population which are homozygous B6N. And there is only one gene in this region, and when we start looking at this gene, it is actually one of the main regulators of the insulin pathway. If we then look at all five regions that we found, we actually find that multiple genes within the insulin signaling pathway are located in our novel QTL regions.

So we think this is a much better, more powerful method to detect QTLs: instead of looking at individual time points and asking whether a marker influences our phenotype, we should start from the markers we already know have an effect on the phenotype and ask whether we can find markers that affect the growth curve of individuals. We think it is more sensitive because it uses all the available data; it is not limited to a single time point, it is not limited to your own experiment, because you can include things you already know from previous experiments, and you can correct for known genetic effects on your phenotype. We detected five novel loci, and within NR3 we see segregation distortion, with FOXO1 the only gene inside this region of segregation distortion. So we do believe that in these Berlin Fat Mice something is wrong with their insulin signaling pathway, because in each of the regions we identified, we find genes located in this pathway.

Good, so that was what I wanted to tell you for today. In the first part we talked about random effects and mixed models, and I tried to explain what a random intercept and a random slope are. A random intercept is when every individual is allowed to have its own mean phenotypic value, its own mean voice, its own personal growth curve. A random slope model is when you allow the factor of interest to differ between individuals: some people don't bother being polite, they always talk the same whether they are talking to a teacher or to the president, and other people do care, they do feel pressure when talking to the president.
They will use their sweet voice, so to speak. That is when the effect of the thing under investigation is allowed to vary per individual, and we call this a random slope, because it is not just the intercept but the slope that differs.

The assignment for today is relatively easy; there is no programming involved. It is just reading the tutorial and getting clear for yourself what a random intercept is and what a random slope is, and thinking about how you can use this in your own research. In the end you will be building models for your own data: you might have data on plants or planned field research, you might be working in sociology or psychology, but these issues we face in science are the same for everyone, because everyone has to deal with the fact that some of their data is not independent. And of course, do practice: do the additional assignments on Moodle or on my website, or find some other source of programming exercises, because the only way you are going to learn programming is to sit down behind the computer and program stuff. And of course I gave you my example of how linear mixed models can improve QTL detection in our Berlin Fat Mouse, and I hope it was a little bit convincing. The paper is not under review right now; we already got the reviews back, so I have to answer the reviewers' comments and then we can resubmit, and I hope it will be published soon. But I thought it would be good for you to see how you can use these random intercept and random slope models to improve prediction and detection of things like genes for interesting phenotypes. It is just an example, to make it a little more personal.

Good, I think that was it for today. If there are any questions, speak now or forever be silent; if not, I will see you again next week, when we will talk about the last part of linear models. We had basic linear models and multiple linear models last week, linear mixed effect models this week, and next week we will talk about generalized linear models, and then we have done more or less all the statistics: in earlier lectures we dealt with things like t-tests and non-parametric statistics, then we did standard regression, now linear mixed effect models, and next week generalized linear models. Then you will all be statistical experts, or at least have all the tools to model your data like a real statistical professional.

Good, good. I'm looking at the chat and I'm not seeing any questions, so for me this is it, and I hope to see you next week. Enjoy the beautiful, beautiful weather; when I look outside my window it is just blue. "Thank you for the lecture. Dropped an email wondering if we could print together; let me know." We will have to pick a date. This week I'm really busy, because we are still doing dissections and such and I need to help, but next week we can find a day where I can show you the printer and you can print something, that's not an issue. Alright, Misha, see you next week, enjoy your pizza! We had ours yesterday; it was really nice to get home in time, and I had a drinking date and that kind of thing, so no pizza for me today.
I already had mine yesterday, and it was indeed very good; I enjoyed it a lot. Alright, then thank you all for being here and I will see you... what do you mean, what? Yeah, well, I ate my pizza yesterday. Oh, drinking, no, I didn't drink alcohol. "Cutting up bats is cool." No, it's not cool, it's very sad, because they are really, really sweet, and it always weighs on your soul. Animal research is never cool, it's just necessary, and even that is a strange thing to say, but it is necessary in a way, and it's not cool, it's actually pretty sad. But as a bioinformatician, someone who sits behind the computer the whole time, it is good to see where your data comes from, and it is good to know the peculiarities that happen when you do these massive dissections, because there are a lot of people involved and things need to be coordinated so that things go right. "Cool" is not the right word, no, I know what you mean; "fun" is not quite the right word either, it's a fun project to work on, but it is difficult. "I will be making Ottolenghi butter noodles with poached egg and spring onions." That sounds very, very good. Alright, then I'm going to quit the stream very quickly and run home to get my own butter noodles with poached egg and spring onions. So I will see you next week; I hope you enjoyed the lecture, and thank you for being here, and thank you for subscribing as well. We are almost at 1500 subscribers; I never thought that would be possible when I uploaded my first video to YouTube. I thought, well, if 20 people subscribe and 20 people like it, then it will be interesting, but no, it's actually quite good. So, bye-bye, see you next time, thanks for being here, and I will see you next week, same time, same place. Thanks!