to recording. So welcome, people on YouTube and also on Moodle. I've recently started to upload all of the lectures to YouTube as well, so welcome, YouTube. I'm hoping that will catch on and I can make a shit ton of money and become YouTube famous. I don't think that will happen, and I actually need a thousand followers on YouTube before I can earn a single cent from them, which is strange, but that doesn't matter.

So the stream will start soon. Nope, we already started. Today we will be doing algorithms and functions. I put this lecture in because last week we had statistical testing and next week we will have more statistical testing, and since I don't want to talk about statistics the whole time, I want to talk at a slightly higher level about programming and introduce you to some concepts which are very general and not exclusive to R; I will show you how you can do them in R, so that will be interesting. I adapted it from a previous lecture that I gave at King's College in the UK.

So today we will go through the answers to the exercises, of course, and then I have a little bit of motivation for you, since we're halfway. If you need a little bit of motivation, that's possible, and if you need a little bit of demotivation because you're too excited about the lectures, that's perfectly fine too. Then we will do a lot of theory, because theory is something I can ask exam questions about; I know that people don't like to write down programs on paper when we do the exam, and it's much nicer to answer questions. To have questions, we have to talk about things like algorithms and design patterns: what is recursion, and what are higher-order functions? And then some more exercises about recursive functions, which I think is always interesting.
All right, so let's start. I will go to my Notepad++ window, because we want to do the answers to lecture number six. Lecture six was kind of a difficult one, I think; during the Tuesday talk we had a lot of questions about some fundamental things, and I hope I will be able to remember all of the things that people ran into. Of course, if you have questions or anything like that, just throw them directly in the chat, and I can help you.

So again, like I always do, I start off by writing down what the file is for. Always remember to add a header to your file, so you know what's in the file, when you modified it, when you first wrote it, and who owns the copyright to the file. The other thing I generally structure is this: my files start with the header, and then, if I am using libraries or external packages in R, the library calls come next, so that people who open the file and want to use the code can directly see "oh, I need to have these six packages installed". It's not an absolute requirement, but it's nicer for the people that use your code, because they can see up front which packages to install, instead of going through the script and discovering at line one hundred that they have to install another package. I try to put this above the set-working-directory call, because you don't need a working directory to install a package.

Then I set my working directory to where I've stored my files. Today we are going to use two files: one called ArrayData and the other called Arrays. This is the way that I loaded them in; of course you can load them in a different way. I'm using the read.table function, but you can use the read.csv function as well.
The only thing you have to be aware of is that you have to set check.names to FALSE when reading the ArrayData, because some of the column names are not proper R variable names; setting check.names to FALSE keeps them as-is.

All right, so let's load it in and go to R. I already loaded a little bit in, so let me hide that, because I wanted to make sure that I had the library installed. First I want to talk a little bit about the files that I gave you. The ArrayData contains microarray data, because we discussed microarrays last time. The ArrayData looks like this: you have the name of the probe on the array. GE BrightCorner is the positive control, a sequence which is always available, so it should be pretty much 100% on. Then you have the DarkCorner: the corner which doesn't have a probe, so it should always be off; that is our negative control. Then come the actual probes, which have these cryptic names like A_55_P-something, and next to each is the sequence that is targeted by the probe, the sequence which is located on the microarray. Then of course we have the different samples, and for each sample we have the intensity value, the raw intensity as read out from the array.

Furthermore, we have our Arrays file, and I realize it's a little bit stupid of me to call one variable ArrayData and the other one Arrays, because they are very similar. The Arrays file looks like this: it has the file name, which is the original name of the file we got from the company, because we don't do these things here in our own lab; we send DNA or RNA away to a company and they run the microarrays, because we don't have the equipment here. Then you have the Atlas ID, which is the ID assigned by the company; the company is called Atlas.
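The effect of that check.names setting can be shown on a tiny inline table; the column names below are made up for illustration, not taken from the real file:

```r
# check.names = FALSE keeps column names exactly as they appear in the file;
# with the default (TRUE), R rewrites them into valid variable names.
tab <- "Probe\tHD 860-S\tGF 860-S\nDarkCorner\t0.1\t0.2\n"
fixed <- read.table(text = tab, sep = "\t", header = TRUE)                       # names mangled
asis  <- read.table(text = tab, sep = "\t", header = TRUE, check.names = FALSE)  # names kept
colnames(fixed)  # "Probe" "HD.860.S" "GF.860.S"
colnames(asis)   # "Probe" "HD 860-S" "GF 860-S"
```

Keeping the names as-is matters later, because we look columns up by the exact sample names from the annotation file.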
Then you have the Strain column. Strain stands for the type of mouse; in this case we have a mouse called BFMI, the Berlin Fat Mouse Inbred line, and there are a couple more. If we look at the whole file, we can see there is also B6N, the standard laboratory mouse, and F1, which is a cross between the Berlin Fat Mouse and the B6 mouse; that's just some extra info. The Tissue column contains the different tissues on which the samples were measured, where HD stands for hypothalamus, and Individual is the individual ID, the ID of the mouse in the mouse house. We can see, for example, that from this individual we took hypothalamus, and from the same individual we also took gonadal fat, so the individual IDs can be repeated; that just means we took two tissues from a single animal.

All right, so those are the two files that we have, and now we will start doing some analysis. Let's go back to the assignments. The first thing I asked you to do is make a box plot to show the distribution of the data. If we look at the file, the array data has this Sequence column that we need to get rid of, and there are two ways of doing this, both of them valid: we can't make a box plot of sequence data, because sequences are characters and boxplot needs numerical values, so we can only use the real measurement values. The way I did it here is to use arrays$Atlas_ID, the column that has the company IDs; all of those occur as column names in the array data, so I'm selecting only the columns of the array data which have numerical values in them. Then I just call boxplot, because boxplot goes through the columns and plots them as box plots one by one. So how does that look? Well, if we go back to the R window and do the
basic box plot, it takes a while, and we will soon see why. Let me actually set las = 2 so that it rotates the axis labels; it takes a while again, but now we can read all of the sample names at the bottom. You can see that the box is not even visible; there are massive amounts of what look like outliers. That is because it's microarray data, and raw microarray data doesn't follow a normal distribution at all. You have genes which are off, a whole bunch of them, and then some genes which are on in a certain tissue; if we look at fat tissue, for example, more than 60% of genes are off. So there is a big over-representation of zero or very low intensities in our data set, and then of course you have the genes which are on, but those are the minority. It's not a normal distribution at all; we have to remember that and start dealing with it.

Just to have that clear: here we see the different samples, where HD again stands for one tissue, with the animal IDs attached, and we can now read them. That's what las = 2 does: it rotates the axis labels 90 degrees so you can read them better.

The first thing we want to do to the array data is the log2 normalization. For that I use the apply function. Again, this is the part of the array data which has the numeric values, so I repeat the same selection: from the array data, select the columns which are in arrays$Atlas_ID, then apply a log2 transformation to those columns, and directly assign the result back into the same columns. So I'm taking the columns out, doing a log2 transformation column by column, and putting the whole matrix back in from where I took it.
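Both steps use the same column-selection idiom, so here is a small sketch of the selection, the box plot, and the in-place log2 on made-up toy data (the object names mirror the lecture's but are assumptions):

```r
# Toy stand-ins for the real ArrayData / Arrays objects
arraydata <- data.frame(Sequence = c("ACGT", "TTGA"),
                        S1 = c(2, 8), S2 = c(4, 16), S3 = c(2, 32))
arrays <- data.frame(Atlas_ID = c("S1", "S2", "S3"))

# arrays$Atlas_ID names exactly the measurement columns, so indexing with it
# drops the Sequence column; boxplot() then plots the columns one by one
pdf(tempfile(fileext = ".pdf"))                  # throwaway device for a headless run
boxplot(arraydata[, arrays$Atlas_ID], las = 2)   # las = 2 rotates the axis labels
dev.off()

# log2 transform, column by column, assigned straight back into the same
# columns: destructive, so this line must only ever run once per session
arraydata[, arrays$Atlas_ID] <- apply(arraydata[, arrays$Atlas_ID], 2, log2)
arraydata$S1  # 1 3
```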
I directly overwrite my data, and this is of course a destructive operation, because I overwrite the old values. Generally I don't like doing that, but since the data is relatively big, I'm going to use destructive operations throughout. That means that if I run this line twice, it will do the log2 twice, which is of course not what I want, so I have to remember that I cannot just restart the script from any point; I always have to start the script from the beginning, to make sure I don't apply an operation twice by running the same code twice.

So let's do the log2 transformation and directly make a box plot again. Is it clear why I'm using this system? Can anyone in chat tell me what I could have done instead to get rid of this Sequence column? There's a much, much easier way. If you know it, just throw it in chat. So we copy it over, go to R, do the box plot, and now the box plot already looks a lot better; this is the same box plot that we saw in the assignments.

No answers; everyone's sleeping again. I know, it's beautiful weather, so I don't blame you for not answering in chat, but it would be appreciated. "Use the subset function": yeah, that's true, but that's again a big function. The thing I was getting at, which came up during the last Tuesday discussion, is that you can just say arraydata with minus one: throw away the first column and save the result back into arraydata. Just remove the first column by saying minus one. That's one of the nice things in R; there are not many other programming languages in which you can delete a single column like that. Someone in chat says they also used subset without the
sequence column. Yeah, you can do that as well, but in this case, since we only want to get rid of one column, we can just say minus one, it will drop the first column, and we write the result back into arraydata. Of course, from now on we don't have the sequences anymore, but that's what we want.

Florian, welcome to the lecture. Why are you not a VIP? Why don't you have a nice diamond in front of your name? That's so bad. I don't think my moderator can do that, but let me see if I can: there you are, I'm making you a VIP. Welcome, welcome. "No VIP, no salary"? How do you mean, no salary? Well, I made you a VIP. "You can ban him"? No, well, you could just mute him for 60 seconds or something, but he's a VIP now, so it should be all fine.

All right, so this is something we could have done as well: just throw the column away. But you can use the subset function too; that's perfectly fine. So what do we see when we do the log2 transformation? Let me flip around the axes as well. We now see something which is very common in microarray data: every array has a slightly different mean, and this is just because of random noise in the microarrays; every time you scan a microarray the conditions are slightly different, and that shows up as slightly different means. But we also see something very important that we have to remember for the analysis later on: the hypothalamus samples, the samples from brain, have a higher average expression than the ones from gonadal fat. So when we do the normalization and give every array the same mean, we are introducing a certain bias, which we kind of should not do. During the Tuesday
open discussion we talked about this and how you could handle it; when I upload the file to Moodle you can see that. But you have to remember that when we normalize now, we are losing some real biological information, and that's of course what we don't want: normally when we normalize, we don't want to lose the fact that in the brain more genes are on than in fat tissue. That's something very fundamental, but we can't really deal with it at this point, and the assignment just asks you to do a quantile normalization.

So again I do the same thing: I take out the data part of the array data, and I say as.matrix, because I have to force it to be of type matrix, and then we use the normalize.quantiles function, which does quantile normalization for us. Again I store the result back into the array data, and then I make a box plot. After that there should not be any difference between the arrays anymore; we should be rid of this pattern where some of them are high, some are low, some have outliers, and some don't have the dark corner expressed at a low level. And indeed, after the normalize.quantiles call we see that every array has the same median and the same quantiles, and they also have no outliers anymore, because now everything falls within two standard deviations. It basically forces every array to have the same distribution. There are advantages to doing this, because it makes your statistical testing better, but there are also disadvantages, which we will see further on when we discuss the rest of the assignments.
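The lecture uses normalize.quantiles() from the preprocessCore package; as a rough base-R illustration of what quantile normalization does (sort each column, average across columns per rank, then hand each value back according to its rank), something like the sketch below works. It is only an illustration, not a replacement for the package function; among other things it sidesteps proper tie handling:

```r
# Minimal quantile-normalization sketch (illustration only; see preprocessCore
# normalize.quantiles() for the real thing)
quantile_normalize <- function(m) {
  ranks  <- apply(m, 2, rank, ties.method = "first")
  sorted <- apply(m, 2, sort)
  means  <- rowMeans(sorted)            # mean of each quantile across arrays
  apply(ranks, 2, function(r) means[r]) # put the means back by rank
}

m  <- cbind(a = c(5, 2, 3), b = c(4, 1, 9))
qn <- quantile_normalize(m)
# every column now contains the same set of values (1.5, 3.5, 7),
# so every array ends up with the same median and the same quantiles
```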
All right, so let's go back. The next question was to make 16 plots, and in these plots we wanted to show the correlation of one array versus all the other arrays. We can do this using a for loop, or using an apply; the apply version we did during the Tuesday session, just to show people how it works, but here I'm going to do the for loop.

First things first: we need to ask for 16 plotting windows. Sixteen is nice because it's 4 by 4, so I set my mfrow parameter to 4 by 4, which allows me to do 16 plots in a single window. Then I say: for x in 1 to the length of arrays$Atlas_ID. Each row in the Arrays file is a sample, so when we take out the Atlas_ID column we have a vector, and the length of this vector is the number of samples, which is 16. We could have just said for 1 to 16, but I like using the length of a vector, because we could get more samples later on, and then I don't have to adjust the script: the script just reads the annotation file, and if the annotation file now contains 30 animals, my code still works. Otherwise I would have to go through my code and at every point find and replace 16 with 30, and that's something you don't want. You always want to make your code depend on input files, not on fixed numbers.

So we go through all of the elements in this vector, and then what do we do? This is a big statement, so let's break it down. We do an as.numeric, because we want to force the result to numerical values. Inside that, we use the cor function to calculate the correlation, and here you see that I'm using this minus x, like the minus one earlier, to throw away a column. So what I'm asking for is: calculate the correlation between the array data and
from the array data, take the column at position x of the Atlas IDs; this is the name of the sample that we're calculating the correlation from. To make it clearer I could have called it "sample from". Then we do the correlation of this one vector versus the array data where I now throw away that one sample: x selects it on one side, and minus x removes it on the other side, keeping the other 15 columns. So I compute the correlation of a single vector of around 30,000 measurements versus the array data, which is again 30,000 measurements, but with 15 columns, because I'm throwing away the column that I'm currently looking at.

I store this in a variable called correlations, and then I do the plot: plot the correlations, with "Correlation" as the y label, the name of the sample we correlate from as the main title, and I disable the x label and the x axis, because I want to put the sample names there myself: the default labels are suppressed and no tick marks are drawn. Then I add my own axis: on the first axis, the x axis, put the names of the samples which I am comparing against, at positions 1 to the length of that vector, which is 1 to 15, with las = 2 to rotate them so I can read them all.

I hope that's clear. Let's run it in R so you can see what it looks like. We get one big picture that looks like this: on top we see the sample from which we calculate the correlation, and below we see the other samples, without the
sample which was used to calculate the correlation; we correlate one sample against the other 15. And what we see here is something we want to see: if we look at this hypothalamus sample, it is highly correlated with the first three hypothalamus samples, then there's a drop in correlation towards the gonadal fat samples, and then the correlation is high again towards the other hypothalamus samples. This is really good, because it tells us that, at least for this sample, our data is okay: hypothalamus should look more like hypothalamus, and gonadal fat should look more like gonadal fat. If we look at the first gonadal fat sample from the same individual, we see the same thing: low correlation with the hypothalamus samples and high correlation with the gonadal fat samples. So at least the company didn't swap one of the individuals. When you're doing research it's very easy to mislabel a single sample, so doing these kinds of quality control plots is very, very important: if a sample were swapped, you would directly see it, because one of the hypothalamus samples would correlate with the gonadal fat samples, and you would say "oh, something went wrong there"; you could probably even figure out which two samples were swapped with each other. So, just a basic quality control plot.

All right, now to the PowerPoint, and back to Notepad++. Here is the apply version; if you want to know how we wrote that one, it's very similar to this one, but watch the recording on Moodle, where there's about an hour of talking through how to do it. The next question was another quality control plot, where we look at the heat map. We use the heatmap function on the correlation, but now we calculate the correlation of each column of the matrix to each other column.
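Both quality control plots, sketched together on random toy data (the object names mirror the lecture's but are assumptions, and the random data will not show the tissue structure of the real set):

```r
set.seed(42)
arraydata <- as.data.frame(matrix(rnorm(200 * 16), ncol = 16))
arrays    <- data.frame(Atlas_ID = paste0("Sample", 1:16))
colnames(arraydata) <- arrays$Atlas_ID

pdf(tempfile(fileext = ".pdf"))    # throwaway device so the sketch runs headless
par(mfrow = c(4, 4))               # 4 x 4 grid: 16 plots in one window
ids <- arrays$Atlas_ID
for (x in 1:length(ids)) {
  # correlation of sample x against the other 15 samples (-x drops column x)
  correlations <- as.numeric(cor(arraydata[, ids[x]], arraydata[, ids[-x]]))
  plot(correlations, ylab = "Correlation", main = ids[x],
       xlab = "", xaxt = "n")      # suppress the default x axis and label
  axis(1, at = 1:length(ids[-x]), labels = ids[-x], las = 2)
}

# second QC plot: all-versus-all correlation matrix as a clustered heat map
cormat <- cor(arraydata[, ids])
heatmap(cormat)
dev.off()
```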
Again I use arraydata with arrays$Atlas_ID to get rid of the Sequence column. When I run this in R we get a heat map, and the heat map confirms what we already know, because here you can also see that our data falls into two major groups: the hypothalamus samples have high correlation with all of the hypothalamus samples, and the gonadal fat samples have high correlation with the gonadal fat samples, which is what we want to see. One of the nice things we also start seeing here is a little bit of substructure: there are, for example, three samples which seem to be closer to each other than to the others. So the data falls apart not just into two groups but also into subgroups, and when we looked at the array annotation file we already saw that there were different types of mice: Berlin Fat mice, the standard laboratory mouse, and a mixture between the two. But for this plot, the conclusion should be: no samples were swapped in our pipeline. When we did the extraction we put everything in tubes, and that all went well; we sent it to the company, nothing was mixed up in shipping, and the company didn't mix anything up. The data that we have is good, it is reliable; that is the conclusion we can draw at this point.

All right, so the next step was to split the data and compute all kinds of more or less descriptive statistics on it. To do this, the first thing I wanted to do was split the two types of tissue. I say: arrays tissue equals hypothalamus. This gives a TRUE/FALSE vector: for every row in the Arrays file it tells me whether this sample is hypothalamus, yes or no. Then I can use this TRUE/FALSE vector as the index, so I only take the rows for which the Tissue column
was hypothalamus. I put that subset of the Arrays back, and then I say: give me the Atlas ID. So ht_samples contains the names of the samples which are hypothalamus. Then I say: take the array data, take out only these columns, and call the result HT, capital HT; those are only the hypothalamus samples. And then we do the exact same thing for gonadal fat.

This is just to make it easier for me; I could have kept all of the data in the one array data matrix, but I'm going to subset it, and of course you could have used the subset function as well. I just like to do it this way, because generally you don't want to select on one thing but on multiple things: I might have wanted to select only the BFMI individuals, and by doing it like this I could just add an "and" and say, well, I also want the strain to be BFMI. The subset function is good, but I like doing it more or less explicitly, because it allows me to add more conditions later on; if I only want to look at the BFMI, that's relatively easy in this structure. You could have done it with the subset function as well, but I just like being very explicit, and I use this structure a lot: arrays, arrays tissue is hypothalamus, Atlas ID. And I always read from the inside out: when I start reading code, I always start from the inside out. The first thing I see here is "arrays tissue is hypothalamus", so okay, this is a selection; then I see "arrays Atlas ID", and now in my mind I know: I go through every row, determine if it is hypothalamus, and then take the Atlas IDs. Call this ht_samples, make my subset, and do the same thing for gonadal fat.

Let's go to R and run this, just to show you that it really works. So now I have my ht_samples, ht with small letters, and these are
the samples which are hypothalamus. You could have used substring as well, and we discussed during the Tuesday seminar that you could use a grep function, so there are multiple ways of doing this and all of them are fine; I just do it like this, and this was my way of doing it.

All right, so then we wanted to calculate, for each gene, the mean for hypothalamus, the mean for gonadal fat, the standard deviation for hypothalamus and for gonadal fat, and then do a t-test between hypothalamus and gonadal fat, to see if there's a difference between them. I do this all in a single for loop. The first thing I do is create something called results, which will contain all of my results. Initially I have nothing calculated, so I assign it an empty value; I just define the variable so that it exists, but it is not yet a matrix or a vector, it has no type, no structure, there's just nothing in there. But the variable exists, so I can use it for adding things to. Then I say: for x in the row names of the array data, so I go through each of the rows. For each row I select the row from the hypothalamus set and call it ht, I select the same row from gonadal fat and call it gf, and then I make a vector: this vector contains the mean of the hypothalamus samples, the mean of the gonadal fat samples, the standard deviations, and the t-test between them, from which I only remember the p-value, because I'm not really interested in the rest of the t-test output; the t-test does a lot of other computations as well. So I'm creating a vector called values, with a length of five, and then I say: take the results and row-bind, so
take the matrix that we have, which initially is completely empty, and add a row to it.

GeneralGulag in chat: "how programmers get a girlfriend: girlfriend equals lots of code". Yeah, I don't know; girlfriends and programmers are generally kind of an incompatible thing, although it was more so in the 80s, when you had movies like Revenge of the Nerds and stuff. That was a crazy time, when nerds were not popular. I think with Bill Gates nerds became really popular, in a way, because of lots of money, and that's maybe how a programmer gets a girlfriend: lots of money, because you program, so you earn lots of money. "Become a rock star programmer", yeah, become a 10x programmer, that's what everyone wants. But let's go back to the assignments; we can talk about rock star programming later. I would say: become a streamer programmer. Streamers are nowadays the new nerds, in a way, because streaming used to be a fringe thing, but since people started watching streams more, and especially due to the pandemic, streaming got a big, big boost. So become a streamer programmer, not a rock star, because then you'd have to play an instrument or be able to sing, which I can't.

So, does everyone understand 3b? If you have any questions, just ask them; that's what I'm here for. If there are no questions, I'm just going to run it in R, because it will take a while, so we have a couple of minutes to talk while it does all the computation: we have to calculate a mean and standard deviations 30,000 times, and then do the t-test, which also takes some time. But the idea is: we start off with an empty matrix, we calculate the values that we need, and then we row-bind them to the matrix. In the end we just say that the row names of the results are the row
names of the array data, because I went through the row names of the array data one by one, and then I also add column names: the mean of hypothalamus, the mean of gonadal fat, the standard deviations, and the t-test value.

All right, let's run this, and let's go to R. This will take some time, so this is something we have to wait for; let me change my model as well, I'm going back to zombie instead of being a robot. If you have any questions you can ask them now while we wait. It's really slow, and that's just the way it is, because we're doing a lot of computation; if you wanted to do something like this in Excel, it would take a while as well, because Excel has the same kind of issue: a lot of rows. We're doing, I think, 30,000 genes, or even 50,000. So I'm just going to sit here and talk over it a little bit. Anyone got a nice joke? Just throw it in chat. No nice jokes? I should remember that there's a bit of a delay with Twitch, so when I ask for a joke, you only hear it a minute and a half later. Florian, you have a joke? You're a funny guy; a rant is also fine, Florian, I know you like ranting just as much as I do. Nothing? Then we'll just have two minutes of dead air, looking at the screen and waiting. "NFSW" jokes — that's "not suitable for work", actually written the wrong way around — yeah, you can put them in chat, I don't have to read them. Do I have a joke? No, I don't really have a joke.

On another note: "would it be possible to have the first exam a week later?" Probably, but then it wouldn't be on my birthday, and I guess that you are busy on the 15th. I will write it down and ask the examination office if that is possible. So, a question about the exam: that would be the 15th plus 7, so the 22nd, right?
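While that runs: the tissue split and the statistics loop of 3b, put together as one sketch on small toy HT/GF data (in the lecture the loop runs over the real arraydata; all names and column labels here are assumptions):

```r
set.seed(1)
genes <- paste0("gene", 1:5)
arraydata <- as.data.frame(matrix(rnorm(5 * 8, mean = 9), nrow = 5,
                                  dimnames = list(genes, paste0("S", 1:8))))
arrays <- data.frame(Atlas_ID = paste0("S", 1:8),
                     Tissue   = rep(c("HD", "GF"), each = 4))

# split by tissue via a TRUE/FALSE vector, read from the inside out
ht_samples <- arrays[arrays$Tissue == "HD", "Atlas_ID"]
gf_samples <- arrays[arrays$Tissue == "GF", "Atlas_ID"]
HT <- as.matrix(arraydata[, ht_samples])     # hypothalamus columns only
GF <- as.matrix(arraydata[, gf_samples])     # gonadal fat columns only

results <- NULL                        # empty to start; rows get bound on below
for (x in rownames(arraydata)) {
  ht <- as.numeric(HT[x, ])
  gf <- as.numeric(GF[x, ])
  values <- c(mean(ht), mean(gf), sd(ht), sd(gf),
              t.test(ht, gf)$p.value)  # keep only the p-value of the t-test
  results <- rbind(results, values)
}
rownames(results) <- rownames(arraydata)
colnames(results) <- c("mean(HT)", "mean(GF)", "sd(HT)", "sd(GF)", "p(t-test)")
```

Growing a matrix with rbind inside a loop is slow for 30,000 rows, which is exactly why the real run takes minutes; it is kept here because it is the pattern the lecture walks through.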
"And that's a bad thing." Oh, do you mean it's a bad thing having it on my birthday? No, I actually like having it on my birthday, because then you can all wish me a happy birthday, and because of the pandemic I then don't feel so lonely, and that's always a good thing. "Would be ideal for me too." Okay, so there are two people that want to move it; I will ask the examination office if it's possible to move it a week later. GeneralGulag: "my flatmate just dropped half a liter of water over his laptop and just asked for a screwdriver." A screwdriver! "A week later would be the 22nd, but the 21st would be even better." All right, you know what, I will put a vote on Moodle, and then everyone can vote for their preferred date. Is that a deal for everyone? Does everyone think that would be good? "That's a good idea", yeah, then everyone gets a vote and we can just pick the date on which most people are available. Good, okay, then we're going to do that.

Still running; it takes a long time to do this, I should have done it beforehand. "Sounds sweet", yeah. All right, any other suggestions? We need some dates that we can put on the vote, so I will put the 21st, the 22nd, and also the 15th, because that's my preferred date, but if you don't want to spend my birthday with me, that's fine. "For every week later it gets harder; let's say double the questions every week." No, I'm not going to double the questions. I already have to make 42 questions, because a little birdie whispered in my ear that there are students of this course who are looking for exams from the last couple of years, which means I am forced to make completely new exam questions, which I hate doing. "Anytime during the week of the 21st would be great." Okay, so then I will put the whole week on Moodle and we can vote on it. But yeah, I'm not too happy about people
wanting to have the questions from the last couple of years, but I think I'm generally pretty good at not giving out the exam questions. When we still did the exam in person, you had to give me back the exam questions plus your answers when leaving the room, so that people couldn't take the questions with them, and I also always made 42 questions so that people can't really remember all of them and write them down after the exam, because I don't like people cheating on exams. I know that people do, but I try to prevent it as much as possible. But I had someone whisper in my ear that someone is asking other people to... "do not underestimate a student's ability to remember questions". Is it really an underestimation? If you're smart enough to remember 42 questions, then you deserve to have those 42 questions, so that you can write them down afterwards, right? And I always change the exams anyway, so the questions will be very similar but not identical, because I do want the exams to be more or less equally difficult every year. "If they had only used that gift to study", yes, if they are smart enough to remember 42 questions then they could have learned like 16 slides, because in the end the overview lectures are between 16 and 24 slides or something, and that's more or less everything that I think is important. All right, and we're done. Good, good, good. So let's look at the results. Oh yeah, results, sweet, sweet results. Here we see that we have the dark corner and the bright corner: for both we have the means, we have the standard deviations, and then we have the t-test, and for all of the probes we have the same thing. And when looking at this data, can anyone tell me why we should be highly suspicious of it, with all of the things that I told you? When you see a result like this, why would you go, like, yeah, I don't know if my experiment
went correctly? Why? What went wrong? So, you tell me, because I did the experiment, so I know what went wrong, and this of course has to do with the stuff that we talked about at the beginning, when we were looking at the different arrays. I'm just going to leave a little bit of dead air here, because people need to be able to type in their answers as well. So, no one wants to take a guess? Just take a guess; I know you know, and if not I'm going to ask Florian, because then Florian can come up to the board and give his answer. "Busy", oh, that's too bad. "I am clueless". All right, okay, let's look at the data. We look at the bright corner, and I told you guys the bright corner is our positive control. So what don't you want for your positive control? Well, here we see that in hypothalamus the positive control is 14.0, and in gonadal fat the positive control is 13.1. Then if we look at the t-test value, we see that this is a highly significant difference, so the positive control is not the same between the two tissues, and having a positive control which is differentially expressed between the two conditions you want to compare is of course a very bad thing. Your positive control should be the same in tissue one compared to tissue two, and that is what's going wrong here. This has to do with the fact that in hypothalamus there are more genes expressed than in gonadal fat, but it's not just the number of genes; something made the positive control light up more intensely in the hypothalamus compared to the gonadal fat. And it's a very, very significant p-value, 4.5 times 10 to the minus 9, so there is a really, really significant difference in the positive control, and that is something that you never want: you don't want your positive control to be differentially expressed. You see that the negative control is fine, right? The negative control
is not an issue; its p-value is like 0.7, so that's not significant. But this one is really worrisome: why is the positive control different from one tissue to the other? That should make you highly suspicious about the rest of the results. But let's go and do the last two assignments so that we're done with them. The first question was: use Bonferroni correction to correct for multiple testing. So yeah, the data is very suspicious, and in this case it's the hypothalamus which is very suspicious, but first, use Bonferroni correction to see how many genes are differentially expressed. Bonferroni correction is easy: you take the p-value threshold that you want and divide it by the number of tests that you did. So I'm just going to say 0.05 divided by the number of tests I did, which is equal to the number of rows in our results. Then I compare the t-test p-values, so the t-test column of the results, and ask which values are lower than this threshold. That gives me a tf vector, which is again a TRUE/FALSE vector: for every element, so for every row in the results matrix, it says TRUE or FALSE. Then there are two ways of counting these. One is to just sum them up, because FALSE counts as zero and TRUE as one, so if I want to know how many are TRUE I can just sum them all up. But I quite often use the other system, where I ask which ones are TRUE. "Is it also possible to do the p.adjust?" Yeah, you can use p.adjust as well, which we do for the second one. This way is a little bit more code, but I think it's a little bit clearer what you're doing: here you take your p-value threshold and divide it, whereas with p.adjust you adjust the p-values and then ask which ones are below 0.05. Either way you get a TRUE/FALSE vector; you can sum them up, or you can ask
for the length of which ones are TRUE: which gives you the indexes, and the length of that is the number of genes which are significantly different. So let me show you the difference between using Bonferroni correction and using Benjamini-Hochberg correction, although we already found out that the data is highly suspicious, because there's a difference in the bright corner, so in the positive control. If we do Bonferroni correction, it tells us that there are 14,000 genes which are different between fat tissue and hypothalamus tissue; if we do Benjamini-Hochberg correction, it tells us that there are 31,000 genes which are different between gonadal fat and hypothalamus. Of course these numbers are highly suspicious, because it can't be that more or less all of the genes in the mouse genome differ between gonadal fat and hypothalamus. Although these tissues are relatively different, we would expect things like ribosomal genes to not differ that much between hypothalamus and gonadal fat, and there are a lot of genes which should not be expressed at all, so those should also not be different. So we get large numbers. But the thing you have to learn from this is that when you do Bonferroni correction you get fewer results, but those results are more reliable, because the true positive rate among them is higher. Benjamini-Hochberg gives you more results, but among those results the true positive rate is lower; in exchange, you miss fewer genes that really are different, so the false negative rate is lower. So it's a balance between type I and type II error. Bonferroni minimizes the type I error: if something is really different then it will be found using Bonferroni, and you don't call too many things different while they're actually not. Benjamini-Hochberg is the other way around: it allows you to
not miss anything which is different, so you accept a few more false positives because you want to get all of the true positives. That's the difference between using Bonferroni and using Benjamini-Hochberg. And hey, Rigoletti, you're right, you could have used the p.adjust function for Bonferroni as well. All right, so those were the assignments for today. Honesty question for you guys: who did the assignments, who was able to do them, and what was your opinion on them? Because this is kind of real data, right? This is data that we collected three or four years ago, so it's real data that you're working on. I know that it's not your own data and that it might not be exactly what you're used to, but I think it's better to work on real data than just using the built-in data sets in R. There's nothing wrong with those data sets, of course, but I just think it's nicer to work on data which either has not been published or has recently been gathered, so that you get the idea of, yeah, there's really something that we can learn from this, and of course you can create some nice plots along the way. "Did it, only managed half before Tuesday", that's good, that's good. I think that if you loaded it in and were able to do the correlation plots, the first assignments, then that's already a whole bunch, and then being able to do the box plots and look at them; so doing half means that you probably got to about here. "Unfortunately I don't have time to do them at the moment because it takes me up to five hours". Five hours of doing all the questions? Then you should definitely email a little bit earlier for a couple of tips, and if you say, well, I don't exactly understand what you mean by this question, then of course just drop me an email and I can clarify myself. General Gulag: "I was on a 10-day field work trip", so no, are you doing
the slave work together with Daniel, who's slaving away in the middle of a field near the Tesla factory? And why is everyone so busy? "Just did the first part until 2a, didn't have time to finish", all right, doesn't matter, as long as you tried and of course spent some time on it. You can look at my answers and then think about, okay, how would I have done it, and does this kind of overlap? In the end the assignments are there for you to practice; we discuss them just so that I can show you how I would do it, but of course there are many different ways to do this. And don't just say, well, we're done with lecture six or lecture seven, so I'm not going to look at the... "I was in Lower Saxony", okay, interesting, so what kind of field trip did you have then? But do try to do the assignments: if you got to 2a, then spend another half an hour this weekend on question 2a and try to do it yourself. "I finished 1 and 2", that's good, that's good, you're such active students. In general, when we still had in-person lectures, we would do them more or less together: we would have a three-hour or two-and-a-half-hour lecture, and then in the classroom together we would do the first couple. I would have people spend 15 to 20 minutes doing 1a and then we would discuss 1a, and then people would do 2a and we would discuss it after 30 minutes. That's the nice thing about in-person lectures. "Collecting data for shore vegetation", oh nice, so there's a lot of shore vegetation in Lower Saxony? Is there actually a shoreline in Lower Saxony, or isn't it completely landlocked? Or is it the shore of, well, not a seashore, probably a lakeshore. Yeah, that's what I thought. If you wanted a seashore you could go to the Baltic Sea, which is nicer. "It's like the Ozarks", so you're saying Lower Saxony is filled with drug dealers and mob bosses that want to
launder a lot of money? I don't think that's Lower Saxony for you, but you never know, you never know. I'm not going to have any opinions on Lower Saxony. "The Lower Saxon Riviera", yeah, that sounds good, that sounds good. All right guys, we've been at it for an hour, so I will stop the recording and take a little break.
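[Editor's sketch] The per-probe summary built at the start of this session, with the means, standard deviations, and a t-test p-value per row, can be sketched in R roughly as below. The data here is simulated; the real ArrayData file, its dimensions, and the exact column names from the course are not reproduced, so treat every name in this snippet as illustrative.

```r
# Simulated stand-in for the real microarray data: 100 probes (rows),
# with 4 hypothalamus arrays and 4 gonadal fat arrays (all hypothetical).
set.seed(1)
n_probes <- 100
hypo <- matrix(rnorm(n_probes * 4, mean = 8), nrow = n_probes)
fat  <- matrix(rnorm(n_probes * 4, mean = 8), nrow = n_probes)

# Build the results row by row: means, standard deviations,
# and a two-sample t-test p-value per probe.
results <- t(sapply(seq_len(n_probes), function(i) {
  c(mean(hypo[i, ]), mean(fat[i, ]),
    sd(hypo[i, ]),   sd(fat[i, ]),
    t.test(hypo[i, ], fat[i, ])$p.value)
}))

# Add column names at the end, as described in the lecture.
colnames(results) <- c("mean.hypothalamus", "mean.gonadal.fat",
                       "sd.hypothalamus", "sd.gonadal.fat", "t.test")
head(results)
```

On a full array (30,000 to 50,000 probes) this per-row loop is exactly why the computation takes a while during the stream.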
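[Editor's sketch] The two multiple-testing approaches compared in the lecture, Bonferroni by hand versus p.adjust, can be sketched like this. The p-value vector here is made up for illustration; in the lecture it would be the t-test column of the results matrix.

```r
# Hypothetical p-values: 80 drawn from the null, 20 clearly significant.
set.seed(2)
pvals <- c(runif(80), runif(20, max = 0.0001))

# Bonferroni by hand: divide the alpha by the number of tests,
# compare, and count the TRUEs (sum(tf) and length(which(tf)) agree).
threshold <- 0.05 / length(pvals)
tf <- pvals < threshold
n_bonf <- length(which(tf))

# The same correction via p.adjust, plus Benjamini-Hochberg ("BH").
n_bonf2 <- sum(p.adjust(pvals, method = "bonferroni") < 0.05)
n_bh    <- sum(p.adjust(pvals, method = "BH") < 0.05)

c(bonferroni = n_bonf, bonferroni.p.adjust = n_bonf2, BH = n_bh)
```

As in the lecture, Benjamini-Hochberg never flags fewer tests than Bonferroni: its adjusted p-values are always at most the Bonferroni-adjusted ones, which is the more-results, more-false-positives trade-off discussed above.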