 Welcome, everyone. If you're watching this on Twitch, thank you for joining me. If you're watching this on YouTube, enjoy. Today, we will be talking about bioinformatics, of course, like we always do, and I have a little bit of, or a couple of slides about DNA meta barcoding. So I wanted to show you guys that. I made a couple of slides last year on a request, and I think that DNA meta barcoding is something that we should kind of quickly discuss, because especially in things like ecology, agriculture, and fishery science, it's used a lot. So I think that just having the highlights or what are the pros and cons of the method would really work. And then there will be a whole bunch of things about literature and literature management, which is not so much related to bioinformatics, because it's related more to science in general. But I think it's important, because a lot of people that do their master, they don't realize that being in science is different than what people expect in a way. It's much more about politics, and it's much more about people you know, and then it is about actual science in a way. So I just wanted to kind of share my thoughts about that and talk to you guys about the literature management part. So the way that I structured the talks for today is that we will of course first do the assignments, because I forgot last week to put them online. So we're just going to do them now live for the kind of first 45 minutes to an hour. And then we will have like 12, 13 slides on DNA meta barcoding. I actually might have added one additional slide. We're not going to talk about meta-metline because we already did that, but I want to talk about things like citations, Web of Science, Google Scholar, ResearchGate, Scientific Reference Managers like EndNote and Mendeley. And I also want to talk a little bit more about version control. Since the first lecture, I actually added this whole part about creating a GitHub account. And we never really got back to that. And I originally was planning to start using it for the assignments. But since many people had problems kind of setting it up, I thought that it would be better to just have the assignments like we normally do. But I'm still thinking about integrating version control, at least with the lectures, perhaps next year. So we'll see that. So this is what we're going to do today. I hope everyone's okay with that. And then at the end, of course, we will have like plenty of time to ask questions since it's a short lecture. I just wanted to remind you guys that the exam will be on the 17th of February. So not the 14th, like I said before, but the 17th of February at two. And the make update. So the second date for the exam is now on the 7th of April. So I don't think that that changed from last week, but I actually got approval now. So the first one should be in Agnes. The second one, I don't know. So if one of the students could check to see if the second one is already in Agnes, then that would be awesome, because then I would know, right? I don't, I'm not a student, so I can't see if it's available to register. But if it is, then go to Agnes and register so that you can can join the exam. All right. So then we will start with the answers to the previous assignments. So let me actually pull up the assignments themselves, which I actually did not do because I was busy doing all kinds of other things, like yelling at PhD students, which I shouldn't do because they try and do a good job. And they do a good job on general, but sometimes they just need a little bit of like high volume motivation or whatever you want to call it. So let me see, gene expression, right? That was the one that we wanted to do. All right. So I have got the assignments in front of me. So the assignments, of course, you can get from my website and we will just do the answers today. So let me show you the answers that I have and let me show you guys the R window, see if that works perfect. All right. So I'm just going to do like we normally do, right? So analysis of microarray data. So create a new file, which will hold your analysis, go to the directory in which you store your data by executing blah, right? So I'm just going to do a new file, right? Because I'm not going to look at my own answers. I'm just going to do it for you guys live. I am going to copy the header and the working directory because, so, and then let me save it somewhere. So I'm going to say answers12.r so that we get some code highlighting, right? So this is where I stored my data. All right. So the next step is load the three files into R. So we have three files in the data set this time. So we have a file which contains microarray data. Let me see how they are called. So we have array data, which is the data itself. We have arrays, which is the description of the array. So which individual went on there. And then we have probe annotation and that's the annotation of each of the probes. So I'm just going to show you guys how I normally would do this, right? So I would normally just do it like this, right? So I would just copy paste the working directory code. I would then go to the R window. I would copy it in the R window. And then I would first do a directory search. So I would just type here, right? And then you see that there's many different files in there because of course I've stored other files there as well. But arrays, no, let's first load array data.txt, right? So the thing that I normally do is just say, read.table. And then we just say, all right, array data.txt, store it in AA, right? Because initially I just want to look if this all works. All right. So then I asked for the dimensions of AA. So that already seems quite okay. 55,821 probes measured on 17 columns. And then the first thing that I do is just say AA1 to 10, 1 to 10, right? So just show me part of the data. And this actually looks pretty good. It detected that there was a header. You see that we have the row names in there. So that looks pretty good. Let me check actually the tail end, right? Let's see if at the end of the file everything is still okay. That also looks pretty good. So that's what we do. So this is just how we use it. And I'm just going to copy this code because it loaded in the data correctly. Go to notepad++ and this is our first. So let me give it a good name. So I'm just going to call it array data. All right, control duplicate, duplicate the line, right? Load in the next file, which is called arrays. So let's hope this one goes as smooth as the other one. So I'm just going to copy paste the loading in. And then, of course, I want to assign this. And the nice thing is if you're doing this in R, right, you can actually flip the arrow around and assign the retable into a variable, right? So that's the nice thing about R is that the arrow allows you to assign back into or in front of. So you can just flip it around and then say retable arrays. And that's called this AA again, right? So now if I look at the dimensions of AA, it's 17 by 5, which seems to be okay. If I then look at AA, then it doesn't seem to be fine, right? So one of the problems here is that you see that it didn't detect the header because it made these header names and it actually made this row number one, right? So I have to update the call and then have to say comma header is true, right? Because there is a header in there. And then if we look at AA, now everything looks fine because the first row is really data. And we have the different column names also being annotated. All right, so just let's take this code. Go back to our Notepad++ window and then we just flop this around and we call this annotation or anot. Anot is pretty bad. So we call this array, for array annotation. And of course, keep it consistent. I can just say array annotation, store it in there, right? Don't forward the arrow. But for R, it's really easy. If you are in the R window and you've already typed like a whole bunch and you then want to assign this to a variable, you can always use the forward error to assign to it. Yeah, just call it arrays. I don't know, like keep the file name and the variable name consistent. All right, then there was one more file. So let me go back to R, see if I can find the file name just from the directory listing and the next one that we needed was probe annotation. So again, we're going to do a read.table. And I'm just going to say read the probe annotation. And since I'm using the forward arrow, I'm just going to store it in AA again. So let me look at the first 10 lines. Well, no, first ask for the dimensions of AA. 37,000. That doesn't seem to add up because in the other file, so let me load the array data in its own. So if I ask for the dimensions of array data, then it says that there's 55,000 probes. Right? But when I read in the probe annotation file, it funnily enough says that there's only 33,000. So that doesn't match, right? Because every probe should have annotation. So why is that? So let's try and figure that out. So let's look at AA, the first 10 lines. Okay, that looks okay-ish. We just have to make sure that we load in the header correctly. And of course, we want to have the first probe names. We probably want to have the probe names correspond to the row names. So let's do that. So we're going to say, read table. This has a header, right? So a header is true. Let's look at AA again. That already looks a little bit better. And now we want to say that the row names are in column one. So we just say row names is one. And then it says duplicate row names not allowed. So we're not allowed to do that. All right, then perfectly fine. So now we're still here, but I'm still a little bit confused about why it didn't read in all of them, right? And that can happen. It can happen that when you have one of these tabs separated files with annotation, that somewhere there's like a hashtag, right? Because hashtags are comments according to R. So if somewhere in the annotation, for example, here in the MGI description column, if there's a hashtag, then it will stop reading at that point. So one of the things that I can do, right, is just say instead of reading the table, let's just use the comma separated file function and then say the separator is a tab. Just to see if they come up with the same thing, right? So now if I ask for the dimensions of AA, it's still 33,000. That's an interesting, so tail of AA. That also looks to be perfectly fine, actually. There's no mismatch or things that got shifted. So just wondering why it only has the annotation for 33,000. Use the dim function. How many probes are in the array data file? So that's already what we knew. So actually, let me just copy paste the code. That's weird that it doesn't really match up with the two. So I'm just copy pasting the code that I just wrote into notepad. And then I'm just going to say probes like this. Right, so we're just going to do something like this and then make sure that we read in all of the files properly. So dim array data. So there are 55,800 probes, but if I ask for the dimensions of the probe annotation, it says 33,000, right? And then if I ask for the dimensions of the arrays, then that is 16. And that makes sense, right? Because the array data had an additional column, I think. Yeah, it has the sequence in there. So that makes sense. So the dimensions match because there are 16 individuals annotated. There are 16 individuals in the array data plus the sequence. Oh, I'm very sorry. Thank you, Xanaxin, for notifying. Right, so I had just copy pasted in the code, just looking at the different dimensions. So there is a mismatch between the dimensions here. So the question is, where does that come from? And also the question is, is why am I actually allowed to read in the array data with row names? But when I'm trying to read in the probes with row names, it tells me that there are duplicate row names. So there are some mismatch or weirdness in this. So it looks a little bit strange. It looks a little bit strange. It's a little bit strange. Well, let's not fret too much about it. We'll figure it out as we go along. Right? So let's just answer the first question, right? Because we just want to answer questions. So first question was, so Q0. So use the dim function to few different dimensions. How many probes are in the array data file? All right. So we already knew that because when we go to R, the array data file has 55,800 and something probes. I'm just going to take this number and add it to my answer sheet. So that's question number zero answered. All right. Then question number one, take a closer look at the header of the array data file. Is there any structure to the column names? All right. So let me do that. So array data, right? Yeah. So array data and just look at the column names because that's what the assignment says. So we see that the first column is sequence, which is just like one of these access columns. And then we see that there is indeed some structure, right? Because we see HT and then a number or we see GF and then a number. Right? So the structure here seems to be that they're, and are the numbers shared? Yes. Because I see here HT 514. And here I see GONATO FET or GF 514. So there is indeed some structure. So the answer to this should be that, so hashtag Q1. Yes. Structure is Pate or GF followed by a number, right? And numbers are shared between HT and GF. All right. So we learned something. So there's probably some structure there, which we will figure out later. All right. Next question is use the plot function to plot the HT 2010 column of the array data file. All right. So let's do that. So we say plot and then we say array data and we just are going to take out HT 2010. And then that's what we want to do. So let's go to R and make our plot. It will take a little bit of time. But here we see the data, right? So here we see 50 something thousand rows. Every row contains a value. And you see that the values have quite a big dynamic range, right? So most values are between like zero and 10,000. But you see that some measurements actually came up being 30,000 or 300,000 even, right? So they're really, really big. And if we would do a little bit of a histogram, right? Because then we can see how the structure of the data looks like. We see that the histogram has a lot of values here, and then it just dumps off directly. And of course, we can add a little bit more breaks. So let's add like 100 breaks so that we get a little bit more. And then you see that there's a whole bunch of values, which are relatively low. And then we see here that there's a few values, which are higher. And then, of course, the highest values here in the 300,000 range. And we don't even see that because there's so many values in this first one. All right, so that's question number two. So we did that. So we have then in the answers, of course, we have just a plot and that's enough. All right. So three is print part of the other files to your R session and figure out what is in the different rows and columns. OK, so let's do that, right? Because we now want to understand more or less the data. So first, let's look at the arrays, right? Because that's the smallest one. So here we see that there's a file name in there. So this probably is where the data originally came from. We see a comp ID, which is an ID, which is this number, right? So the comp ID actually matches the column names of the array data. So that's good to know. So using the comp ID, we can then select from the array data all of the real data structures. Then we see strain. So strains, of course, are generally like things like inbred strains of mice or cows or other things. And here we see BFMI and B6N. So I talked about our Berlin Fetmau sometimes, or not sometimes, I talk about the Berlin Fetmau quite a lot during the lectures. So indeed, we have like measurements on Berlin Fetmau's. We have B6N and this is something that like if you're in the field of mouse genetics, you know that a B6N is more or less the reference mouse. So it's the mouse, which is the reference genome. And then we see strain F1. So in F1, if you study biology, you know that an F1 is a cross between two animals, right? You call that if you take two animals and you cross them together, then the offspring is called the F1 or the G1. And then we see tissue. So the HT and GF stands for different tissues. And then we see that the individuals are just individuals. So now we know how the data is structured, right? So if we look at the array data, right, we see kind of the same, we see a structure, right? So we see values for each of the individuals for the different tissues. And we see that we have the sequence and then we see that we have the probes. So that looks pretty good. And then we had this weird file, which was the annotation, which we called probes. So we also have probes. Let's look at the first 10 rows. And here we see that indeed, it seems that the probes have been just annotated, right? So we see here that this 55198884 thing is mapping somewhere on chromosome six. The probe starts here and there. Well, yeah. And then here we see that there's a gene there, an MGI symbol, description, chromosome, and then start and end position of the gene. So it's just the annotation for the different probes. What are we targeting? So that's good. So we learned something and we figured out what the different columns and rows are. Select only the columns containing data from the array data file. In other words, do not plot the probe sequences. Use the boxplot function to visualize the range of the different columns against each other. So this is one of these things where we can kind of use our trick, right? Because we know that when we want to do a boxplot, we can just give the boxplot and in this case we just want to boxplot the array data. But the problem with array data is, is that it has one additional column. So the first column is the sequence. So logically, we might want to say we do minus one. So we drop the first column, right? That's allowed in R. So here we say comma, right? So keep all of the rows, but drop the first column. But that's not really what I want to do because now if there's hard coded minus one in here. So it's much better to just use the array annotation, right? And we just learned that there's something which is called comp ID, right? Which matches the IDs of the data values. So this should work as well, right? So from the array data, select only the columns which occur in the arrays file, right? So we're kind of matching these two matrices together. We're using this column from the array annotation matrix to only select from array data the columns which have data. All right, so let's go to R and do this and then see what happens. All right, and this will take a little bit of time, right? Because it's a lot of data. It's 55,000 probes and it's a box plot so it has to calculate medians and standard deviations. But then we see something like this. So this is a little bit awkward, right? If I make it a little bit bigger, we can actually see all of the names. But if I drag it, you can already see that R has a hard time plotting all of these values since there's so many outliers, right? Because we already saw that in the histogram that the distribution is a little bit skewed. So I can still not read the names. So I'm just going to flip the axis around. So I'm going to say LAS is two and that will go and make the axis instead of being horizontal, it will make them vertical so we can actually read them. So now we have a plot which we can read. Okay, so this looks perfectly fine. And of course we have to check but there's no sequence column. So that means that using this other file with the annotation actually helped us. And now we want to use the log two function in R to log transform the micro-array data, save the resulting matrix into a new variable called log two array data. All right, so let's go to Notepad, do that. So we're just going to say log two array data is the log two. And then we're just going to throw in this thing which we made the box plot off, right? So we're going to do the exact same thing. Take the matrix, then select from the array matrix the column company ID and then use that to select only the data values. All right, so that's a log two transformation. And the nice thing is the log two function you can do it on the whole matrix in one go. So if we do this in R, it looks like this. And of course we now also want to plot it again. So let's just make a box plot again of the log two and I'm just going to press top and then it will fill in the name. Right, so now my box plot starts already looking a lot more meaningful, right? And this is the same box plot as what we saw two lectures ago has. So we see that we have hypothalamus or HD samples. We have GF which stands for guenado fat and we see indeed that there's a big difference in expression, right? So we see that the guenado fat samples on general have lower expression than the hypothalamus which makes sense because if we're thinking about fat tissue and then in fat tissue there's not as many genes expressed as there are expressed in in brain, so good. All right, so use the log two function. We did that, save the resulting matrix. We did that, create a box plot. We also did that and we looked at it. So we observe something and what we observe is is that well, there's higher expression in the brain samples compared to the fat samples. Some samples have some outliers. So, and they're like values which are too high and you'll also see that at the bottom there's a little bit of variance. So not every array starts exactly at the same intensity. Good, so let me actually save the box plot call, go back to notepad plus plus because I didn't save that one. And then this was Q number six, right? So let's just add a Q number six and then this was Q number five and I think this was Q number four, right? So just so that we go through and have some annotation. All right, then the next step is installing the pre-process core package from Bioconductor. I already did that, but of course what we normally would do is if you say I want to install something from Bioconductor, we just go to Firefox and then we just say pre-process core Bioconductor. We just press enter, we have Google fix this for us, right? And then when we read here, then it says to install this package, we just use the code, right? So we can do that. So if we just go to R and we copy paste in the code, then it will try and install it, but we'll see. It says that it's out of date, which is perfectly fine. I should actually update my R version. So now, hey, I see that the installation part and able to update packages, that's all pretty well. It just wants to update, so it doesn't install, so I'm not gonna update now. Right, packages not installed when versions are current, so I have it already installed. So I can just directly use the library pre-process core. Pre-process core. All right, so library loaded. Of course, we have to add the library function to our notepad plus plus window as well. And the way that I do it is always put the dependencies on the top, right? Because when people open up your script, they can directly see, oh, I need these libraries installed. And of course, we could have added the Bioconductor code to there as well, but that's just something that you want to do or not. But that's the way that I build up. So my scripts are always, it has a header who made it for what reason, which libraries are required, and then a set working directory, because this is different for everyone, because where you downloaded your data might be on the D drive or it might be on the E drive, but mine is, of course, in my projects folder somewhere. All right, load the pre-process core, losing library and normalize the datank using the normalize.quantiles function. All right, so that's the next step then. So we're just going to say normalize.quantiles. I just copy-pasted it from the thing. Take the log to array data and then store this into a new variable, select an appropriate name. So I'm just going to say norm array data, right? That's a relatively appropriate name. Good, then we go to R, and then we just do the normalization. So, all right, and then it says, ooh, there's an error, right? It expects a matrix in normalize.quantiles. And of course, if I would ask what is the class of this log to array data thing, then it says it's a data frame, right? Because data frames can have different types in the different columns, but in this case, when we look at our log to array data, it looks very numeric, right? Because there's only numerical values in there. So to fix this, we can just force it to be a matrix. So we are just going to say, well, since normalize.quantiles requires a matrix, let's make it a matrix, right? So just say as matrix, and then it works. So just to make sure we check the norm array data, let's just look at the first 10 columns again. And now we see something interesting because actually the normalize.quantiles functions, it actually ate our column names and row names. So, well, since it ate it, we have to put them back on. So I'm just going to go to notepad++. So I'm going to say normalize.quantiles with the as matrix function. And then here, I'm just going to say that the call names of the norm array data are the call names of the log to array data and the same for the row names. Because I want to have the row names and the column names on my new matrix as well, right? Just to know what's in every row and what's in every column. Good, so let's do this, go back to R, see what changes or see what happens. So now when we look at norm array data and we look at the first 10, then now we should have our identifiers back on top and we have our probe names back. So, yay, good. So we normalized our data, right? So we did the log to transformation to get rid of this massive range difference. And then we normalized it so that all of the quantiles are the same. So of course, I'm just going to make a boxplot of this whole thing just to make sure that everything went well. And when I look at the boxplot, now I see that, yes, it did indeed normalize across all of the arrays. Let me actually flip it around. Plus is two so that we can read it and we can now see that every array has the same average expression, kind of the same maximum expression and the same minimum expression. Of course, we now lost some biological data, right? Because initially we had the hypothalamus being higher than the gonadal fat and we now normalize this difference away. So that's just a shame, but we can't do anything else, right? If we want to compare the two tissues, then we can't have this massive difference of mean because if we would have that, then of course any statistical test would say, no, this gene is different and it's not different because it would not really be different. It's just because the one array is much higher expressed than the other. All right, so question eight we did. Question nine, create a boxplot. How does it look normalized like in the lecture? Yes, it does. So we're actually just going to copy paste this code and we're going to add this to the noteplot, right? So this was Q nine. All right, making good progress. When looking at the other files, we observed that this data is coming from multiple individuals and has multiple tissues measures. So we first want to know some more about the relationship between the different arrays, right? Because from the naming of the arrays, it would seem that we have hate and we have GF, right? And of course I expect hate things to be closer together than to the GFs, right? So if I would kind of build a tree in my mind had just based on the annotation, I would expect that hypothalamus samples would kind of cluster together and the GF samples should cluster together as well. So the question number 10 is use the correlation function and create a correlation matrix, save this matrix in a new variable, right? So we just take the array norm data, right? We use the core function and then we save this into core M for correlation matrix. All right, let's go to R and just throw this in. And when I look at my core M, then it looks like this, looks a little bit big, can't really see a lot. So I'm just going to make an image of it. So just that I can look at it, right? And now I can see this nice like checkerboard pattern. And then of course, I want to add the names. So I'm just going to say one to the number of rows of core M, right? So for the X axis use not zero to one, but use the head number than one to the number of rows of core M. And then the next one is going to be one to the number of columns of core M. And now here you can see that it uses the numbering, right? So this is one, this is two, and then this is five. Of course, we don't want this really, right? So I'm just going to say, do not plot an X axis and don't plot a Y axis. So I'm just gonna disable plotting of the numbers, right? Because I want to put the names there, right? Because then I can compare if the hattes are really highly correlated with the hattes and if the gonadal fets are highly correlated with each other. And I'm also going to disable the labels because I don't want the X label and I don't want the Y label because then I have enough space, right? I need some space on the side and I need some space here. Good, so this is the first part. So we made our correlation matrix and then made a little visualization of the thing. And now I have to put in the axis again, right? So I'm going to say axis one at one to the number of rows of core M. I am going to just take the row names of this correlation matrix, right? And I'm going to do the same for the second axis. So the first axis is the X, the second axis is the Y axis. So at is one to the number of columns and I'm going to write the column names there, right? So just to have the plot look a little bit better. So when we go back to R, and we're now looking at this thing and now we see this, again, we can't read all of them. So we have to make sure that we plot it and then do comma LES is two, right? Flip them around and I need to do the first axis as well. LES is two. All right, so now we start seeing the pattern that we wanna see, right? We see that these four hypothalamus samples, so are very highly correlated to these four hypothalamus samples, right? And of course, everything's highly correlated to itself. We see that these four hypothalamus samples also correspond or are highly correlated with the other four and we see the same thing for the gonadal fat. So indeed, our initial assumption saying that just looking at the names, be it HT or GF, have we expect the HT things to be more similar to each other than they are to the GF and the same thing for the GF, we expect them to be more similar than for the HT and that's actually what it looks like. If we wanna make this plot look a little bit better, right, because we can see here that there's not a really nice thing, then we can actually say box and put a black box around it so that it's really a box and of course we could add a title and these kinds of things as well. So I'm just gonna copy paste the code that I made, go to Notepad and then put in the code, just override it and then make sure that we don't do these things. All right, so indeed. So correlation indeed can help us understand the relationship between these things. Question number 11 is use a heat map to plot the correlation so we can also use the heat map function. So this is of course, hashtag doing it yourself and hashtag using the heat map. All right, so we're just gonna use the heat map function so I'm just gonna say heat map core M, right? And then go to R and then I'm going to show you guys how that looks in R, right? It looks very similar, right, to what we saw. It just uses these guy trees because it clusters them and then it reorders them, right? So the ordering here is different from the ordering in my plot, but that's perfectly fine. And we see indeed that the GF samples are highly correlated, the Potoloma samples are highly correlated and they are not that highly correlated to each other. So we did a good job. We didn't mix up any samples, right? When we were pipeting, we didn't pipe at the wrong sample in the wrong cup, which would show up as one of these Potoloma samples having high correlation to the GFs and low correlation to the other ones. All right, so next step is to calculate some statistics, right? So for each probe on the array, calculate the overall mean on the log two normalized data using a for loop. All right, so I gave you guys a little bit of code. So let me just copy paste the code from the assignments into the Notepad++ window, right? So that we have something to start working with. So this is hashtag Q12. So this was the code that I gave you guys. Interesting, this is actually all of the code that it was. All right, so there's no real programming. So in this case, question 12 was just copy pasting the code that was in the assignment into the R window. But there's some additional, choose two groups that you might think be interesting. For example, F1GF and BFMIGF. We can select from the array annotation, the Atlas ID or the company ID of the arrays in this group. We can use this to extract the correct columns of data. All right, we can now use array F1GF to extract the correct columns from the array data. Do this for both groups. Okay, so let's calculate the overall means, right? Just to make sure. So we go to R, look here, right? I did not call it log two norm data. I call it norm array data. So let me fix that because I didn't really adhere to the naming scheme as presented in the assignment. All right, so it takes a little bit of time, right? Because for every probe on the array, the 55,000 that we have, it needs to calculate the means, right? So I have my probe mean, and then I add it to the overall mean. So I initially make an empty variable. I then go through the rows, compute the mean value, store it in the variable, and then I add this variable to the ones that we already have, and then store it back into the overall means. So let me show you a plot of the overall means. Again, so what we see here is we see all of the probes that we have, and then we see here on the top that some of them have some of the probes, some of the genes are highly expressed, as some of them are lowly expressed, and then there's some in the middle. If we do a histogram, right, to see how it kind of looks, we see that it's not a normal distribution, right? Because we see that there's a bunch of genes which are not expressed, and like 14,000 probes are not really targeting anything. They're targeting something which is very lowly expressed, but then we see that there is some expression and kind of the last part seems to be half a normal in a way, right? So there's a lot of non-expressors, and then there's some which are really expressed. All right, so now we have to do the two groups, right? Because we want to take two of these groups, compare them together, and then the assignment was to calculate the means of the two groups and then calculate the log two ratio between the means and then the p-value using a t-test. Use the following structure of code. All right, so that's very good. So now we actually have to code. So let me just copy paste this from the assignments. So this is hashtag Q13, and then we do it like, that's not correct, that's not the code that I wanted to copy. So just, oh. Copy. All right, so this is what we had, right? So for example, we want to say mean F1, gonadal fat, so these are the means, the means for the BFMI animals, the log two ratios, and the p-values, right? So these are just going to hold all of the mean values, then we will calculate the log two, and then we will do the p-value calculation using a t-test. All right, so let's do this. So first step is calculate the mean of F1, GF, and already we had the selection thing, right? Because that was also in the assignment, so I'm just going, oh. Why does it not copy paste properly from the PDF? Right, so what I'm going to do is say arrays, strain is F1, the tissue is GF, right? Because those are the individuals that I want to select, and then I'm going to ask which, right? And now I'm going to say these are the individuals, arrays, F1, gonadal fat, and then I want to get the company ID for them, so this was wrong in the assignment because it says Atlas ID, but I changed that to Comp ID, right? So these are then the names. So let me just run this for you in R just to see how this works. So if I look at individuals on arrays, it says that five and six, only two of them, that's interesting. So if we look at arrays, and then we look at these rows, yeah, then we see that indeed there's an individual 2010 and an individual 2024 measured F1, gonadal fat. If we look at the whole arrays file, then there shouldn't be more. No, so why do I actually want to do this then? Because why not just look at BFMI hypothalamus? All right, I'm going to change it up a little bit. So I'm just going to say, I want more samples, right? Two is not enough to really do statistics on. So I'm going to say take, instead of F1s, takes BFMIs and take BFMI for hypothalamus, right? Because if we go to R then we can see that we probably have, I think four, yeah, right? And four is enough so that we can do a T test. So this is then the first group that I'm going to select and I'm just going to do the other one as well. So let me go back to notepad++. So I'm just going to rename my variable. So this is going to be individual array BFMI and then this is going to be hot day. And then I want to select these and then get the company ID and then store them in array or let's just call them names underscore BFMI hot day, right? So now this should give me then hot day 513, hot day 514, hot day 527. And then if I would look at names underscore BFMI hot day then we indeed see that these are the individuals that we selected. All right, then we want to do the same thing of course not for the BFMI but for another group. So let's take the B6Ns, right? And do the exact same thing. Oh crap, I didn't show you guys the R window, right? So here I make the selection, right? And this is what I expect from my selection. And then what I'm doing is say from the array matrix take these individuals and then take their company ID. And then I just have a vector which has four names in there. All right, then do the exact same thing for the B6s. So let's go back to Notepad, right? So I'm just going to copy paste the code here, right? And now I'm going to say instead of strain I want to have the B6Ns and then I'm going to update my variable name, update my variable name and then the rest here as well. And just to check, so I'm going to run it like this and then I'm going to say names B6Hate. So those are only two again. Why are there only two? B6Hate, Hate, interesting. I thought we had three of these for each of them. Because what I want to do is statistical tests, right? I want to have at least a couple of them. At least like three, because otherwise a t-test is not going to work. So is there any group for which I have more than three except for the BFMIs? How the hell is that possible? Okay, I only gave you half of the data set. That is a shame, that is a shame that we went so far only to figure out that we don't have enough samples to do the t-test. Right, because we need to have at least three samples in each of the group. All right, so let's fake it, right? I was looking at the R window to see if there's any group which has more than three samples in there, but there's two F1 Hates, two, no, four BFMI Hates, two F1 GFs, two BFMI GFs. No, there's four BFMI GFs. All right, so we can do a t-test between those. So we're going to t-test the BFMI hypothalamus which are four against the BFMI gonadal FETs which are also four. So then we're allowed to do a t-test. So back to the code, tissue GF, and then here we want to have BFMI again, right? So we're now going to say this is BFMI GF and the same thing here, names BFMI GF. All right, so now when we run these four lines, we should end up with two vectors of names, right? So we have the names of the BFMI hypothalamus and we have the names of the BFMI in gonadal FET, BFMI GF, BFMI GF. So something goes wrong there because we expect four of them. And here it says HD. So in my code I made a little mistake somewhere. Yeah, here. All right, so this is the selector and then I used an old variable as a selector. All right, let's run it again. Make sure that we check that we get what we expect, right? So the names BFMI HATE are these ones. Names BFMI GF are indeed the same. Good, all right, so now we have two groups, right? Two groups with a minimum of three individuals in there so we can do a t-test. We go back to Notepad and of course this selection doesn't have to be inside of the for loop. So I'm just gonna take this whole code out and do it first, right? So first figure out the names of the individuals and then do it like this because these won't change, right? If we look at the next probe or at the next probe then the names of the individuals will not change within each of the groups. All right, so now I have to select these individuals. So the HATE, so I'm going to say here that and that, GF. All right, so let me take array data, right? We're currently looking. So in the for loop we're going to go through each of the rows. So I'm just going to take an X is my designator for the row that we're currently looking at. So I'm just going to say array data X comma and then names BFMI HATE, right? So if we would paste this into R, we would have an error, right? Because X is not defined or it is, X is actually defined. So we're going to define it as one. We're going to look and then we see that these are the bright corner values for these four individuals. And then we can change this to GF and that seems to work as well. So we see that indeed the bright corner for gonadal fat is much lower now than the bright corner for the hypothalamus, right? And that, of course, it has due to the fact that we normalized. So we pulled all the gonadal fat samples down and we pushed the, no, we had the hypothalamus samples of course are much higher expressed than the gonadal fat, right? Because brain is more active than thing than fat tissue. Good, all right. So these are then our values. So we're going to say I want to calculate the mean of the BFMI hypothalamus and I want to store this in mean BFMI hypothalamus. And I'm just going to say combine it with this and then calculate the new mean, right? And then going to just duplicate the line. So I'm going to press control D for duplication. And then I'm going to do the exact same thing for the BFMI's gonadal fat. I just have to remember that here I select the right columns. So I'm just going to select the names of the BFMI for the gonadal fat. Good, so now I'm calculating the first mean, the second mean I then want to calculate the log two and then add those to the log two ratios. So that's the next step. So next step is say that log two ratios is a combination of the log two ratios that we already had. And then we want to take the log two of the mean hypothalamus divided by the mean of the gonadal fat. So that's the log two ratio. And then we need to calculate the p-value using a t-test, which is relatively easy because we can just say t-dot test. We are going to say mean BFMI hate. No, because we don't want to use the mean, we want to use the real values, right? So we're going to take the four values from the current row that we're looking at and then we're going to t-test this against the same but now we're going to t-test against the gonadal fat samples, right? So here I have a group of four values. Here I have a group of four values. So I'm allowed to do a t-test. Let's run this for the first one, right? Just to make sure that it makes sense. All right, so does it match up with the values that we have? So mean of X, so the mean of X are the hate values is around 17,000, right? That seems more or less correct with the values that we saw. The mean of the other group is around 9,000, seems correct in a way, right? Because some of them are slightly lower, some of them are above 9,000, so that seems to be correct. We have a t-test and then we see that this is significantly different. So we want to remember the p-value. So we need to get the p-value out, right? And I have no idea how to get it out so I'm just going to say str. So show me the structure of this thing, right? So it tells me that the t-test, right? Gives me back a list of 10 items, right? So the first item is the t-statistic, the second item is the degrees of freedom, and the third item is the p-value, which is what I want, right? So it also tells me how to select them, so I need to do dollar p-value. So let's do the t-test, right? So let's copy the code into Notepad++. So I'm just gonna do the t-test and then I'm going to store this rest t for results of t-test. And now I want to take out the p-value. So in R, it tells me that I have to do dollar p-value, which is good, so I'm just going to do that on the rest t, so the results of the t-test. Take the dollar, use the p-value, and then of course I'm going to add this two p-values because I want to remember that, right? So again, the same structure, right? And this structure is very common in R, that you use a for loop to go through the different rows. For each row, you calculate different things, right? So here I calculate the mean, and then I add this to a variable holding all the means. All right, so now I'm getting my mean value for hypothalamus, mean value for grenade of fat, the log to ratio, and then I have my p-value. So I'm in the end, I'm filling up all my vectors every time with one additional value. All right, so seems to be okay. Let's not run it for all of them because that will take some time. Let's just run it for the first 10, right? So I'm just going to make it a little bit quicker. Right, and of course when you're checking things, and when you're running code that you just wrote, you always want to run it on a very limited amount. Good, so there's an unexpected symbol. Mm-hmm, mm-hmm. So unexpected symbol, it crashes where? Resty. Interesting, because this is a common, then it is resty. So this one seems to work because we tested it. Then what is wrong with this one? Nothing, all right. So why, so I'm just going to look, ah, here, right? So I'm going to use bracket testing, right? So I'm just going to put my mouse behind every bracket to make sure that the opening bracket and the closing bracket match. And then when I do that here, I actually see that this bracket here is not closed, right? Because when I put my mouse here, it doesn't highlight red. Well, if I put my mouse here, it says that this bracket closes this one and then we have one more left that we still need to close. All right, so let's close this one as well. All right, then do it for the first 10. Go back to R, see if that works. 20 warnings, all right. So what are the warnings? Argument is not numeric or logical returning NA. That's an interesting observation. So indeed, it did return all NAs. So let's go to Notepad++ and then explicitly make sure that here, we'd say as numeric, right? So transform the values to numeric when we do the mean function just to make sure that it's transformed into means. Of course, we still have to check that it does it correctly, right? But at least this should fix the first warning. So we go back to R. Now it didn't have any warnings. So I'm just going to check the... So I'm just gonna C bind all of these things together, right? Because I want to have a column. So I'm going to say mean BFMI hypothalamus, mean BFMI gonadal fat. Then I want to have my log two ratio in the third column and then the fourth column should be the P values. Looks a little bit weird, right? Because we only did the first 10. So we're only expecting 10. And it is kind of repeating in a way. Interesting. Interesting. Why does it give me such a big matrix then? What is the length of this thing? What is the length of gonadal fat? What is the length of the log two ratio? 55. How the hell can that be 55? Yes, I already figured it out, right? So here we take the log two ratio of the mean BFMI hate divided by the mean BFMI gf. But this is a vector. So this becomes longer every time. So we just have to say take x, right? And here do the same thing. Take x, the current one that we're doing, right? Because we're every time adding a new mean. So then the log two can just work with vectors. So this is going to be an issue. All right, so let's rerun the code. I'm actually going to go to R because I don't want to write the C bind again. So I'm just going to copy it out. And then I'm going to put that in notepad plus plus. And then I'm just going to copy, paste it in. All right, now I get what I expect, right? I expect it, let me show you guys the R window, right? So now I'm just gonna run the code and now I get what I expect because I expect 10 rows to be done. I expect to have a mean for the first group, mean for the second group, the log ratio between the group and then the p value. Seems okay. I'm just a little bit wondering why this number divided by this number, right? That's 1.98. And then the log two of that is going to be indeed almost one. Good, all right, perfect. Good, then we can run it for all of them. So let's run it for all of them. And in that case, instead of doing it to one to 10, we want to do it to the number of rows. Oh, we want to do it to the number of rows of RayData. We want to do this for all 50,000 of them. All right, so now when we see bind, we also want to assign. So this is a resM for result matrix. And then it should work now. And instead of doing 10, we're now going to do all of them. And then that should give us the first overview that we want, right? And of course, there will be many, many different genes differentially expressed between brain and fat tissue. So it's not the most logical test to do, but it's weird that I didn't give you a full data set. I might have loaded an old data set myself, but I don't know, I don't know. I don't know what went wrong there. Like I forgot initially to upload the assignments altogether, so it could be that I just made a mistake and uploaded the wrong data set with the assignments. Because the initial idea is to do this t-test, right? So to do the association tests, or to see if there's a significant difference or not. And of course, that doesn't work when there's only two individuals in a group, because for a t-test, you need to have at least three measurements before you can do a test. And of course, the t-test here is also not really what you wanna do, because a t-test, of course, assumes a normal distribution. And with only four values, you don't know for sure that it's a normal distribution. So you probably want to do a different type of test. Like a non-parametric t-test. Good, takes a little bit of time, right? It needs to calculate 55,000 mean values twice, or 100,000 or 110,000 mean values needs to calculate 50,000 log ratios, and it needs to do 50,000 t-tests. So it's not gonna be quick. And of course, you can optimize this as well, but in the end, waiting a little bit, it's okay, right? You can drink a little bit of coffee, and we're actually done. All right, so let's check the first 10, right? So just to make sure that everything went okay. How much RAM capacity does the computer you use have? Let me check. I think it's 12 gigs, 16 gigs, 32 gigs, actually. I'm just looking at my windows, so 32 gigs. But it's not really memory bound in this case. It's really CPU bound, because we're only using one CPU core, and hey, it needs to do all of these statistical tests. If we want to know how much memory we are using, you can do a GC, I think, and then it will tell you how much memory you're currently using. So we're only using one gig of memory in total. And well, one, so you have the end cells and the fee cells. So the end cells are the ones which are hot, so they are readily available, triggered, so much use, so yeah. So we're using like between 400 and one gig of memory, so. Not, yeah, of course, if you have a very small computer, then of course it takes much longer because, but in the end, the speed of the CPU for this matters much more because you're doing so many computations. But yeah, all right, so let's look at the first 10 of the REST matrix. Looks the same as before, and of course we do want to remember the probes, right? Because we all again see that here, we don't have any row names, but we do want to have some row names. So let's put some row names on. So just say row names of RESTM, RESTM is the row names of array data, right? Because we do it in the same order. This is something that we're allowed to do. So let's go, no, to Firefox, go to the R window and then show you guys that then we indeed have the names. Good. So now, of course, we can use the plot function, right? To, for example, plot the difference in mean. So we can plot the log two ratios and we have the p values. So we could actually do a volcano plot if we wanted to, which we also did in the lecture. So let's do a volcano plot, right? So let's do a plot. So the volcano plot on the x-axis has the log two ratio. So that's easy, RESTM comma log two ratios. And then on the y-axis, it has the minus log 10 of the p-value. So minus log 10 of the RESTM and then take out the p-values column, right? So this should produce kind of a volcano plot and indeed it does. And that looks actually pretty okay, right? Like we see that there are down-regulated probes. There's up-regulated probes. We also see here that there's like some probes which are down-regulated. Do not have the same level of significance but are like more or less 10-fold down-regulated. So there's massive, it's often one tissue and on in the other one. And we see the same thing here. And we see the classic kind of volcano shape and that there's no value which has a log two ratio of zero with a very high p-value because of course there's no difference when the log two ratio is zero. So it looks pretty good. We could add some colors to it but we can now kind of visualize our data. So the interesting probes, if we are interested in the difference between gonadal fat and hypothalamus, so between the two groups that we selected are these here, right? These are significantly different, 10 to the minus 10. So they are very, very, that they're significantly different but also these are very interesting. Although they are not that significant because they show a large difference between the two groups, they can still be considered very interesting, right? These ones are not, right? They show a very large difference between the groups but there's no real significance for it. One of the things is that we also see here is that there seems to be a little bit of an inflation in the p-values because we expect this to actually run parallel to zero but it doesn't do that. Good, I think those were it, right? There were some, oh yeah, creative volcano plot. Yeah, so we did. Okay, so that's nice. Good, so we did our job. We made a really nice volcano plot and then the additional one is color the dots of the volcano plot by the distance to the origin of the plot. So, but that's something for you guys to figure out or at least to play with. Like I generally don't answer additional questions although they might actually be in the, no, there's no playing with colors in the original answers either. So I'm just gonna add my volcano plot to the bottom, right, which I did here and then this would be question 13. So, hashtag Q13. All right, those were the assignments. Just for you guys, like a little bit of data to play with, a little bit of like, how do I do this? And of course, the struggle is always there, right? I've been programming in R for 10 plus years and even I still have these moments where I'm thinking like, oh, okay, have what's going on? Have why is there an error? Sometimes because a black bracket is not closed and in this case R, or a notepad plus plus can really help you, right? Because of the bracket highlighting you can just put your mouse behind any bracket and it will highlight the opening one. So when I put it behind, it directly showed me. Obi-Vanck, thank you for following. We did a little bit of a t-test. Unfortunately, the groups that I had written down into the assignment did not make sense. So we switched to a different group because we need to have like a reasonable sample size before we are allowed to do a t-test. We made a little bit of a heat map, calculated some overall means, and then how we did this filling in of these three steps where we do the mean calculation, the log two ratio, then the p-value. And in the end, we end up with a relatively fancy looking plot, right? The nice volcano plot. And of course here we can then start adding colors and doing all kinds of other. Interesting stuff, right? And that's it for at least the assignments. So I've been talking for over an hour, so I'm going to stop the recording. Guys, on Twitch, if you're here, thank you very much for being here and stick around because there will be a part two and a part three. If you're watching this on YouTube, then buy for now and probably the next video will come very soon since these are the answers and I will be also uploading the lectures. So see you guys.