Alright, so welcome everyone, also the people on Moodle, welcome, we are just going to start. Today we will have a lecture about standards for analysis; there are a lot of standard things in bioinformatics, so I wanted to run through a couple of them. Of course, again: please register for the exam if you haven't already. Very important. I haven't gotten feedback from anyone that they were unable to register, so I guess everything is going well there, and I hope that everyone can find it in AGNES and can register. If you have any problems registering, send me an email and I can contact the Prüfungsbüro, and if you don't register, I think two or three weeks before the date, then you are unable to participate. So if you're planning on getting a grade for the course, do register.

So the overview for today: first I wanted to talk about different types of files and formats which are commonly used in bioinformatics, like comma-separated files, FASTA files, FASTQ files, GFF, VCF and BED files. There are lots of other formats, like SAM and BAM and these kinds of things, but they are generally very complex. Someone asked me to add PDB to it, and I wrote a PDB parser, well, not a couple of years ago, but around a year ago, and the format is described in something like a 160-page document, so I thought that was a little bit overkill and I don't think it makes sense to discuss that. After we've discussed the different file types, we will talk about some coding things, because there are standards there as well: I wanted to talk about the difference between things like unit tests, regression tests, integration tests and so on, and then say a couple of words about test-driven development and agile development. It doesn't play a big role in academia, but sometimes it's good to know what's out there and how you can structure your programming work in a more organized way. I also wanted to talk a little bit about documentation, code documentation mostly, and the different types of code documentation which are out there. And then I added a part which I stole from the R lecture, or stole, well, it's not stealing if you made the original one, right, so I kind of auto-plagiarized it, and that is about R packages. I'm going to teach you how to make an R package, and hopefully we will get to that after the first or second break.

But of course, like always, before we start, let's have a look at the assignments. I hope everyone did the assignments; I'm not so sure everyone did, because I didn't get any questions. I did get someone emailing me, but not about the assignments: that was about wanting the time travel paper from Kary Mullis, which is a really nice paper to read, so if you're interested just send me an email. I also had someone who wanted to know about one of the papers written from freely available GEO data sets, so I sent that as well. I might actually upload the Kary Mullis paper to Moodle and make it obligatory; I was very busy this week, and "I will do it this week", yeah, that's what I say as well when I don't want to do stuff.
We could actually do a prediction about that: did Commando do the assignments this week, yes or no? But we're not going to do that. It is important to do them, it's the only way you can learn programming, and since last week's lecture had a lot of programming, if you want to learn how to program you just have to sit down, get stuck, bang your head against the wall, Google for half an hour, and if you still can't figure it out, then just send me an email. But at least I will go through the answers now, and I will also put the answers online. I was a little bit lax, I think I only uploaded the assignments on Monday or Tuesday, so it's also a little bit my fault. So if I don't upload the videos and the other stuff to Moodle tomorrow, so before Friday, then do send me an email to remind me, because if I forget then it just has to pop into my mind again, and usually it pops into my mind when I start preparing the next stream, so that's generally Tuesdays or Mondays.

Alrighty then, the assignments from last week were about gene expression analysis, so I will show you my Notepad++ window with my answers. Again, when you are coding, always make sure that you add a little header: just say in one line what the content of the file is and who made it; you could add a copyright statement or when it was created, but at least have something on top of your file so that people can see who did it, who they should contact with questions, and what is in the file.

So the first thing that I did was load the preprocessCore library. It only comes up around question five or six, but when I write code, the way I always structure it is: first load all external libraries, then set my working directory. The reason is that you can then see directly at the top of the script which libraries are required. If you would put it all the way down where it is first needed, which is probably somewhere around here, then it is kind of hidden, and you don't want to hide things: you want people that use your script to know what they should install beforehand. So where did I store the data? Well, in my case the data is stored in this folder, D drive, project, lectures, blah, but on your hard drive that will be different.

Then there were three files that you should load in; I think the first couple of questions could be done without loading the probe annotation matrix, but you need at least the array data and the arrays. Here R provides the really useful read.table function. Of course you have to look into the file first, so you take the array data .txt file and open it in Notepad++; let me actually do that so you can see how the file looks. So let's open up the array data. What I generally do when I open up a file, and I always use Notepad++ or an editor which allows me to view these kinds of symbols: you see these little arrows here, that means they are tab characters. If you had spaces, those would be little yellow dots, and the tabs are these little arrows. So it's directly obvious that this is a tab-separated file, and of course we have to specify that the separator used in this file is a tab.
There is a header, right: if I look at the array data I see that there's a header line which looks different from the other lines; the other lines have numbers or sequences in them, but here you don't have that. One of the things you can see is that the first column header is called Sequence, and when you look a little further down you see that the sequence is actually the second element, so it's not the first element. That means that there are row names in here, so you have to specify row.names = 1, because every row has a name and the name doesn't have a column header. The same thing holds for the arrays, and here I added colClasses = "character". I don't know exactly why I did that, let me see, oh yeah, because I don't want these numbers to become numbers: they are animal identifiers that we use, an individual ID. The problem is that if I just load it in and let R convert everything to the best possible type, then the last column would be converted to either integers or factors, and I don't want that, I don't want to use factors in this case. So I specify colClasses, which lets you specify for each column what the class is, but I'm just going to say "character", which forces all of the columns to be of the character type. And I'm loading the annotation matrix the same way: just open it up, look into the file, see that there's a header, so set header to TRUE, and the separator is a tab.

So let's go to R, load it all in, and make our first box plot; there is already a little trick hidden in that. But first let's go to R and show you how these different data files look. We're just going to read them in; the array data takes a little bit of time, and you can then use the head function to show the first couple of lines. We see that it has indeed loaded it in correctly, and we can see that there's a bright corner, let me see, that's annoying, so let's switch to R so you can see it. When I look at the head of the array data I can see that it's more or less fine: it has row names, because these things are repeated for each of the rows; if they had not been repeated, there would be another column and it would have been named Sequence, and that would be incorrect. And here you see that there's a bright corner and a dark corner: those are the positive and negative controls of a microarray. The bright corner is always on and gives you the maximum intensity value for each of the samples, and the dark corner is the minimum intensity value of the microarray. These are just built-in positive and negative probes which are either always on or always off, so you get an idea of what the dynamic range of the microarray is.

All right, so let's look at the arrays and show a little bit. This is just the description of the arrays: it has the original file name, the file that we got from the company, so you can see that it was done in September 2009, and these kinds of things. We have something called the company ID, the ID that the company assigned to our sample, and we have the strain, which here is either the Berlin Fat Mouse (BFMI) or an F1, which is a cross between the B6 and the Berlin Fat Mouse.
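To make that concrete, here is a sketch of what the loading part of such a script could look like. The file names, the path, and which file needs row names are assumptions on my part, so check your own files in an editor first:

    # Answers to the gene expression assignment
    # Author: <your name>, created: <date>

    # external libraries first, so readers can see what needs to be installed
    library(preprocessCore)                       # needed later for normalize.quantiles()

    # working directory: this will differ on your machine
    setwd("D:/projects/lectures/gene_expression")

    # tab-separated files, inspected in an editor first to decide on the arguments
    arraydata  <- read.table("arraydata.txt", sep = "\t", header = TRUE, row.names = 1)
    arrays     <- read.table("arrays.txt", sep = "\t", header = TRUE, colClasses = "character")
    annotation <- read.table("annotation.txt", sep = "\t", header = TRUE)

    head(arraydata)    # probes on the rows, Sequence plus one column per sample
    head(arrays)       # one row per array: file name, company ID, strain, tissue, individual
    head(annotation)   # one row per probe: targeted gene, description, genomic location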
If we want to know how many strains there are, we can just look at the strain column, and when we show it we see that there are also B6N animals in there. Then we have two types of tissue: HT, which stands for hypothalamus, and GF, which stands for gonadal fat. And then of course we have the different individuals, which are marked by the individual ID from the mouse house. So that's how this file looks. Then we had the annotation matrix, so let me show a little bit of that; the nice thing about R is that you can just type a couple of letters and press the tab key and it will automatically fill in the rest. Meravigliata, thank you for following me, thank you, I love that feature that it gives me a little sound effect when someone presses the follow button, I'm glad I built that in; I hope you're enjoying the lecture. So looking at the annotation matrix, we see that it has the different probes, where they are located on the genome, which gene they are targeting, a little bit of description of that gene, and the location of the gene that is being targeted.

So those are the dimensions: the array data file has columns and rows, the rows couple to the annotation matrix and the columns couple to the arrays. And when I want to make a box plot of this file: since the first column is called Sequence and it's not plottable, you can't make a box plot of characters, so if I just say boxplot of the whole thing, boxplot of the array data, then I would probably get an error, or it will make a box plot, but you see that it gives me an error. To prevent that, let's go back to Notepad++. When I look at the files I see that the columns of the array data file are named by the company ID, because the company gave us back the IDs, so I can select the company ID from the arrays and then use this to index the array data. This will only select the real samples that we sent, and it won't select the Sequence column. When I then create a box plot, with las = 2 to flip the axis labels, I'm able to make a nice box plot, or well, a box plot that shows me what's going on. It takes a while to make, and what you see is that the numbers range from around zero, so the lowest intensity is in the order of zero, and you see that the highest intensity varies a lot between the different arrays, and that these are all very big numbers. You don't really see a box for a box plot, you just see a massive amount of outliers on top, and this has to do with the fact that when you get intensity data scanned by a laser, this intensity data is not normally distributed. A normal distribution would show up as a normal box, but since the boxes are more or less squeezed all the way to the bottom, you know that most of the genes are off, because there's no intensity, and the genes that are on look like outliers because they are very intense compared to the average of the genes that are off.
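A sketch of that selection and box plot, assuming the company ID column in arrays is literally called CompanyID (yours may be named differently):

    # use the company IDs to pick only the real sample columns and skip the Sequence column
    samplecols <- arrays[, "CompanyID"]
    boxplot(arraydata[, samplecols], las = 2)   # las = 2 rotates the sample labels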
All right, so the first question, or I think the second question, well, the first question was: use the dim function to view the dimensions. How many probes are there in the array data file? If we look at the array data file, we just saw that the probes are on the rows, and we see that there are fifty-five thousand eight hundred and twenty-one probes on the array. There are seventeen columns, one of them is the sequence, so there are sixteen samples in our data file. Take a closer look at the header of the array data file: is there any structure to the column names? Let's look at the head again. Is there any structure to the names? Well, there kind of is, because the names are made from the tissue, HT, and then the animal ID, so that's the structure; of course there's also the Sequence column, which doesn't follow that, and the sequence is just the sequence of the probe.

All right, question number two: use the plot function to plot the HT2010 column of the array data file. I skipped that in the answers, but we can just select this column and then use the plot function to plot it. This will take a while as well, and then we see the image: here we see the fifty-five thousand probes, and for each of the probes we see the intensity. You see that most of the genes are actually not expressed, or most of the probes are more or less not giving you an intensity signal, and you see that the genes which are expressed are all the way at the top. Of course, if you would make a histogram of this, so instead of plot you use hist for a histogram, you would see that this is definitely not a normal distribution; it looks more like a Poisson distribution, but it's not really Poisson because these are not counts.

All right: print parts of the other files to your R session and figure out what the different columns and rows are; we already did that. Select only the columns containing data from the array data file, in other words do not plot the probe sequences, and use the boxplot function to visualize. That's the one we did before: we take the company ID column from the arrays, this is the description of the arrays, and use this ordering to select the individuals from the array data and then make a box plot with las = 2. I use las = 2 just so that the names of the different samples are rotated, perpendicular to the axis.

All right, the next question: use the log2 function in R to transform the microarray data, and save the resulting matrix into a new variable called log2arraydata. So let me switch back to Notepad++. The way that I did this is, well, I didn't save it in a new variable, I just overwrote the old data, which of course is also possible, but then you have to reload if you want to go back, so I generally don't like these kinds of constructions. What I do is I take the same selection that I did the box plot on before, I do the log2 transformation, and then I put it back into the original data frame. But this is a destructive operation, which means that once you do this you cannot go back, and that's not the best way to do it. A better way would be to follow the assignment and say: I'm just going to store this in something called log2arraydata and then plot the log2arraydata, and now it's not destructive, now I just make a new variable. One of the drawbacks of course is that I'm now copying the whole matrix with 55,000 probes and 16 samples, but depending on whether your computer has a lot of
RAM memory, this may not be an issue, but if you're very low on RAM it could be better to do the destructive operation and just copy it back into where it came from, because that doesn't duplicate the data, you just overwrite it. I'm just going to keep the answer that I had, so I'm going to do the destructive operation; normally I wouldn't, but for this case it doesn't matter too much. So we take the log2, we store it back, and then we make a box plot again. Let's go to R and show you how that looks, and now we see a very common structure: the minimum intensity is slightly above zero, and that's just the way it works, there's no real zero here, because we're looking at intensity: if you shoot a laser at something you will always get some intensity, it's never exactly zero. (That is a really nice emoticon, I like the emotes that we have.) All right, but what we see here is that the first four samples have a slightly higher mean compared to the next four samples, then we see another four samples which have a slightly higher mean again, so it seems that the average expression of the array is kind of coupled to the tissue that we're looking at.

That directly brings me to a point about the normalization procedure: it might be that in the brain, in the hypothalamus, there are more genes active than in gonadal fat, because if there are more genes active, you would expect the average expression to be a little bit higher. But you also see that some arrays have outliers, and that every array is slightly different from the others, and we have to get rid of this, because it might be due to some real biology going on, but it might also be due to technical variation: it might be that the hypothalamus samples just had slightly higher concentrations, because extracting RNA (well, we go from DNA to RNA, so RNA) from hypothalamus might be more efficient than extracting it from gonadal fat. So it might have a biological origin, but this variance might also be technical, and of course we want to apply a normalization to get rid of this kind of technical variation, or at least the large amount of variation.

All right: create a boxplot of the unnormalized data, we did that, and save the results. Create a boxplot of the unnormalized but log2-transformed data: what do we observe? We observe that there seems to be a slight correlation, in the sense that hypothalamus has a slightly higher average expression, meaning that there might be more genes active in hypothalamus than in gonadal fat, but we can't really do anything with this observation yet, because we have to normalize our data first. For the normalization we are using the preprocessCore library, so let me load the library so that I have it ready. The next question is: load the library, normalize the data using the normalize.quantiles function, save the resulting matrix in a new variable, and select an appropriate name. Let me see what I answered there, so let's show the Notepad++ window: here I do the normalize.quantiles, I do have a little header telling me what I'm doing, and I'm actually just not adhering to what I said, because I'm again using a destructive transformation.
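A non-destructive sketch of the log2 transformation and the quantile normalization, reusing the hypothetical samplecols from the sketch above:

    # log2-transform into a new object instead of overwriting the raw data
    log2arraydata <- log2(arraydata[, samplecols])
    boxplot(log2arraydata, las = 2)

    # quantile normalization; normalize.quantiles() expects a matrix and drops the
    # row and column names, so we put them back afterwards
    normdata <- normalize.quantiles(as.matrix(log2arraydata))
    rownames(normdata) <- rownames(log2arraydata)
    colnames(normdata) <- colnames(log2arraydata)
    boxplot(normdata, las = 2)   # all arrays now share the same distribution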
So I'm again copying it back into the original matrix. Yes, you could save it in a new variable, which is advisable, but for some reason when I wrote the answers I didn't care too much about destroying the original data or being able to go back one step in the analysis. All right, so let's use this function and do the plot again, and then we see this: after we've done the normalize.quantiles, every array has the same minimum value, the same maximum value, all of the means are the same, and also the standard deviations around these means have been harmonized to be exactly identical to each other.

Can you explain the difference between normalization and winsorizing? Winsorizing is looking at your data values and fixing little errors like decimal-point errors. For example, if I measure a mouse to be 10.1 centimeters, then sometimes it ends up being written down as 101 or 1.01 in the data, because a lot of times when you're in a lab and measuring things, you're just writing them down on paper, and that creates something like a 5% error rate, because every time you write something down you can make a little mistake. So when you're looking at your data and you see that one of your values is a mouse which is 10 times bigger or 10 times smaller than the average, you always have to start wondering: is there a decimal-point error? Winsorizing means pulling that one value back into the range in which you expect it to be; you don't expect one of your mice to be a hundred centimeters, you would have noticed that during the experiment. So winsorizing is more or less looking at your data by eye and blotting out values that you don't trust, or putting values back into the normal range. Winsorizing, named after the statistician Winsor, is something you do to get rid of mistakes in your data, whereas normalization is not done to get rid of mistakes, it is done to harmonize data. In this case we are normalizing, and like I told you last time, there are two ways of normalizing: normalizing within arrays and normalizing between arrays. Since the idea here is to compare these arrays to each other, if one of the arrays went from 0 to 10 and the other array from 10 to 20 in intensity values, then every gene between these two arrays would come up as being different, and that's of course not true: if we look at a gonadal fat sample and compare it to another gonadal fat sample, we don't expect all of the genes in the genome to be different. So normalization generally is a technique that allows you to get rid of unwanted variation, which sometimes means that you also get rid of some real biological effects, whereas winsorizing is looking at your data and removing things which are obvious errors.

One thing which also falls under winsorizing, which I would not advise, but I've seen people do it, is that when you have a missing data point you put in the average for that missing data point. I would never do that, because statistically speaking it does not really make sense to replace a missing value by the average, and it can actually hurt you in the long run.
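Winsorizing is not part of the assignment, but just to illustrate the idea: the textbook version clamps everything outside chosen percentiles back to those bounds, whereas the by-eye version described above would simply correct the obvious decimal-point errors by hand. A minimal sketch of the percentile version:

    winsorize <- function(x, lower = 0.05, upper = 0.95) {
      bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
      x[x < bounds[1]] <- bounds[1]   # pull suspiciously small values up to the lower bound
      x[x > bounds[2]] <- bounds[2]   # pull suspiciously large values down to the upper bound
      x
    }

    mice <- c(10.1, 9.8, 101, 10.4, 1.01, 9.9)   # two likely decimal-point errors
    winsorize(mice)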
But winsorizing is not really a normalization technique, it's just a technique to get rid of errors. When we now look at our data, we see that everything has the same mean, but before we said that the observation was that the hypothalamus might have more genes active compared to the gonadal fat. So by doing this normalization step we might have removed some really important biological phenomenon, and this is one of the issues with normalization: since normalization is a blunt tool, you can normalize away things which might be really interesting. So you always have to be careful when you normalize, and you have to have a good reason to do it, but since we want to compare arrays with each other, we can't have one array on a completely different scale than the other ones. So again, create the boxplot of the now log2-transformed and normalized data, like in the lecture; yeah, this looks pretty normalized.

All right, question 10. When looking at the other files, we observe that this data is coming from multiple individuals and has multiple tissues measured. First we want to know something more about the relationship between the different arrays, and the first thing that I generally do when I get back microarray data is to look at the correlation between the different samples. So question number 10 is: use the correlation function (cor) and create a correlation matrix, and save the matrix in a new variable; and question number 11 is: use the heatmap function to plot the correlation matrix as a heat map, what do you see in the heat map? All right, let's go back to the script. I'm doing this in one go, because I don't want to store it and define a lot of variables, so I'm using the cor function, correlating again only the columns which have data, and here you see that I use the Spearman method.

I think most people know that correlation comes in several forms: you have Pearson correlation and you have Spearman correlation. Pearson correlation you can use when things are normally distributed; if you have a distribution which is not normal, you want to use Spearman correlation, because it uses a rank method, so it's not sensitive to outliers. And why did I choose Spearman? That has to do with how the box plot looks: if we go back, we see that it is not a normal distribution, the distribution at the bottom of the arrays is different from the distribution at the top. I can show you this in a different way as well: let's just make a histogram of one of the arrays, take the first one and make a histogram of it, and then you see this, and this is definitely not a normal distribution. So you don't want to use Pearson correlation to compare these things, you definitely want to use Spearman. You can also see very clearly that some of the values are between 0 and 1, which looks a little awkward, but that's just because it's a log2-transformed data set. You see that most of the genes, more than 15,000 genes on the first array, are not on, they are not doing anything at all, so they are off, and then you see that there's a tail of expression towards the upside. So there's not a normal distribution, and because there's not a normal distribution I have to use Spearman correlation instead of Pearson correlation.
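A sketch of checking the distribution and building the correlation matrix, continuing with the hypothetical normdata from above:

    hist(normdata[, 1], breaks = 50)             # clearly not a normal distribution
    cors <- cor(normdata, method = "spearman")   # rank-based, so robust to the skew
    round(cors[1:5, 1:5], 2)                     # peek at the first five arrays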
In this case it wouldn't matter too much, I think; I think the answer would be very similar, but be aware that correlation comes in two or three different flavors, depending on what distribution you are looking at. All right, so let's do the heat map of the correlation matrix; this might take a little while, but let's just do it. So we see this picture, and from this picture we learn a whole bunch of things, because we can now say how our data quality was and whether everything went well. The first thing we observe is the clustering: if we look at the dendrogram on the top, or on the side, what we see is that on average hypothalamus samples are a lot more similar to hypothalamus samples than they are to gonadal fat samples. So we learn here that we didn't switch any of the tubes: we didn't accidentally put a hypothalamus sample into a gonadal fat tube, and we didn't swap a gonadal fat sample with a hypothalamus sample. We can see directly that the samples are assigned okay. Well, they could all still be wrong, all of the gonadal fats could actually be hypothalamus, but at least there's an internal consistency here.

"Why didn't you use Kendall correlation instead of Spearman for the non-normal distribution?" Well, Spearman is a rank-based method, so it's non-parametric; Pearson is a parametric method, so you can only use it for normally distributed data; and Kendall's tau is also a non-parametric one, slightly different, but I think Kendall is preferred when you have lots of missing data. I think that's the difference: Pearson is parametric, Spearman is non-parametric and works fine when everything is there, and then you have Kendall's tau correlation, which is optimized for when you have large amounts of missing data. But you could use Kendall as well; R provides three different ways of computing correlation. Since there is no missing data here, because you shoot every little dot with a laser, so every little dot gets an intensity, it doesn't matter much. "I thought Kendall is better when you focus on a rank-based method." I would say that Dr. Spearman would probably disagree with that statement, and your statement would be preferred by Kendall.
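And a sketch of turning that matrix into the heat map discussed here:

    # scale = "none" keeps the raw correlations as colours; the dendrograms come from
    # hierarchical clustering of the rows and columns
    heatmap(cors, scale = "none", main = "Spearman correlation between arrays")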
And of course it's a matter of flavor, right; there are many different correlation methods, there are also other variants, so there are hundreds of different correlation methods and all of them have their own advantages and drawbacks. Someone like me prefers an easy method, and the thing I like a lot is that Spearman correlation is just Pearson correlation on ranked data, so they are kind of interchangeable in a way. But it's up to you which correlation you prefer, and it also depends on your distribution; my way of looking at it is: if I have a normal distribution I use Pearson, if I have a non-normal distribution I tend to use Spearman, but you could use Kendall as well, and they will probably give more or less the same answer. We can actually just check that: let's show the correlation matrix, like this, no, that's too big, so let's look at the first five versus the first five. This is using Spearman, then we can use Pearson, and then we can use Kendall, I think I can just type "ken". And here you see the drawback of Kendall correlation: it takes a long time, it is a very heavy method. It might be more accurate, but the computational time doesn't really weigh up against the benefit. We're still waiting, which is not bad, if you're a bioinformatician you're quite used to waiting, but this takes too long, this is not a very workable method to use here. Let's get a coffee, well, I already have a coffee, so we can wait a little bit, but in this case all three methods would give you more or less the exact same answer. I'm thinking that this is just going to crash R; it is actually already in a non-responding state. Oh, you can't see that, you can only see the header of the window. All right, so it's just freezing, freezing, freezing. Yeah, "cancel it", that's easier said than done, because it totally crashed. So let's not do that, let's use the task manager to get rid of it and open up a new R window, which is just a pain because I have to resize everything, which I already did at the beginning and now have to redo. So let's resize it like this, and a little bit more, still a couple of pixels, and yeah, there we are. So let's reload all of our data and make our heat map again. Kendall correlation seems to be preferable when you have non-parametric data and your data set is small; I think that would be my observation here.

The heat map shows us that indeed gonadal fat is similar to gonadal fat and hypothalamus is similar to hypothalamus, so we learn that we didn't mix up any samples. One of the nice things that I think you can see here is that there do seem to be three groups, because we know from the data that we had the BFMIs, we had the B6s, and then we had the cross between the two, and we can see here three samples which cluster together; they are probably the BFMI, they could be the B6 as well. And then you see another three samples here; why am I missing a sample here, one two three four five six seven eight, huh, that's interesting. But there does seem to be some grouping, which we would expect, because there are three different types of animals in our data, so we would expect to see these groups back as well, but these groups are not as clear.
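If you want to compare the three methods yourself without freezing R, here is a sketch on a small random subset of probes; the subsampling is my own workaround, not something from the assignment:

    set.seed(42)
    idx   <- sample(nrow(normdata), 1000)   # Kendall compares all pairs of values, so keep it small
    small <- normdata[idx, ]
    cor(small, method = "pearson")[1:3, 1:3]
    cor(small, method = "spearman")[1:3, 1:3]
    cor(small, method = "kendall")[1:3, 1:3]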
So we can't really say much about that; we might be able to see it if we looked at a single tissue, and instead of plotting the name of the sample we plotted the strain of the sample. Then we might find out that these three are indeed related to each other; this is the pattern we would have expected from a data set which looks like this.

All right, the next one: for each probe on the array, calculate the overall mean on the log2-normalized data using a for loop. I think this was one of the harder ones, although the answer is more or less given; I don't know why I gave you the answer here. That's an interesting one, I already know what's wrong: I'm showing you the answers for the R lecture, and the R lecture has slightly different questions, that's not what I want. But anyway, let's just do question number 12, so I'm just going to say q12 and copy-paste the code. What is happening here? I want to calculate the overall means for each of the probes, so I start with overall means is NULL, or I could do something like this, an empty vector, so I haven't done any computation yet. Then I'm going to go through the rows of this normalized data, which I didn't store, so I'm just going to have to store it here; I'm taking out this part because I had overwritten it, and I'm just going to declare this variable. So I go through each of the rows, through each of the probes, I calculate the mean, and then I add this mean to the list of means that I created. This will go through and calculate all the means, so let's go to R and define that; I didn't define the variable yet, so I'm just going to do that here, and then calculate the overall means. This will take a little while, because it has to do this 55,000 times, and in the end we will get the overall means; if you're doing this on a laptop it will take some time, but like I said, that's bioinformatics for you.

All right, we get 50 warnings, so we definitely want to check the warnings. It tells me that for some of the computations of the mean it encountered a non-numeric or logical value, so it returned an NA. We have to figure out where that happens, and if we look at the overall means we see that it actually happens everywhere. So what went wrong? First let's look at the input data that we're using, and we see that we should be able to calculate a mean of the first row, so we call mean, and it gives me NA. For some reason, when it loaded the data set, R did one of these R things; R is really good at screwing up your data, especially when it comes to things like factors and numerics. If we ask for the class of this row, it says data.frame, ah, that's logical: a single row of a data frame is still a data frame, even though the first number in it is a numerical value. If I unlist it, I just get the numbers, and if I then ask for the class it's numeric, and the mean works. So it's a little bit weird that I cannot calculate the mean directly, I was expecting that it would work, but if I do an unlist on it, just to make sure that it's not a data frame anymore, then it works. That's kind of interesting: the mean function here doesn't work directly, because internally the row is not stored as plain numbers, it is still a data frame, a list of columns. So to prevent that, I'm
just going to change the code a little bit from what we had, and say: I know that when I unlist the row, it definitely takes the values out of this little one-row sub-data-frame, and then we just have plain values. So let's try this again, go to R, and now use the unlisted version to see if we can calculate the means. That should actually be a lot quicker, because computing and printing that warning was slowing it down quite a lot, so it should be relatively quick in calculating the means of each of the probes. All right, no warnings, so everything should have gone correctly. Here we see the overall means, we can just look at them like this, and when we plot them we can again see, for each of the probes, the mean expression of the probe. This already looks different compared to when we looked at an individual array: it seems that more genes are active, and this is of course because the genes which are active in the brain generally tend to not be active in the fat, and the genes which are active in the fat tend to not be active in the brain. So on average there are many more probes that show expression. If we make a histogram of the overall means we should also see that this is the case: there's still a lot of genes, like 14,000, which are not active, but we see that this hump here is much bigger, so there are more genes which are active if I look at each of the probes and ask whether they're active in at least one of the two tissues.

All right, the next question: choose two groups which you think might be interesting, for example the F1 gonadal fat and the BFMI gonadal fat. We can select from the array annotation the company ID or the atlas ID of the arrays in this group, and we can use this to extract the correct columns of the data. There was a little example, so let's just copy-paste the example into Notepad++; oh, I already had this one. So what we do here is we ask which of the arrays have strain F1 and tissue gonadal fat, and then I can extract these from the arrays, because now I'm getting the indexes into the arrays, and then I take these rows and only take the company ID, so the F1 gonadal fat samples. So I'm saying: take from the strain column only the entries where it is F1, and in the tissue column it has to be GF; same thing here, from the strain column take BFMI, from the tissue column take gonadal fat, and then take only the company IDs of these individuals. Let's go to R and show you the R window: we have the F1 gonadal fat arrays, there are two F1 samples for gonadal fat, and if we look at the Berlin Fat Mouse, the BFMI gonadal fat, we see that there are actually four gonadal fat samples of the Berlin Fat Mouse. All right, now we can use the F1 array IDs to extract the correct columns from the array data, and do this for both groups. So when I look at the array data I can select the first 10 rows and select, for example, the F1 gonadal fat samples, and it will give me just those two columns, and I can do the same thing for the BFMI, in which case I get four columns. All right, then the next question was a more complicated one.
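To tie these answers together, here is a sketch of the fixed means loop and of the group selection, continuing from the hypothetical normdata and arrays objects above; the column names Strain, Tissue and CompanyID and the strain and tissue codes are assumptions, so check your own file:

    # per-probe overall means; unlist() flattens a one-row data frame slice into a plain
    # numeric vector so that mean() works (on a matrix the unlist is not strictly needed)
    overall.means <- NULL
    for (r in 1:nrow(normdata)) {
      overall.means <- c(overall.means, mean(unlist(normdata[r, ])))
    }
    overall.means <- apply(normdata, 1, mean)   # the same thing, the more idiomatic way
    plot(overall.means)
    hist(overall.means, breaks = 50)

    # company IDs of the two groups we want to compare
    f1.gf   <- arrays[which(arrays[, "Strain"] == "F1"   & arrays[, "Tissue"] == "GF"), "CompanyID"]
    bfmi.gf <- arrays[which(arrays[, "Strain"] == "BFMI" & arrays[, "Tissue"] == "GF"), "CompanyID"]
    normdata[1:10, f1.gf]     # the two F1 gonadal fat columns
    normdata[1:10, bfmi.gf]   # the four BFMI gonadal fat columns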
Because now I wanted you to compute several things in one go: in the first for loop we only computed the mean, but now the question is: calculate, for all of these probes, the mean of the two groups that you selected, for example the BFMI gonadal fat and the F1 gonadal fat, the log2 ratio between these means, and the p-value using a t-test. And the t-test is going to go wrong here, because for a t-test you have to have, I think, at least three samples, so I think this will go horribly wrong, but let me show you. The first thing I do is create the variables which will hold my answers: I create mean F1 gonadal fat, which will hold the means for the F1 individuals in gonadal fat, and the same thing for the BFMI, mean BFMI gonadal fat. I have my log2 ratios, which initially are NULL because I haven't computed anything yet, and my p-values, which initially are also NULL. Then I go through the rows of the array data, and for each row I take the F1 individuals for that probe, take the BFMI individuals for the same probe, calculate the means, and store these as mean one and mean two. Then I divide mean one by mean two and take the log2, that is the ratio, and then I do the t-test, which probably will fail for all of them because I have only four individuals versus two individuals, but I'm going to do it anyway and calculate the p-value. Then I add the things that I computed to the correct variable that holds them for all of the probes: the computed mean to the means of the F1, the computed mean to the means of the BFMI, the ratio to the log2 ratios, and the p-value to the p-values. And then we just run this. I don't think the t-test will work, but if we're lucky it does; if we're unlucky we just get p-values which are all NA, but at least we tried to answer it. If we had taken two different groups, say the B6N versus the BFMI, then it would probably have been three versus four samples or something like that. This will of course take much longer, because now, every time we look at a row, we take the F1 individuals, calculate the mean, take the BFMI individuals for the same row, calculate the mean, calculate the log ratio and then do the t-test, so it's kind of heavy and it has to do more computational work than in the first loop. So again we can take some coffee and let this run; I'm very impatient, so I'm just going to quit it halfway through, press the stop button, and see how far we got. We were almost done, so I was a little bit too impatient, because we stopped it at 52,000 out of 55,000, but that doesn't matter too much.

Because now the idea was to create the volcano plot by plotting the minus log10 p-values against the log2 ratios. We have the log2 ratios, which look like this, and we have the p-values, but we only stored the raw p-values, so now we have to take the minus log10 of the p-values, and we want to plot this with the p-values on the y-axis: y equals minus log10 of the p-values, and x equals the log2 ratios. So the x-axis is the log2 ratio, and the y-axis is the minus log10 p-value.
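A sketch of that loop, reusing f1.gf and bfmi.gf from the sketch above; the tryCatch is my own addition, to guard against the t-test failing on such small groups:

    mean.f1 <- mean.bfmi <- log2.ratios <- p.values <- NULL
    for (r in 1:nrow(normdata)) {
      m1 <- mean(normdata[r, f1.gf])     # mean of the F1 gonadal fat samples for this probe
      m2 <- mean(normdata[r, bfmi.gf])   # mean of the BFMI gonadal fat samples
      p  <- tryCatch(t.test(normdata[r, f1.gf], normdata[r, bfmi.gf])$p.value,
                     error = function(e) NA)
      mean.f1     <- c(mean.f1, m1)
      mean.bfmi   <- c(mean.bfmi, m2)
      log2.ratios <- c(log2.ratios, log2(m1 / m2))
      p.values    <- c(p.values, p)
    }
    plot(x = log2.ratios, y = -log10(p.values))   # the volcano plot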
And now we see the characteristic volcano plot. You see the zero line, where there is no difference in the ratio, and on the y-axis the minus log10 of the p-value. So which genes are interesting to look at? Those are the ones all the way over here, because these have a big difference between the F1s and the BFMIs and they are also very significantly different: this is 1 times 10 to the minus 6, this is 1 times 10 to the minus 5. So the interesting genes are more or less in this area when we look at the up-regulated genes, and these genes over here are the interesting ones when we look at the down-regulated genes. All right, and then there was an additional question: color the dots of the volcano plot by their distance from the origin. Did I do that? No, I didn't color them, but you can do that: you can compute the Euclidean distance from each of the points to (0, 0) and then color them based on that. "Error 404, what's not found?" What are you missing? I will make you guys a really nice colored volcano plot for the additional question, but this is more or less how you would look at it. "It came up in R on my laptop." You got a 404 error on your laptop, in R? That seems strange, R generally doesn't give 404 errors; is that a 404 for loading the data? Because that can give a file-not-found: then you have to download the proper files and put them in the correct working directory. You do have to extract the zip file, though. I've had people in the past trying to load the zip file into R directly; you can make that work, because R has functions for reading from zip files, unz I think, or gzip, so you could load from zip files directly if you wanted to, but I would advise you to just extract the zip file and then set your working directory to where you extracted it.

But this is more or less how you look at microarray data; it's just a very, very basic introduction. Let me look, the recording is now going on for 54 minutes and we only discussed the answers, which is perfectly fine, so let's go back to the PowerPoint. For everyone watching on Moodle, you are going to miss one of the amazing breaks; the break is going to be pandas, I think, I think the first one is pandas. All right, so I will stop the recording.
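For completeness, here is a sketch of the colored volcano plot promised for the additional question, continuing from the log2.ratios and p.values computed above; the color ramp is just one possible choice:

    lp   <- -log10(p.values)
    d    <- sqrt(log2.ratios^2 + lp^2)           # Euclidean distance of each probe from (0, 0)
    cols <- rev(heat.colors(100))[cut(d, 100)]   # the most extreme probes end up red
    plot(x = log2.ratios, y = lp, col = cols, pch = 19,
         xlab = "log2 ratio (F1 / BFMI)", ylab = "-log10(p-value)")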