everyone. If you're watching this on YouTube, welcome to part two of lecture 12, the analysis of microarray data, or gene expression data. If you're watching this on Twitch, then thanks for still being here. So we've been talking about normalization and how to do statistical tests, but a problem comes in with microarrays, because on a microarray you generally measure 10,000 or 100,000 genes. So you don't do a single statistical test, you do literally 10,000 or 100,000 tests, and that means you have a multiple testing issue. For every test you say: I want the p-value to be smaller than 0.05. But every test then has a 5% error rate, so when you do 100 tests, you expect about five genes to show a significant difference purely by random chance. You have to correct for that. You can make a type 1 error, which is very common when you're doing 10,000 tests: you call a gene significantly different when it is not, just because your threshold is not strict enough. You can avoid this type of error by doing a Bonferroni correction. The other error is a type 2 error: you miss a gene that really has changed. Based on your statistical test you say this gene is not differentially expressed, while in reality it is. You can reduce type 2 errors by using a Benjamini-Hochberg false discovery rate adjustment, which is less strict than Bonferroni. And of course you can only optimize for one of the two. If I minimize my type 1 errors, I have to accept that I'm going to make more type 2 errors, and if I minimize my type 2 errors, I'm going to make more type 1 errors.
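To put rough numbers on the multiple testing problem described above, here is a small R sketch. It assumes that none of the genes is truly different and that the tests are independent, which is a simplification; the variable names are only illustrative.

```r
# A minimal sketch: how many false positives to expect at p < 0.05
# when no gene is truly different (assumption: independent tests).
alpha   <- 0.05
n_tests <- c(1, 100, 10000)

# Expected number of genes called significant purely by chance.
expected_false_positives <- alpha * n_tests

# Probability of at least one false positive across all tests.
family_wise_error <- 1 - (1 - alpha)^n_tests

# Bonferroni idea: divide the threshold by the number of tests.
bonferroni_threshold <- alpha / n_tests

print(data.frame(n_tests, expected_false_positives,
                 family_wise_error, bonferroni_threshold))
```

With 100 tests you already expect about five false positives, and it is almost certain that at least one shows up, which is why the threshold has to become stricter as the number of tests grows.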
So in R, when we want to do a multiple testing adjustment, we can use the p.adjust function, which has three parameters. The first parameter is the p-value that you got. The second is the type of adjustment that you want to do; for example, I want to minimize my type 1 errors, so I say Bonferroni. The third is the total number of tests that you did. In this case I say that I did 10 tests, so the statement says: correct this p-value for the fact that we have done 10 tests in total. The number of tests normally corresponds to the number of probes or genes measured on your array, so this third number will generally be in the order of 10,000, 20,000 or even 100,000, depending on which microarray you're using. The nice thing about p.adjust is that you don't have to do this one by one. You don't have to write a for loop and go through all of the p-values; you can give it 100,000 p-values in one go, say that the number of tests is 100,000, and it will adjust all of them for you. Good. So now we know which genes are really differentially expressed: we did our microarrays, we scanned them, we did background correction, normalization and statistical tests, we adjusted our tests, and we are left with a list of genes. Hopefully it's a small list, and all of these genes are differentially expressed. Now we want to know if they have anything in common. The next step would be to either do a gene ontology analysis or a pathway analysis. Pathway analysis we already discussed in lecture six.
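Going back to the p.adjust call described above, a short sketch of how it might be used. The p-values here are made up for illustration; only p.adjust and its arguments p, method and n come from R itself.

```r
# One p-value, corrected for 10 tests in total, as in the lecture.
single_adjusted <- p.adjust(0.01, method = "bonferroni", n = 10)
print(single_adjusted)

# A whole vector in one go, no for loop needed.
# (Hypothetical p-values; n defaults to the length of the vector.)
p_values <- c(0.0001, 0.004, 0.02, 0.03, 0.5)
p_bonferroni <- p.adjust(p_values, method = "bonferroni")
p_bh         <- p.adjust(p_values, method = "BH")

print(rbind(raw = p_values, bonferroni = p_bonferroni, BH = p_bh))

# Number of genes still below 0.05 after each adjustment.
n_sig_bonferroni <- sum(p_bonferroni < 0.05)
n_sig_bh         <- sum(p_bh < 0.05)
```

On this toy vector the Bonferroni adjustment keeps fewer genes than Benjamini-Hochberg, which is the trade-off between type 1 and type 2 errors mentioned earlier.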
That would mean that you take KEGG, look up the pathway you're interested in, and say: take this KEGG pathway and overlay my data onto it. For each gene I know if it's upregulated or downregulated, so color the upregulated genes green and the downregulated genes red. Then you can reason about what happens in the pathway. Another way of doing this is gene ontology. Imagine that I have no idea which pathway might be involved or which biological process might be affected. Then gene ontology is something I can use to figure out in which direction I should look. Gene ontology is a collaborative effort to address the need for consistent descriptions of gene products across databases. It has three pillars: cellular component, biological process and molecular function, and every gene has annotations in these three. The cellular component annotation says, for example, that this gene is located in the endoplasmic reticulum, or that it is active in the nucleus. It can also refer to a group of gene products: a cellular component can also be the ribosome, the proteasome, or a certain protein dimer. So cellular component is where in the cell the gene is normally found to be active. Biological process is another tree, which is subdivided into all kinds of more specific biological processes. A biological process is defined as a series of events accomplished by one or more organized assemblies of molecular functions. It is not really equivalent to a pathway, but it is very similar. A biological process might be DNA replication.
Molecular function describes the activity of the gene that occurs at the molecular level. Molecular functions generally correspond to activities that can be performed by individual gene products, but some activities are performed by assembled complexes. It's closely related to biological process: the molecular function might be the breaking down of amino acids, and the biological process that belongs to it is degradation, the protein degradation process. So molecular functions are generally smaller than biological processes. So how does this look? If we look at the biological process tree, all the way at the top we see biological process. In this part of the tree there are two biological processes below it: cellular processes, which occur within a cell, and metabolic processes, which belong to the metabolome. Below those we have cellular metabolic processes, and metabolic processes can be either cellular metabolic processes or small molecule metabolic processes. Then you see that this gets split into smaller and smaller groups, and the further down the tree you get, the more specific a group becomes. You see, for example, fat-soluble vitamin metabolic processes, vitamin K, or quinone cofactors. This tree is very big; this is just a very small part of it. So when you do a gene ontology analysis, you take the genes, and every gene in the genome has its ontology annotated, more or less. Then you do an over- or under-representation test on your list of differentially expressed genes. So imagine that I have a list of genes.
These genes came up in my analysis, and they are the genes which differ between, for example, treated and untreated cells. How does the result of a gene ontology over-representation analysis look? Well, it looks like this. Biological process itself, at the top, is not affected. But the further down the tree you go, the more significant it becomes. Here you see that what all these genes have in common is a cellular metabolic process, and the process which was most affected is, for example, the photosynthesis pathway. That's how it works. It takes your list of genes and says: you have 50 genes in your list, and in the whole genome there are 150 genes annotated to this process. Then it tests whether the proportion of genes with this annotation in your list is different from the proportion of genes with this annotation in the whole genome. Besides knowing where these things are located, we also want to visualize them. When we write a paper, we have to show our readers what is happening in our analysis and in the experiment that we did. So this is my list of propositions for my thesis. When you do your PhD in Holland, then next to your PhD thesis, which is this book of a couple of hundred pages describing which papers you published and what kind of research you did, you also have a list of propositions. One of my propositions was that there is no good communication on big data between statisticians and biologists without proper visualizations. If you are a statistician, you speak a completely different language than a biologist, and if you want to communicate, you need a language which you can share.
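The over-representation test described a moment ago can be sketched with a hypergeometric test. The 50 genes in the list and the 150 annotated genes come from the lecture; the genome size of 20,000 and the 10 list genes carrying the annotation are made-up numbers, and real gene ontology tools may use a different test or extra corrections.

```r
# A minimal sketch of a GO over-representation test.
genome_size   <- 20000  # assumption: total annotated genes in the genome
annotated     <- 150    # genes annotated to this process (from the lecture)
list_size     <- 50     # differentially expressed genes (from the lecture)
list_with_ann <- 10     # assumption: list genes carrying this annotation

# Number of annotated genes you would expect in the list by chance.
expected_in_list <- list_size * annotated / genome_size

# P(at least 'list_with_ann' annotated genes in a random list of this size).
p_over <- phyper(list_with_ann - 1, annotated, genome_size - annotated,
                 list_size, lower.tail = FALSE)

# The same question asked as a one-sided Fisher exact test on a 2x2 table.
tab <- matrix(c(list_with_ann, annotated - list_with_ann,
                list_size - list_with_ann,
                genome_size - annotated - (list_size - list_with_ann)),
              nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(expected = expected_in_list, p_hypergeometric = p_over,
        p_fisher = p_fisher))
```

Because this test is repeated for every term in the tree, the resulting p-values need a multiple testing adjustment as well.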
So you need to create images, diagrams or animations to communicate your message as a bioinformatician or statistician to other people, for example biologists. This becomes incredibly important when you're dealing with a large number of genes. If I have 150 genes which are differentially expressed, how do I communicate which genes those were and how much differential expression I saw? For that you need visualizations. The most common way of visualizing things is a heat map. Heat maps show you, for example, if certain tissues are different. Here I've created a heat map where I compare the hypothalamus samples with the gonadal fat samples. What we can see very clearly is that hypothalamus has a completely different expression profile than gonadal fat: yellow means that samples are more similar and red means that they are different. You can see, for example, that gonadal fat 513 is very similar to gonadal fat 514, and that hypothalamus 514 is very similar to hypothalamus 527. If you want to make a heat map in R, the function just takes a matrix. So if you have your matrix of differentially expressed genes, or a matrix with all of the genes in there, you can just call heatmap on your matrix. The heatmap function has a couple of important parameters. Two of them are Rowv and Colv. These determine if a dendrogram is attached to your heat map: I can cluster in the x direction, and I can also cluster in the y direction. So if I switch Rowv off, which in base R's heatmap function means setting it to NA, the rows would not be clustered and there would be no tree on that side. Besides that, we also have the scale parameter.
The scale parameter of the heatmap function works like this: if you say scale is row, then for every row it calculates the mean and makes all of the colors relative to that mean. You can also scale by column, and then the same thing happens for each column: it calculates the mean and the colors are relative to this mean value. In general, I always tell people that if you make a heat map, you should first set scale to none. So welcome back. Look at the packages. I got them. Good. So, heat maps. Heat maps can be made sample to sample, or gene to gene if we want to. Here I'm plotting samples versus samples, and we can see that the hypothalamus samples cluster very well with the other hypothalamus samples. Rowv and Colv determine if there is going to be a dendrogram on the top and on the side, and if the rows and columns should be reordered. Besides that, we have the scale parameter: always set scale to none in the first instance, because then it won't rescale based on row or column means. What you generally also want to use is RowSideColors or ColSideColors. That adds an additional bar here, which you can use to add colors for the different groups in your data. It could add a little bar which is blue for males and orange or pinkish for females, so I can see if the females cluster together and the males cluster together. I can do that for the rows and also for the columns, so I can add two different bars: on one side, for example, the tissues, and on the other side blue for males and pink for females.
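As a rough sketch of the heatmap call described above, here is a sample-to-sample heat map on simulated data. The expression matrix, sample names and colors are invented for illustration, and the pdf(NULL) line is only there so the sketch also runs without a screen; drop it in an interactive session.

```r
# A minimal sketch with simulated data: 50 genes x 6 samples.
set.seed(1)
gene_mean <- rnorm(50, mean = 8, sd = 2)       # hypothetical gene baselines
expr <- matrix(gene_mean, nrow = 50, ncol = 6) +
        matrix(rnorm(50 * 6, sd = 0.5), nrow = 50)
colnames(expr) <- c(paste0("hypothalamus_", 1:3),
                    paste0("gonadal_fat_", 1:3))
rownames(expr) <- paste0("gene", 1:50)
expr[1:25, 4:6] <- expr[1:25, 4:6] + 3         # make the tissues differ

# Sample-to-sample similarity: correlation between expression profiles.
sample_cor <- cor(expr)

# One color per sample, drawn as an extra bar next to the heat map.
tissue_colors <- rep(c("blue", "orange"), each = 3)

pdf(NULL)  # throwaway graphics device, nothing is written to disk
hm <- heatmap(sample_cor,
              scale = "none",                  # no row/column rescaling
              RowSideColors = tissue_colors,
              ColSideColors = tissue_colors)
dev.off()

# heatmap() returns the row and column order it used for plotting.
print(colnames(sample_cor)[hm$colInd])
```

Leaving Rowv and Colv at their defaults draws both dendrograms; in base R's heatmap, passing NA for either one suppresses that dendrogram and the reordering that goes with it.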
Besides that, like we already saw, one of the most common visualizations is the dendrogram or phylogenetic tree. I want to talk a little bit about dendrograms, because we already made a couple. They are based on similarity between expression profiles, but similarity is something which is hard to capture mathematically. There are three distance measures, which are all related to each other. I write down the formulas here not because I want you to memorize them, but because if you see the formulas, you see that they are related. Imagine that we have x and y, which are, for example, sample one and sample two. For each sample I have measured 10,000 genes. So what do I do when I calculate the Manhattan distance? I look at the first gene, take its expression level in sample one, and subtract the expression level of the same gene in sample two. I do that for all of the probes, from the first probe up to P, so 10,000. Every time I subtract the one from the other, and these bars mean that I take the absolute value. Then I sum everything up, and that is the Manhattan distance. Besides Manhattan distance, I can also use Euclidean distance. It is more or less the same, but now we square the difference. We again go from one to all of the probes, square each difference before summing, and once we have summed everything up, we take the square root of the total. Euclidean distance puts more weight on genes with large differences, because 5 to the power of 2 becomes 25, while 4 to the power of 2 only becomes 16. In Manhattan distance I would have said 5 plus 4, and here I'm saying 25 plus 16.
Of course, I take the square root in the end to bring it back into a similar range, but Euclidean distance puts more weight on large differences than on small differences. Then you have the generalization of this, which is called the Minkowski distance. Minkowski distance with m is 1 reverts back to Manhattan distance, and with m is 2 it reverts back to Euclidean distance. But Minkowski distance allows us to use any exponent that we want. We can say m is 10: raise each absolute difference to the power of 10, sum them all up, and in the end take not the square root but the 10th root. So these are three distance measures, but the Minkowski distance is just the generalization of Manhattan and Euclidean distance. This now allows me to define a score that says how distant two samples are from each other. That is what you see in a dendrogram: the distance between two samples. You can have things which are very distant from each other, but also things which are very close. Good. So a little example: profile one and profile two are just the profiles of sample one and sample two, just some numbers. We calculate the Manhattan distance and the Euclidean distance, and you can see that in the end the difference is not that big, but it can give you noticeably different results depending on what you look at. Euclidean distance skews your visualization or your tree towards large differences having more weight, while Manhattan distance counts every unit of difference the same. All right.
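The three distance measures above can be written as one small R function. The two profiles below are invented so that the differences are 5 and 4, like the numbers used in the explanation; they are not the values from the slide, and the helper name minkowski is just illustrative.

```r
# A minimal sketch: Minkowski distance with exponent m.
# m = 1 gives Manhattan distance, m = 2 gives Euclidean distance.
minkowski <- function(x, y, m) {
  sum(abs(x - y)^m)^(1 / m)
}

# Hypothetical expression profiles of two samples (4 probes).
profile1 <- c(7, 3, 6, 2)
profile2 <- c(2, 7, 6, 2)   # differences: 5, -4, 0, 0

manhattan   <- minkowski(profile1, profile2, 1)    # 5 + 4
euclidean   <- minkowski(profile1, profile2, 2)    # sqrt(25 + 16)
minkowski10 <- minkowski(profile1, profile2, 10)   # 10th root of the sum

print(c(manhattan = manhattan, euclidean = euclidean,
        minkowski10 = minkowski10))

# R's built-in dist() computes the same distances between matrix rows.
profiles <- rbind(profile1, profile2)
dist(profiles, method = "manhattan")
dist(profiles, method = "euclidean")
dist(profiles, method = "minkowski", p = 10)
```

As m grows, the result moves towards the single largest difference, so higher exponents put ever more weight on the probes that differ most.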
So when we make dendrograms, we need a distance between individual i and individual j. When we do hierarchical clustering to create these dendrograms, we have a matrix where each element is the distance between two samples, such as sample one and sample two, or sample three and sample two. All elements are non-negative, because you can't have a negative distance; that would mean that you're more similar than identical. In these distance matrices the diagonal is always zero, because something compared with itself is exactly identical. So exactly identical is zero, and very different is, for example, nine. A smaller distance means that the profiles are more similar: across the whole 10,000 genes, if you get a value of 15, you're more similar than when you get a value of 150. These distance matrices are symmetrical, because the distance from i to j is the same as the distance from j to i. So the distance from one to two is one, and the distance from two to one is also one. The dimension is n times n. Generally we take n to be the number of genes, because we want to know if genes show the same expression profile, but we can also do it across samples, and then we see if samples have similar expression profiles. So how do you now do hierarchical clustering? This is very similar to what I showed you last time for multiple pairwise alignment. At the start of the procedure you search for the smallest element in the distance matrix. Here it's the distance from one to two, which is one. We form a cluster, so we group one and two together, and then we calculate a new distance matrix. That's how we group.
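A small sketch of the distance matrix properties and of the first clustering step. The matrix D is hypothetical: it is not the one on the slide, only chosen so that one and two are the closest pair with a distance of one and the largest distance is nine.

```r
# Hypothetical 5 x 5 distance matrix between five samples.
D <- matrix(c(0, 1, 4, 7, 8,
              1, 0, 5, 8, 9,
              4, 5, 0, 3, 4,
              7, 8, 3, 0, 2,
              8, 9, 4, 2, 0),
            nrow = 5, byrow = TRUE,
            dimnames = list(as.character(1:5), as.character(1:5)))

# Properties every distance matrix should have.
is_nonnegative   <- all(D >= 0)
zero_diagonal    <- all(diag(D) == 0)
is_symmetric_mat <- isSymmetric(D)

# First step of hierarchical clustering: find the smallest distance,
# ignoring the diagonal and the mirrored lower triangle.
D_upper <- D
D_upper[lower.tri(D_upper, diag = TRUE)] <- Inf
closest_pair <- as.vector(which(D_upper == min(D_upper), arr.ind = TRUE))

print(closest_pair)       # the two samples that are merged first
print(min(D_upper))       # the height of that first merge
```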
So this is the first branch in the tree, and we get the second branch by looking for the next smallest element, which is two. So now we group four with five. But now we get an issue: when we have to calculate the distance from this group to something else, how do we do that? We have two profiles in this group, and it isn't obvious how to compare two profiles at once with another profile. If we want to calculate the distance between a cluster and an individual profile, we have three different ways of doing this. The first is called single linkage: the distance between a cluster and a profile is computed as the distance between the two most similar elements in the two clusters. We can also use complete linkage. Instead of taking the two elements which are most similar, we take the two which are most dissimilar. So the distance between a cluster and a profile, or between two clusters, is the distance between the two most dissimilar elements. And then we have average linkage, which is also called UPGMA. When we calculate the distance between two clusters A and B, we take the average of all distances between pairs of objects x in A and y in B. So we take the mean distance between the elements of the cluster and the profile. So let me do a little drawing. Should I do a little drawing? I think I'm going to, since there's only Misha and Xanaxin and my moderator. Why not? I haven't drawn in some time. So let me see if this works. Let's go to draw. Let's take a pen. So for free? Yeah, this is going to be for free. Drawings are always welcome. All right. So imagine that we have...
So I'm not paid. Shut up and check my channel points. All right, let's do a little drawing. Distances can also be represented on a 2D plane. If we have a profile here and a profile here, then we can calculate the distance between these two profiles, and that's going to be very small. The distance is just the distance between these two points. Then we can have a point here, a point here, another point here and another point here. So we have these five samples, we compute the distances, and we visualize them like this. First we find the two points which are most similar. These two are the most similar, so this is going to be our first cluster: one, two. Now we want to group the next closest thing to the cluster, which is this one, of course. There are different ways to compute its distance. This is a little bit difficult to show because I didn't place the points well, so let me move one of the points a little bit closer to this one. Now, when we do single linkage, the distance between this cluster and this point is defined as this distance, because this is the closest one. We take the most similar element from the cluster and compute the distance from there. So the distance between the cluster and this point is the blue line. If we take complete linkage, we take the most dissimilar element, the one that's farthest away. This line is longer than the other one, and now we say that this is the distance between the cluster and the point. And if we take UPGMA?
We would calculate both distances and then take the average of them. So we would get a kind of green line, which is the average of the blue line and the red line. It doesn't end at one of the two points, but kind of in the middle between them. So blue is single linkage, red is complete linkage, and green is UPGMA. UPGMA is computationally heavier, because we have to calculate all of the distances and then average them. But now we group and make a new cluster, which includes the third point. And we start doing the same thing again: the cluster one comma two towards three is a certain distance, and that's the distance that is plotted in the tree. If I go back and show you the dendrogram again, then this is the distance between one and two: it's the height here on the axis, which runs from zero to 14. This is an iterative process, so we go through it one step at a time. If we go back to the drawing window, we have now defined a new cluster, and we need to know which of the remaining points is the closest. Let me remove the coloring. This is then the midpoint of the cluster. But now we see that we get very different distances. If we look at single linkage to the next closest point, it would be this distance, because this point is the closest to that one. But if we take complete linkage, it would compute the distance from this point all the way to this point, so it would give you a much, much bigger distance.
And if we use UPGMA, the distance ends up somewhere in the middle of these three points. So depending on which type of clustering we do, we get a certain tree, and these trees can be exactly the same, but they can also be different. That is one of the issues with hierarchical clustering: the way the dendrogram is formed can be substantially different when we use single linkage compared to complete linkage. That is because in one case we use the most similar elements to calculate the distance, and in the other case the most dissimilar elements. So when you do clustering, remember that there are three different ways of building your tree. Every distance matrix comes with three trees which could be completely different. In many cases these trees look exactly the same and the only thing which differs is the scale. It's an iterative process: we search for the smallest element, we form a cluster, we recalculate the distance matrix, and we repeat until everything is merged. In our little example we would merge one and two, then four and five, then we would merge three with the cluster of four and five, because that has a distance of three. We end up with two clusters, and then we merge these two. Our dendrogram would look a little bit like this. Let me just get the drawing out of the way. We would have the first cluster, which is one versus two. They are very closely related because they have a distance of one.
Then we have another cluster, four versus five, which has a distance of two, so double the distance. Then we have three. Three was closest to the cluster of four and five, at a distance of three, so three joins here. And then the distance between everything is four, so we end up with the last line connecting the two clusters at a distance of four. So here we have distance one, this is distance two, this is distance three and this is distance four. This is the cluster of one and two, this is the cluster of four and five, and this is three. In the end that is the dendrogram we get, with our scale of one, two, three and four next to it. If we use complete linkage, the distances in this tree would be different compared to single linkage. So that is how these things are built. This is how you would make a dendrogram by hand, and of course the clustering algorithm does it for you, so you can just use hierarchical clustering. But all of these clustering methods offer these three linkage methods, and your tree will look different based on which linkage you select. So there is no one tree that represents the evolutionary distance between all animals on this planet. There are actually three different dendrograms that you can make, based on the way you choose to summarize the clusters and compute the distance from a cluster to a new profile or to another cluster.
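The hand-built dendrogram above can be reproduced with R's hclust function. The distance matrix below is hypothetical: it was chosen so that single linkage gives the merge heights one, two, three and four from the example, and it is not the matrix shown on the slide.

```r
# Hypothetical distance matrix for five samples.
D <- matrix(c(0, 1, 4, 7, 8,
              1, 0, 5, 8, 9,
              4, 5, 0, 3, 4,
              7, 8, 3, 0, 2,
              8, 9, 4, 2, 0),
            nrow = 5, byrow = TRUE,
            dimnames = list(as.character(1:5), as.character(1:5)))
d <- as.dist(D)

# Same distances, three ways of measuring cluster-to-cluster distance.
tree_single   <- hclust(d, method = "single")    # most similar elements
tree_complete <- hclust(d, method = "complete")  # most dissimilar elements
tree_average  <- hclust(d, method = "average")   # UPGMA: mean of all pairs

# Heights at which the merges happen (the scale next to the dendrogram).
print(rbind(single   = tree_single$height,
            complete = tree_complete$height,
            average  = tree_average$height))

# Cutting each tree into two clusters shows who ends up with whom.
groups <- sapply(list(single = tree_single, complete = tree_complete,
                      average = tree_average), cutree, k = 2)
print(groups)
```

For this particular matrix the three trees group the samples the same way and only the heights differ; with other data the choice of linkage can also change which samples are joined. Calling plot on any of the three objects draws the dendrogram.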
All right, so for the last part I wanted to show you a couple of historical visualizations of gene expression data, which have been used to show people how many genes are differentially expressed. The first one is the MA plot. It's a type of Bland-Altman plot, and I think it was invented more or less in the 1980s already. On the x-axis we show A, which is defined as the mean expression value of a gene across the whole experiment. On the y-axis we show M, which is the log2 ratio. So it assumes that you have only two conditions, for example diseased tissue versus healthy tissue. These MA plots look like this. Here we see A: this little dot, for example, is a gene with an average expression level of 10 and an M of almost minus one. That means that this gene was expressed almost twice as much in one sample as in the other. The genes which are interesting are the ones which have a relatively high expression and a relatively large difference between the groups, because in the end two things count: not just the difference between the two samples, but also the expression level. If you are a highly expressed gene in one sample and a lowly expressed gene in the other, your ratio will be very big, but your average expression will be lower because of that big difference. In these MA plots a two-fold difference is more or less considered significant, but you also penalize very low expression: a very low expression level is considered not really interesting. You need a high expression level and a large difference before you become an interesting gene.
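To make M and A concrete, here is a short sketch that computes both for a handful of genes. The intensities and the two cutoffs are invented for illustration; A is taken here as the mean of the two log2 intensities, which is the usual convention for a two-condition MA plot.

```r
# Hypothetical raw intensities for four genes in two conditions.
healthy <- c(geneA = 2048, geneB = 512, geneC = 16, geneD = 1024)
disease <- c(geneA = 512,  geneB = 2048, geneC = 64, geneD = 1024)

# M: log2 ratio between the conditions (y-axis of the MA plot).
M <- log2(disease) - log2(healthy)

# A: average log2 expression across the conditions (x-axis).
A <- (log2(disease) + log2(healthy)) / 2

# Assumed cutoffs: at least a two-fold change and a reasonably
# high average expression before a gene counts as interesting.
interesting <- abs(M) >= 1 & A >= 8

print(data.frame(A, M, fold_change = 2^abs(M), interesting))

# plot(A, M) followed by abline(h = c(-1, 1), lty = 2) would draw
# the MA plot with the two-fold lines.
```

In this toy table geneC has the same four-fold change as geneB but a much lower average expression, so it is filtered out, which is the penalty on low expression described above.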
The volcano plot tries to visualize the same thing so the x axis shows M which is the log ratio so the difference between the samples between the two groups that you are comparing and then on the y-axis we now not show so here we show instead of the so the volcano plot puts the y-axis of the MA plot on the x-axis and then on the y-axis we show the log score so it is this the significance right so it's the minus log 10 of the significance. So why do we use the significance here and that is because if there is a lot of variation right so then it might be that the difference between the two groups is big but if the variation in these groups is also big then it's not really interesting because the difference is not that significant right and this is something that the MA plot does not show you. So a volcano plot looks like this and these are really nice plots to make so this is one I think that Manuel did and what we see here is we see the log 2 ratio so M right so we see here for example minus 2 which means that this gene here was four times lower expressed in one sample than it was in the other sample right and then here we see the minus log 10 p value and what we see here is that the p the minus log 10 of the p value is almost zero right which means that although the differences the difference was really big this gene is probably not that interesting right because of the fact that the p value is very small right so it had a big difference but also a big variance so the genes which are very interesting are the ones that are coming out of the volcano here on the top right so here we see that we have a p value for this gene of one times 10 to the minus 15 and then when we look we see indeed that there was kind of a four fold difference between the samples the same thing here we see now a gene which is upregulated again 10 the p value for differential expression was one times 10 to the minus 14 and we see that this was almost had 2 to the power of four which is almost 
16 times higher in one of the samples compared to the other sample. So the volcano plots and the MA plots are a way of visualizing a large amount of data, and generally the interesting genes are the ones that are more or less floating in this area at the top: the interesting upregulated genes and the interesting downregulated genes, or the other way around, because you can define which sample is which. Good, I hope that that's clear. I don't think that we're going to make a volcano plot, but they are really fun to make because you can do a lot of things with the coloring, and mathematically they're interesting because, hey, you can use something like a circle formula to color all of these dots black, and you can make the volcano look really, really beautiful, which helps getting your publication accepted. Good, so we talked a lot about microarray data. During the assignments we will use some microarray data from our group, the data set that I've been talking about, the one which shows you the hypothalamus and the gonadal fat in three different types of mice: our Berlin Fat Mouse, the standard laboratory mouse, and a cross between the two. But if you want to do science at home, you can do that, because there are two major databases which contain literally thousands of microarrays for free, which is a massive resource, because you have to imagine that every one of these microarrays costs around 80 dollars. So if you would want to do like a hundred microarrays, then you pay around eight thousand dollars for that. So we have a collaborative effort called Gene Expression Omnibus (GEO), which is run by the NCBI, the American National Center for Biotechnology Information. They store around 25,000 experiments, and there are 600,000 microarrays that you can download for free, and all of these are relatively well annotated, so it tells you this was a microarray done on humans and this was cancer tissue, 
lung cancer, taken from a patient who was this age, and some other parameters. But the drawback of Gene Expression Omnibus is that it's storage and retrieval only. So, hey, you can store your data there and anyone can retrieve your data, but there are no analysis tools on the website to do a pre-screening of whether microarrays might be interesting. The other website is ArrayExpress. They have a very, very big archive, even bigger than the Gene Expression Omnibus, so there's like 24,000 experiments and they have around 700,000 microarrays that you can download. But the nice thing is that they have something which is called the Gene Expression Atlas, and the Gene Expression Atlas is a curated, re-annotated part of their archive data. That means that a human looked at it and said: yes, this microarray was really done on a B6 mouse, on brain tissue, at this age of the mouse. So they call the researchers and they double check the data, make sure that everything is uploaded correctly, and that you can rely on the annotation of the data being correct. It has far fewer experiments, but there are like 130,000 arrays in there. So if you want to do some differential gene expression at home, like if you're interested in lung cancer, just go to the Gene Expression Atlas, download some standard lung tissue microarrays, download some lung cancer microarrays, and then just compare them. See which genes come up, see if you can find any pathways involved in lung cancer. Or, hey, if you're not interested in lung cancer but you're interested in obesity, then you can also have things like fat tissue in mice, and you can have mice on a high-fat diet or a low-fat diet. You can get all kinds of different tissues, like the liver and the skeletal muscle, and it's a really, really strong and massively valuable database to look around in and see if you can find data which you can use, and almost 
always there are microarray data sets available for the thing that you want to investigate. It doesn't only contain mice and humans, there are also plants in there and bacteria, so for anything that you're interested in there's almost always guaranteed to be some kind of an experiment. And you can get really, really high scoring publications out of it. There are science papers written by people that did not spend a dime doing their own microarrays, but that just reanalyzed like 10,000 arrays from GEO. And that is one of these things: as a bioinformatician, hey, you don't always have to set up your own experiments like a biologist. You don't have to grow your own plants or breed your own mice, because there is so much data already out there and available for free. Hey, you can get really nice high scoring publications by just downloading data, harmonizing across different experiments, and then drawing conclusions about what happened in the different experiments that you're looking at. The nice thing about the EBI, which hosts ArrayExpress, is that it also has some analysis tools, which makes it easier to screen whether microarrays might be useful for the experiment that you're doing, hey, because you can compare microarrays done on brain in mouse from one experiment to microarrays done on brain in another experiment, and then you can see if they are very different from each other or if they are already very similar before you start downloading the data. So this is how GEO looks. Hey, you just have the website, so you can go to the website and click around a little bit. You can browse by data set, by series, by platforms, and by samples. Platforms are the different types of microarrays. Here you can see that there are almost 1.3 million microarrays in there, so samples that have been done on microarrays. ArrayExpress is more or less similar, so, hey, it has 55,000 experiments and there are 1.6 million microarrays in there. If you want to download all of the data, in this case they 
tell you it would cost you 27 terabytes of hard drive storage to store all of the data that they are storing, and you can get this for free. So, hey, just take a basic number like 80 dollars per microarray. That means that here you have at least 1.6 million times 80 dollars, and that's something that's big. So 1.6 million times 80: you end up with a data set that you can download for free which people spent 128 million dollars on to collect. It's like insane, they are giving away 128 million dollars and you can just download it. Good, so that's the end of the lecture. I'm very sorry for the packages that I had to get, it broke my flow and I didn't expect there to be these big packages. But today I told you about gene expression and the questions that you can ask, like which genes are different in which tissues, how are tissues different from each other, how is cancer different from non-cancer cells. I showed you guys that you can use microarrays. I told you about the history of microarrays, a little bit about how they are made, and which steps you have to go through to get your sample on a microarray. I talked about normalization: that there's normalization of scores and normalization of ratios, and that quantile normalization is the most commonly used method nowadays to analyze microarray data, but also that before you can actually do these normalization steps you have to do all kinds of corrections for background, scratches, and nonspecific hybridization. Besides that, we talked about statistical analysis: that if you are doing tens or hundreds of thousands of statistical tests, you run the risk of multiple testing issues, which means that you are calling genes differentially expressed while they're actually not, just because you did so many tests. I told you about gene ontology. The pathway analysis I didn't talk about today, because we already did that when we discussed 
KEGG and Reactome. But gene ontology is one of these tools where, when you do these kinds of fishing expeditions and you have no idea what might come up, it can tell you: all of the genes that you find differentially expressed are located in the mitochondria, so there might be an issue with the mitochondria of the animal that you're looking at. Or, for the experiment where you compared cancer with non-cancer samples: all of this has to do with the ribosome, so it might be that something in the ribosome is interesting to look at when it comes to these types of cancer. I told you about common visualizations like heat maps and dendrograms and how to do them in R, and a little bit about the parameters. Of course, look at the help files: I think that the heat map function has like 15 different parameters and we only discussed three. For dendrograms I told you about hierarchical clustering, and that for each distance matrix that you compute there are three different dendrograms, or three different trees, that you can build, based on whether you do single linkage, complete linkage, or UPGMA. Furthermore, I told you about some historical plots like the MA plot and the volcano plot, and that they are different ways of visualizing hundreds of thousands of genes and showing that there are interesting groups. And I told you where you can get a bunch of free microarray data so that you can do your own analysis. Like the people in the conspiracy theory group say: do your own research. Well, in this case it is definitely possible to do your own research, because like I said there's almost 128 million dollars' worth of freely available data for you to download and look at. Good, so that was my story for you guys for today. "Found aquatic gastropod arrays." Yeah, I told you there would be microarrays, and there's probably not just a single tissue, there are probably multiple tissue arrays as well. So if you're interested in how are these two snails 
different from each other, then you can actually do that. You don't have to be part of a university or spend a lot of money. No, you can just go there, analyze the data that's available, and see if you can find new stuff. Because for all of these experiments that are in there, people generally only look within their own experiment, and then they compare their results to what other people found. But in a lot of cases, just downloading like 50 different experiments that all looked at the same thing can give you a much better overview of what's going on and which genes might be involved, or which genes might be very good therapeutic targets. So a lot of free microarray data, and like, awesome stuff. It's one of these resources that a lot of people don't know about, but once you know about it you keep using it over and over again. Every time that we end up doing our own microarray experiment, the first thing that I do is see if I can find like five or six different mouse strains that other people have already done. Yeah, because we work on the Berlin Fat Mouse, but other people are working on the New Zealand Obese mouse, or people are working on the DBA/2 mouse. And by looking across all of these different experiments you get a much higher level overview, a more bird's eye view of what's happening, compared to focusing only on your experiment with your five samples versus the other five samples. Good, so if there are no more questions, then I'm going to start calling some people to see if I can get rid of the packages that I got. And for now, thank you very much. If you're watching this on YouTube: like, subscribe, favorite, hit the bell icon and that kind of crap. If you are watching this on Twitch, I thank you very, very much, my moderator Misha and Xanaxin, for being here today and enjoying it. Of course, if you're watching it on YouTube you're more than welcome to join next week. Next week we're going to do something fun, because I 
will be talking a little bit about, let me see what the next one is: documents, pptx, informatics. So next week I will be talking about literature management: how do you download papers and use papers, what are citations, why do we use citations, and these kinds of things. That will be mixed with a little bit of standards for analysis, so what kind of different file types do we commonly use. We discussed a whole bunch of them already, like FASTA, but I just want to give you an overview. And then we will have a talk by Aimee, who will talk about her master project. She's a previous student of mine. She's not in Berlin anymore, I think she's somewhere in Bern, Switzerland. But she will talk about her camera trap project and how she uses machine learning to analyze photos from camera traps and to determine what animal walked in front of the trap or walked into the trap. So that's it for me. Thank you very much and see you next time.
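
As a recap of the volcano plot from this lecture, here is a minimal sketch in Python of how the two coordinates and the fold change relate, using the example values read off the plot (M of minus 2 and 4, p-values of 1e-15 and 1e-14). The function names and the cutoffs in `classify` (a p-value below 0.05 and at least a two-fold change) are assumptions for illustration; in a real analysis the p-values would come from a statistical test and would still need a multiple testing correction.

```python
import math


def volcano_point(m, p_value):
    """Return the (x, y) position of a gene: x = log2 ratio, y = -log10(p)."""
    return m, -math.log10(p_value)


def fold_change(m):
    """Convert a log2 ratio into an unsigned fold change."""
    return 2.0 ** abs(m)


def classify(m, p_value, min_abs_m=1.0, max_p=0.05):
    """Hypothetical cutoffs: call a gene only if it is both significant and large."""
    if p_value >= max_p or abs(m) < min_abs_m:
        return "not interesting"
    return "up" if m > 0 else "down"


# Big difference, but a p-value close to one: sits at the bottom of the plot.
print(volcano_point(-2.0, 0.9), classify(-2.0, 0.9))

# The genes coming out of the top of the volcano.
print(fold_change(-2.0), classify(-2.0, 1e-15))  # 4.0 down
print(fold_change(4.0), classify(4.0, 1e-14))    # 16.0 up
```

A p-value close to one gives a minus log10 value close to zero, which is why a gene with a big but very variable difference ends up at the bottom of the plot.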