Welcome back, everyone, including the people who are watching this on Moodle. We will talk a little bit more about different visualizations and how you can visualize data. Visualizing data usually depends on clustering data and grouping things together. Here we have the data set that you will be working on today, tomorrow, or the day after tomorrow, depending on when you do the assignments. In this data set we have measured two different tissues. Here we see HT, which stands for hypothalamus, which is part of the brain, and GF, which stands for gonadal fat. What we did is we ran eight microarrays on hypothalamus tissue and eight microarrays on gonadal fat. The numbers behind the tissue labels are animal IDs, which you can also link to different strains, for example a Berlin Fat Mouse or a B6N mouse.
The first thing that you can see in this heat map is of course the middle line, the diagonal, and that there's high correlation between all of the samples. What we did is we just took this matrix of gene expression values and calculated the correlation from sample to sample. What you then see is that the hypothalamus samples are very similar to each other. You can also see that from the tree at the top. And you can see that the gonadal fat samples are also very similar to each other, but less similar than the hypothalamus samples are. This already gives us a pointer that the tissue we should be looking at is not so much the hypothalamus. It's probably not something that is active in the brain, but something which is active in fat tissue, because in fat tissue we see more structure and more differences in the grouping within the tissue. But it's just a basic correlation, and this is one of the things that I always do to check if the data is correct.
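A minimal sketch of this sample-to-sample correlation check, in plain Python. The sample names and expression values below are made up for illustration and are not the course data set; the `HT_`/`GF_` prefix convention and the helper names are assumptions as well.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(samples):
    """Correlate every sample with every other sample (columns of the expression matrix)."""
    names = list(samples)
    return {a: {b: pearson(samples[a], samples[b]) for b in names} for a in names}

def most_similar(corr, name):
    """Name of the other sample with the highest correlation to `name`."""
    others = {k: v for k, v in corr[name].items() if k != name}
    return max(others, key=others.get)

# Hypothetical toy data: 5 genes measured in 2 hypothalamus and 2 gonadal fat samples.
samples = {
    "HT_1": [8.1, 2.0, 5.5, 9.0, 1.2],
    "HT_2": [8.0, 2.2, 5.4, 9.1, 1.0],
    "GF_1": [2.5, 7.9, 5.0, 1.1, 8.3],
    "GF_2": [2.9, 7.5, 5.6, 1.5, 8.0],
}
corr = correlation_matrix(samples)

# Quality check: a sample whose nearest neighbour comes from the other tissue
# is a candidate for a swapped array.
suspicious = [s for s in samples if most_similar(corr, s)[:2] != s[:2]]
print(suspicious)
```

With real data the same idea applies to all sixteen arrays: every hypothalamus sample should have another hypothalamus sample as its nearest neighbour, and likewise for gonadal fat.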
If one of the hypothalamus samples clustered together with the gonadal fat samples, then you would know that one of the microarrays was swapped. And that happens, right? Because you pipette your samples into a 384-well plate, and of course, when you pipette 384 samples, people make mistakes. So you are basically checking the assumption that hypothalamus should be more similar to hypothalamus and gonadal fat should be more similar to gonadal fat. That already gives you a first, more or less, quality check on your data, to make sure that what we are looking at is the data that we sent in.
Heat maps and dendrograms are closely coupled. As you can see, when you make a heat map in R, it usually comes with a dendrogram showing you the distance between the different samples or between the different probes. This one is across samples, but you could also do the correlation clustering across the different probes. Dendrograms are nice in that they allow you to see different groups within your data. There are many different types of dendrogram, but generally you see that there's a branch of the dendrogram which clusters together, and the things on that branch are similar to each other. And of course, the higher the bar between two things, the bigger the distance between them, so the bigger the difference.
Dendrograms are always made on a similarity or a difference between expression profiles. So you need to define a mathematical metric which expresses what you mean by difference. There are three different distance measurements here, of which the Minkowski distance is just a generalization of the other two. The most basic distance measurement that you can come up with is the Manhattan distance, where you just say: I'm going through all of my genes, because I want to know the distance between the different samples.
So what I do is I go through all of my genes, so one to P, and I look at the expression of the gene in the first sample, subtract the expression in the second sample, take the absolute value, and sum all of this up. So if I have two samples and I want to know how similar they are, or how different they are, which is of course the inverse of similarity, then I just go through each of my measurements. In a microarray, these are genes. You take the expression of the first gene in the first sample minus the expression of the first gene in the second sample, you take the absolute difference, and then you sum that up over all of the different genes. If you want to visualize that: because you're always taking absolute differences, you get a step-wise way of going through your data. So for every gene, the distance becomes bigger or smaller. No, the distance always becomes bigger or stays the same, right? Because you take the absolute delta and then you sum those all up.
Another way of doing distance is the Euclidean distance, which is based on having a straight line between the two points, whereas the Manhattan distance is usually used when you have whole numbers like integers, where you would say that I have two minus one or one minus two. The Euclidean distance is based on how you would measure distance on a map. You do the same thing again: I take gene one in sample one, I subtract gene one in sample two, I square the difference, and then I sum all of these up. So I take the sum of squares of the differences, and in the end I take the square root of it. This is similar to going from point A to point B in more or less a straight line.
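The two distance measurements described above can be sketched in a few lines of Python. This is an illustration only; the function names and the toy profiles are assumptions and are not taken from the slides.

```python
from math import sqrt

def manhattan(x, y):
    """Sum of absolute per-gene differences between two expression profiles."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Square root of the sum of squared per-gene differences (straight-line distance)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical profiles for two samples, three genes each.
sample_a = [3, 0, 4]
sample_b = [0, 0, 0]

print(manhattan(sample_a, sample_b))  # 3 + 0 + 4 = 7 (walking along the grid)
print(euclidean(sample_a, sample_b))  # sqrt(9 + 0 + 16) = 5.0 (the straight line)
```

Neither function needs to know which sample comes first: the absolute value in one case and the square in the other make every per-gene contribution non-negative.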
Then the Minkowski distance is a generalization of this, because here you can see that this formula has the same shape as the second formula. In the Euclidean formula I take the difference to the power of two, and in the end I take the square root, which is the same as raising to the power of one divided by two. In the Manhattan formula this is all one: X minus Y to the power of one, and then you take the result to the power of one divided by one, which of course doesn't change it. The Minkowski distance generalizes both: you take the absolute value of X minus Y to the power of M, you sum all of these up, and then you take the M-th root of that sum. M in this case is a positive integer, P is the number of probes, D is the distance between profiles X and Y, and X and Y are the two expression profiles, so two columns in your matrix of gene measurements.
So these are three different distance measurements. They each have a different way of looking at the data, and all of them have their own advantages and drawbacks. The Manhattan distance is much better when you look at single nucleotide polymorphisms, the Euclidean distance is probably very suitable for gene expression differences, and the Minkowski distance is very good at exaggerating large differences relative to small ones.
As a little example, I have two profiles, sample one and sample two, with gene one, gene two, gene three, gene four, and gene five. The Manhattan distance is nothing more than taking the difference between these two and summing it up. So you have one to zero, which is plus one, zero to zero is zero, one to one is zero, zero to one is also one because of the absolute value, and then zero to zero, which is again zero.
So the Manhattan distance between these two profiles is two. If you take the Euclidean distance instead, then you take the square root of one to the power of two, plus zero to the power of two, and so on, which gives the square root of two. So this gives you a slightly different distance measurement, and the Minkowski distance can put more or less weight on large differences. But each one is just a different way of defining mathematically how similar or how different two things are.
If you're dealing with whole numbers, then generally you would prefer the Manhattan distance. If you're dealing with floating point numbers, then you would generally go for the Euclidean distance, unless you are only interested in differences which are very big. In that case you would switch to the Minkowski distance and scale using the power of three or the power of four, depending on how much weight you want to give to a difference of one versus a difference of two: one to the power of two versus two to the power of two, or one to the power of four versus two to the power of four. So you can scale your distance measurement.
Of course, these distance measurements are all closely related. If you have two profiles which are very similar based on the Manhattan distance, they will also be very similar based on the Euclidean distance, and they will also be very similar based on a Minkowski distance. That said, the Minkowski distance can change the ordering of similarity. But Manhattan and Euclidean distances are used a lot.
Why is it minus one? Because in this case, if you look at the formula, there's no absolute value; the absolute value is taken care of by the square. If you take a negative number to the power of two, it's always going to be a positive number.
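A sketch of the Minkowski distance applied to the five-gene example from the lecture (sample one = 1, 0, 1, 0, 0 and sample two = 0, 0, 1, 1, 0). The function name is an assumption, and the second set of profiles, used to show how the power can change the ordering, is made up for illustration.

```python
def minkowski(x, y, m):
    """Minkowski distance: (sum of |x_i - y_i|^m) to the power 1/m.

    m = 1 gives the Manhattan distance, m = 2 the Euclidean distance.
    """
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1 / m)

# The two profiles from the worked example (genes one to five).
sample_1 = [1, 0, 1, 0, 0]
sample_2 = [0, 0, 1, 1, 0]

manhattan_d = minkowski(sample_1, sample_2, 1)  # 1 + 0 + 0 + 1 + 0 = 2
euclidean_d = minkowski(sample_1, sample_2, 2)  # square root of 2

# Hypothetical profiles showing that the power can change the ordering:
# one large difference versus several small ones.
reference = [0, 0, 0, 0]
one_big = [2, 0, 0, 0]
many_small = [1, 1, 1, 0]

for m in (1, 2, 4):
    print(m, minkowski(reference, one_big, m), minkowski(reference, many_small, m))
```

With m = 1 the profile with one big difference is closer to the reference (2 versus 3), while with m = 2 or higher the profile with many small differences is closer, because squaring gives the single large difference more weight.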
In the Minkowski distance, you see that the absolute value comes back in, because if you take something to the power of three, a negative number stays negative. So the minus one here is just based on the fact that the Euclidean distance doesn't need the absolute sign: each term is always going to be positive because you raise it to the power of two, whereas with the Minkowski distance it would not be. And since we're looking at similarities or dissimilarities, we don't want one gene which is more highly expressed in the first sample and another gene which is more lowly expressed in that sample to cancel each other out. So for the Euclidean distance no absolute value is necessary, making it computationally a little bit more efficient. Is that clear? Why? Because minus one to the power of two is still one. Good, all right.
So now you have a way of defining what the difference or the similarity is between two profiles, and you can do it for multiple profiles at the same time to build your dendrogram. Something is always similar to itself, meaning it has a distance of zero. And, for example, sample one compared to sample two has a distance of one. Then when you do the same distance computation from sample one to sample three, you figure out that you have a distance of five. So you have the elements of this little matrix here: if you look at the i-th row and the j-th column, that is the distance between profile i and profile j. All elements are positive or zero, and that is because we only count differences, so two things cannot cancel each other out. A distance of zero means that two gene expression profiles are identical, which is good to know, because sometimes you make a pipetting error and you have the same sample on your array twice.
Then of course these two samples will have a very, very small distance, because the distance will only be based on technical variation and not on biological variation. The elements on the diagonal are always zero, because there you're comparing an element to itself. And symmetry means that the element at (i, j) equals the element at (j, i): if one to two has a distance of one, then two to one also has a distance of one. So when you write down these distance matrices for hierarchical clustering, you generally ignore the lower triangle or the upper triangle.
The dimension of this matrix is n times n, n being either the number of genes or the number of samples, depending on what you are clustering. If I'm clustering genes together, so if I want to say that gene one and gene two are very similar, then I'm calculating the distance based on the individuals. And if I want to say that two individuals are very similar, then I'm calculating the distance based on the individual genes that I have measured. But the dimension is always n by n, n being the number of genes if you're interested in clustering genes, or the number of individuals if you want to cluster individuals.
So how do I now make a dendrogram? The start of the procedure is very similar to what we saw when we did ClustalW. We search for the smallest element in the distance matrix, and in the matrix that we had, we have a distance of one between profiles one and two. So, similar to what ClustalW is doing, we form a cluster from profiles one and two, and then we have to calculate a new distance matrix. The distance matrix now becomes smaller, because profiles one and two, which had a distance of one, are now a single cluster. The distance from one to three used to be five, but now that we have clustered these two together, we end up with a distance of four.
So this cluster now has a distance of four to number three, a distance of eight to number four, and a distance of seven to number five. So here we just take the smallest distance. The way that you calculate this new distance matrix can be done in three different ways, and we will discuss all three.
First, you can calculate the distance between clusters based on single linkage. Single linkage means that if you compare a cluster to another sample, the distance between the cluster and the new sample is defined as the distance to the most similar object in the cluster. So when we go back here, the question is: is this based on the two which are most similar? Is this single linkage? Then we can see that, yes, here we are using single linkage. Why? Because the distance of number two to number three used to be four, and the distance of number two to number four used to be eight. You see that the distances of the cluster are the same as the distances of number two, and number two used to be the most similar to three, because one is further away. So here we are dealing with single linkage.
Complete linkage is when you do the opposite: the distance between the clusters is computed as the distance between the two most dissimilar elements. If we look again at our little example, how would that have looked? In this case we know that two is closest to three, four, and five. If we had used complete linkage, we would have had the numbers five, nine, and eight here. So with single linkage we look at two, the element in the cluster which is closest to the new element we are comparing to: if we take the most similar one, we talk about single linkage.
When we take the most dissimilar one, we talk about complete linkage. So these are two ways that we can build the tree. The third way is average linkage. Here the distance between two clusters, or between a cluster and an element, is taken as the average of all distances between the members of cluster X and the members of cluster Y. That is the mean distance between the elements of the two clusters. If we had done that with this little example, then this number would be four and a half, this one would be eight and a half, and this one would be seven and a half.
You see that in the end it often doesn't really matter which of the three you take, because the ordering will be very similar. Not always, especially not when you start building clusters with three, four, or five elements, because then things might start to change higher up in the tree. But when you go from having no clusters to having one cluster in your data, then no matter which linkage method you use, the two things which are similar to each other will also have more or less similar distances to the other elements. You can, however, recognize which linkage someone was using by looking at the numbers which occur after the first clustering step. Single linkage, complete linkage, and average linkage again have their own advantages and disadvantages when you're building a dendrogram or a tree, or clustering your data together. But remember that you can use three different methods, and these three methods give you slightly different trees.
Of course, hierarchical clustering is an iterative process, just like the ClustalW process is: you search for the smallest element in the new distance matrix, and you then form a cluster from these elements.
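The three linkage rules can be sketched as follows. The pairwise distances are reconstructed from the numbers quoted in the lecture for the cluster of profiles one and two; the function name and the dictionary layout are assumptions made for this illustration.

```python
def cluster_distance(cluster_a, cluster_b, dist, linkage="single"):
    """Distance between two clusters, given pairwise distances between their members."""
    pairs = [dist[frozenset((a, b))] for a in cluster_a for b in cluster_b]
    if linkage == "single":    # most similar pair
        return min(pairs)
    if linkage == "complete":  # most dissimilar pair
        return max(pairs)
    if linkage == "average":   # mean over all pairs
        return sum(pairs) / len(pairs)
    raise ValueError(f"unknown linkage: {linkage}")

# Distances from profiles 1 and 2 to profiles 3, 4 and 5, as quoted in the lecture.
dist = {
    frozenset((1, 3)): 5, frozenset((2, 3)): 4,
    frozenset((1, 4)): 9, frozenset((2, 4)): 8,
    frozenset((1, 5)): 8, frozenset((2, 5)): 7,
}

merged = (1, 2)  # the first cluster formed in the example
for linkage in ("single", "complete", "average"):
    row = [cluster_distance(merged, (other,), dist, linkage) for other in (3, 4, 5)]
    print(linkage, row)
```

The three printed rows reproduce the values from the example: 4, 8, 7 for single linkage, 5, 9, 8 for complete linkage, and 4.5, 8.5, 7.5 for average linkage. This is also how you can tell which linkage was used from the first updated matrix.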
In this case, the two which have the smallest distance are four and five. So I group four and five together, and then I calculate the new distance matrix using the same linkage, because you can't switch your linkage in between. You cannot say: I used to do it based on single linkage, but now I'm going for complete linkage. Then you repeat until everything is merged into one big cluster. So first you say one is most similar to two, in the next step four is most similar to five, and the next step means putting three in the cluster of four and five, because for three that distance is the lowest. In the end you merge everything together, and then you are done, because you're at the root of your tree.
So how does this look? When you do this using single linkage, you see that one and two are closest together. If you do it based on complete linkage, you see that the structure is the same, but the distances, so the y-axis of the clustering, differ. With single linkage things will be closer together, because you're looking at the most similar elements. With complete linkage you get a more spread-out scale, because you're looking at the most dissimilar elements. And that is how you build dendrograms. It's just an iterative process. You can program it yourself, but of course R has this built in: if you use hclust, you can specify which of the three linkage methods you want to use.
Historical visualizations. There's a C missing on the slide here; it should read historical visualizations. When we talk about microarrays, there are a couple of historical visualizations that people used to show, or still show, which give you an idea of how good your data is and what is going on in your data. So I wanted to show you two of these.
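The iterative clustering procedure described earlier (find the smallest distance, merge, recompute, repeat until one cluster is left) can be sketched in plain Python. This is a teaching sketch, not how R's `hclust` is implemented. The distances involving profiles one and two come from the lecture example; the distances among profiles three, four, and five are not stated in the lecture and are assumed here so that the merge order matches the one described.

```python
def hierarchical_clustering(items, dist, linkage="single"):
    """Agglomerative clustering; returns the merges as (cluster_a, cluster_b, height)."""
    combine = {
        "single": min,
        "complete": max,
        "average": lambda values: sum(values) / len(values),
    }[linkage]
    clusters = [(item,) for item in items]
    merges = []
    while len(clusters) > 1:
        best = None
        # Search all pairs of current clusters for the smallest linkage distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = combine([dist[frozenset((a, b))]
                             for a in clusters[i] for b in clusters[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

pairwise = {
    (1, 2): 1, (1, 3): 5, (2, 3): 4, (1, 4): 9, (2, 4): 8, (1, 5): 8, (2, 5): 7,
    (3, 4): 3, (3, 5): 4, (4, 5): 2,  # assumed values, not given in the lecture
}
dist = {frozenset(pair): d for pair, d in pairwise.items()}

single = hierarchical_clustering([1, 2, 3, 4, 5], dist, "single")
complete = hierarchical_clustering([1, 2, 3, 4, 5], dist, "complete")
for step in single:
    print(step)
```

With these numbers both linkages merge in the same order (one with two, four with five, three into the cluster of four and five, then everything), but the merge heights are 1, 2, 3, 4 for single linkage and 1, 2, 4, 9 for complete linkage, which is the more spread-out scale mentioned above. For these three linkage rules, recomputing from the original pairwise distances gives the same result as updating the distance matrix after every merge.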
One of these is the MA plot, which is an application of the Bland-Altman plot, but you can forget about that; that's not that interesting. The MA plot is made by showing A on the x-axis, where A stands for the mean value of a certain probe on the array, and showing M on the y-axis, which is the log ratio of a gene between two groups. It's the same with volcano plots: we're always comparing two groups. So we have group one, which is for example the fat mouse, and group two, which is the reference mouse. We look at the mean value of the probe versus the log ratio between the two groups. So we calculate the ratio between the two groups, take the log of it, and plot that on the other axis. The volcano plot does the same thing, but instead of showing the mean value on the x-axis, it shows the log ratio on the x-axis, and it shows the LOD score, so the minus log 10 of the p-value, on the y-axis. These plots can be used to see how well your microarray experiment went and where the significant genes are.
So here you see an MA plot, which is something people used to do a lot. All of these dots are genes, so the expression of a gene across a group of samples. What we see here is that there's a whole bunch of genes which have more or less no expression, because A is the mean expression. So here we see genes which are not expressed, and here we see genes which are highly expressed. And then on the y-axis, we see the log ratio of the gene expression between the groups that we're looking at.
In this case we would say that the interesting genes are the ones which have a relatively high expression in our samples, but which are also very different between the two groups that we are looking at. So the red dots here are the most interesting genes in our experiment. Why? Because they are highly expressed across all the samples and they show a big difference between the two groups that we are interested in, for example fat mouse versus lean mouse, or cancer tissue versus normal tissue.
What you can visualize in the MA plot, you can also visualize using a volcano plot. The volcano plot is a relatively new way of showing gene expression data. Here you see M, so the log two of the ratio, and here you see the minus log 10 of the p-value. So we compute a log two ratio and then we do a statistical test. The reason why it's better to show a volcano plot than an MA plot is that the MA plot only looks at the difference between the groups and does not consider the standard deviation within the groups. If you do a statistical test and calculate a p-value, then the p-value is based on the difference in the averages, but it also accounts for the standard deviations within the groups. If two groups have a difference of 10 and both groups have a standard deviation of one, then the difference is very significant. But if both groups had a standard deviation of 50, then of course this difference of 10 would not be as significant.
So the colored dots are the genes which are very interesting, because they are significantly different and they also show a big difference between the groups. The green dots are down-regulated in the sample of interest and the red dots are up-regulated. Of course, down- and up-regulation just depends on what you define as your reference group.
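The quantities on the axes of these two plots can be sketched in Python. This is an illustration under stated assumptions: M and A are computed here from the mean intensities of the two groups, the t statistic is Welch's two-sample form, and the group size of eight and all numeric inputs are hypothetical. Turning the t statistic into a p-value needs the t distribution, which is left out of this sketch.

```python
from math import log2, log10, sqrt

def ma_values(mean_group1, mean_group2):
    """M (log2 ratio) and A (mean log2 intensity) for one probe."""
    m = log2(mean_group1) - log2(mean_group2)
    a = 0.5 * (log2(mean_group1) + log2(mean_group2))
    return m, a

def welch_t(diff, sd1, sd2, n1, n2):
    """Difference in group means scaled by the within-group variation."""
    return diff / sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

def volcano_y(p_value):
    """Y coordinate of the volcano plot: minus log 10 of the p-value."""
    return -log10(p_value)

# A probe that is four times higher in group one than in group two.
m, a = ma_values(400.0, 100.0)
print(m, a)

# The example from the lecture: the same difference of 10, eight samples per group
# (group size assumed), with a standard deviation of 1 versus 50.
t_tight = welch_t(10, 1, 1, 8, 8)
t_noisy = welch_t(10, 50, 50, 8, 8)
print(t_tight, t_noisy)

print(volcano_y(0.001))
```

The same difference of 10 gives a test statistic of about 20 when the groups are tight and about 0.4 when they are noisy. The MA plot cannot show that contrast because M ignores the spread within the groups, while the volcano plot captures it through the p-value.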
So the colored genes here are the genes of interest. In the MA plot you see the same thing, but the problem there is that these genes, although they are different between the two groups, might not be significantly different, because the difference is not scaled by the variation within the groups.
All right, so we've been talking a lot about microarray data, and I think that everyone wants to do something with it. I made the assignments five or six years ago, but when you do them and you say, well, I really want to have some free microarray data to play around with, then there's a lot of free microarray data available. For example, the Gene Expression Omnibus (GEO), which is run by the NCBI, stores around 25,000 experiments. It has around 600,000 free microarrays that you can download. It only provides storage and retrieval, so you can't do the analysis online. But using R, for example, you could say: I'm interested in the gene expression in a certain type of fish, or in human, or in cancer tissue from lung cancer, testicular cancer, or brain cancer, and I want to compare these things. Then the Gene Expression Omnibus has got your back, because they have 600,000 arrays that you can download, and they are free. So you can just download them, use them, and write a really nice Nature or Science publication if you find something which is, for example, shared between brain cancer and lung cancer. If two cancer types have a common gene which always shows differential expression compared to non-affected tissue, then this might be a very novel finding. So: 600,000 arrays, free for downloading at the Gene Expression Omnibus. The problem here is that they only provide storage and retrieval and they don't really do quality control.
So anyone can upload data to it and they will just store it for you, and that's perfectly fine. On the other hand, you have ArrayExpress from the EBI, the European Bioinformatics Institute, and they have more arrays. They collect the same kind of data, so you can upload your experiment there, and they have something like 700,000 arrays available for you. But a certain part of their database is called the Gene Expression Atlas, and the Gene Expression Atlas is a curated, re-annotated data archive. That means that someone looked at the data and said: yes, these arrays make sense, and they were really done on mice, because you can indeed see that in the expression. But they have far fewer arrays which have been curated. So these are the high-quality data. They also have what you could call lower-quality data, which is just what people uploaded. They provide storage and retrieval just like the NCBI, but they also have an online analysis platform. So you don't have to download all the data, because sometimes this data can run into the gigabytes, and before you download 60 gigabytes of data you might want to know whether it makes sense to start analyzing it. So you can run an analysis online across different conditions or across different experiments to see if downloading this data would make sense for your experiment.
If you look at GEO, the Gene Expression Omnibus, then, hey, it's just a website with a search box. If you search for mouse, you will get all of the arrays that were run on mouse samples, and of course there are different array types. You have Illumina and Affymetrix arrays, and all of these arrays have been upgraded through the years. They started off with very old arrays from around 2000, which only measure about 10,000 genes, and the newer arrays have 200,000 to a million probes on them.
So they are much more informative and they look at the genome in much more detail. ArrayExpress is very similar and looks like this. Again, just a basic search box: you can search, and of course you can filter and so on to find the data that you need. And don't forget that there are real publications in here. Someone I know from conferences wrote a really, really nice Nature paper based purely on free microarray data. The only thing that it cost him was his own time. So if you are really interested in things like cancer, or in the genetic differences between different types of tissues or animals, then you can get your hands on a lot of free data. And it is not useless data, because reanalyzing data and combining different experiments can lead to really high-impact publications in the end.
But this was it for today. So the summary: what we did is we talked about experimental design.
Can you send a link to the paper? Yeah, yeah, of course. I can see if I can find it. Well, let me first do the outro, then I can stop the recording and we can just continue chatting about SARS-CoV-2 and doing stuff in R and other things.
So we talked about experimental design and the questions that it allows you to answer. We talked a little bit about microarrays, how they function, and which steps there are in the wet lab processing of microarrays, like hybridization, washing, putting them in the machine, and scanning them. We talked about the bioinformatics part of the analysis, so in which steps of microarray design and analysis bioinformaticians are involved. We talked about normalization, and that there are two different types of normalization. For statistical analysis, we talked a little bit about gene expression profiling and multiple testing.
I told you about gene ontology and pathway analysis, and of course a little bit about visualizations: you can use things like heat maps and dendrograms, and these older plots like the MA plot and the volcano plot, to present your data in one go without having to give people a matrix with hundreds of thousands of values. And of course, when we talk about dendrograms, we always have to talk about hierarchical clustering and how that works. So that was it for today. If there are any questions, then feel free.
How many lessons are left till the exam? All right, those are the important questions. Next week we will be talking about standards for analysis. That's about the different file formats and file structures which are standard nowadays. We will talk a little bit about MIAME and these kinds of concepts. After that, we have a lecture about literature management. So that's the second lecture. And then I have a, oh, no, wait, sorry, I'm looking at the assignments. Not all upcoming lectures still have assignments. So yeah: standards for analysis, literature management, and then we will have the overview lecture.
And I might squeeze in one more lecture, since we still have one or two weeks before you should be doing the exam. I might have a friend of mine talk a little bit about the work that he is doing on closed ecology management. His goal is to make a closed ecosystem to support things like space flight. So he's making these ecosystems which have snails and which are self-contained, so you can take one with you on a rocket, and when you fly to Mars you won't starve halfway through. That's the kind of thing that he is interested in. He works a lot with these little water snails. So I asked him if he could make a presentation.
He also uses things like little Raspberry Pis or Arduinos to measure this stuff, and he then brings the data into R to automatically make graphs and to check what these things are doing. So I thought that would be a nice addition. I don't know if he has the time to finish in time. I will contact him again, put a little bit more pressure on him, and tell him that you guys are really interested in something like that. But that won't be a full three-hour lecture, and it also won't be part of the exam. It's just so that you can see that it's not all statistics and sitting behind a computer. There are also a lot of things that you can do with a soldering iron, an Arduino, and a tank filled with fish, snails, plants, and these kinds of things. So it's more applied bioinformatics. It's really sitting down and using your 3D printers.
Yeah, so we have two lectures left and the summary, and in the middle we might have an additional lecture from Misha, to talk about his work and how he's building his things. So that's it. Are there any other questions, remarks, other things? Perfect, thanks. Yeah, thank you for attending. Good, so if there are no more questions then I will stop the recording.