Welcome back everyone, whether you're watching this on YouTube or live on Twitch. Today: gene expression analysis. This is the overview for today, so let's just jump in. First, the questions, because there are many questions people have when we talk about genes and gene expression. The general questions people want to answer are things like: which genes are differentially expressed between healthy and diseased tissue? For example, in cancer research, one of the main topics is figuring out which genes are upregulated in cancer so that they can be used as therapeutic targets. If you want to know more about these genes, the next question is generally: are the differentially expressed genes part of a certain pathway? Do they all have a certain function? Are these all mitochondrial genes, or all genes involved in ribosome formation, or genes involved in an excretion pathway? Another common question is which genes are expressed in which tissue, because there's a lot of tissue profiling going on where people don't know which tissue they received and want to figure it out from the gene expression. And then there are drug treatments, where you want to know how a certain drug affects its target tissue. So for example, you're interested in liver and you have a new liver medicine. You can do a gene expression study where half of the people get a placebo and half get the real drug, and then you look at the difference: which genes are reacting to the treatment that I'm giving these people? That also gives you some insight into whether your drug is working and whether you're hitting the pathways you want to target.
So these are more or less the general questions that people ask and want answered when they do gene expression. All right, the first way to measure gene expression is real-time polymerase chain reaction. We already had a whole lecture about PCR. You use oligo probes, but instead of looking at the result only at the end, you monitor it every cycle. Normally when you do PCR, you run 30 cycles to amplify, put the product on a gel, and see whether anything got amplified, yes or no. With real-time PCR, at every cycle, and you see the cycles here, you measure the intensity of a certain fluorophore, and that shows you how these genes are expressed. To do PCR we first have to reverse transcribe our messenger RNA into cDNA, because PCR only works on DNA; it's a DNA method. I made a little example graph here. The green curve is the housekeeper gene, because all of the expression values you get are relative. What you do is cycle, define a threshold, and note the cycle at which the intensity hits that threshold. You calculate the mean for the red gene, which is a gene of interest, the mean for the green group, and the mean for the yellow group. Then you look at which product comes up first, because the more product you start with, the quicker it reaches your threshold. So from this little example graph we can say that the red gene is more highly expressed than the housekeeper gene, while the yellow gene is more lowly expressed than the housekeeper gene. That is where this delta Ct comes from. Generally, though, you don't want to know whether a gene is higher or lower than your housekeeper; you want to know whether a gene is differentially expressed.
So you can imagine that if red is gene A, and yellow is also gene A but in a different tissue, then we can look at the relative expression in the one tissue compared to the relative expression in the other tissue, provided the housekeeper has the same mean. This is called the two to the power of minus delta delta Ct method, and with it you can compute the relative expression of a gene in tissue one compared to the same gene in tissue two. You can do this in R, but generally people just use Excel for this, because there are standard Excel files where you fill in your Ct values and the file does the calculation for you. So if you ever have to analyze some real-time PCR data, or real-time qPCR data, just download one of these Excel files, fill in the cycles at which your genes reach the threshold, and the Excel file will do the two to the power of minus delta delta Ct for you. One little thing I want to note is that the two to the power of is there because the amplification is exponential: every cycle we duplicate. So if the red gene comes up one cycle before the housekeeper, it means it was two times higher expressed at the start. If it comes up two cycles earlier, that's two to the power of two, so there was four times the amount of the gene relative to the housekeeper. That's where the two to the power comes from. So this is one of the first ways to measure gene expression. Another way of doing this is using macro or micro arrays. Micro arrays were developed in the 1980s, and the initial arrays that were developed were macro arrays. Macro arrays are glass slides of around nine by 12 centimeters, and on these slides people would put little pieces of DNA.
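Coming back for a moment to the two to the power of minus delta delta Ct calculation described above: here is a minimal sketch of what those Excel files compute. The function names and the Ct values are made up for illustration, and it assumes perfect doubling every cycle, which real reactions only approximate.

```python
def delta_ct(ct_gene: float, ct_housekeeper: float) -> float:
    """Ct of the gene of interest relative to the housekeeper in one tissue."""
    return ct_gene - ct_housekeeper


def relative_expression(ct_gene_t1: float, ct_hk_t1: float,
                        ct_gene_t2: float, ct_hk_t2: float) -> float:
    """Expression in tissue 1 relative to tissue 2 via 2^(-delta delta Ct).

    Assumes the product exactly doubles every cycle.
    """
    ddct = delta_ct(ct_gene_t1, ct_hk_t1) - delta_ct(ct_gene_t2, ct_hk_t2)
    return 2.0 ** (-ddct)


# Hypothetical Ct values: the gene crosses the threshold 2 cycles before the
# housekeeper in tissue 1, and 1 cycle after the housekeeper in tissue 2.
fold_change = relative_expression(ct_gene_t1=22.0, ct_hk_t1=24.0,
                                  ct_gene_t2=25.0, ct_hk_t2=24.0)
print(fold_change)
```

A lower Ct means more starting material, which is why the exponent carries a minus sign: a gene that comes up three cycles earlier (relative to the housekeeper) started with two to the power of three times as much template.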
So a micro array is just a collection of microscopic DNA spots attached to a solid surface. Each spot contains around 10 to the minus 12 moles of a specific DNA sequence, and these are known as probes or oligos. What we do is take our sample, go from RNA to cDNA using reverse transcriptase, and then label the cDNA using either a fluorophore or some chemiluminescence. Nowadays, I think all micro arrays use fluorophores, so they use Cy3 or Cy5 to label the DNA either green or red. The abundance of the original RNA is then determined by hybridizing your sample to the micro array, because the more DNA hybridizes to a certain spot, the brighter it is. So say we have a spot which targets the myostatin gene: the more intensely that spot is shining, the more myostatin RNA was present in the original sample. The micro array workflow looks like this. You have your sample. Of course, you first need to purify it; you need to get the RNA out. Once you have the RNA out of your sample, you do reverse transcription, which means that the mRNA is converted into cDNA. In the next step you label your cDNA using a fluorophore, be it red or green. Once it is labeled, you take one of these glass slides and hybridize your sample, which just means you pipette your sample onto the micro array, which are these very, very small plates. Then you wait a little for the DNA to hybridize to the individual spots, and you wash it clean, so you wash off everything that did not bind. You put it in one of these machines, which have a nice laser, so they scan each of these spots and give you an intensity. The next step is computational, because now you want to look at the intensities and do normalization.
So hybridization is based on complementary sequences pairing up. You have a spot, which is single-stranded DNA. Then you have your cDNA, which is also single-stranded, and the two can hybridize because the strands are complementary. That's how hybridization works. And the more complementary bases there are, the tighter the bond between the two strands. So a probe on the array is similar to a primer, because it needs to be unique to the sequence you're trying to capture. There are issues there as well, because sometimes your sequence is not unique enough and it actually targets, say, two different myostatin messenger RNAs which could be formed. Microarray design is also part of bioinformatics. I've done three or four different microarray designs in the couple of years I've been working here, and the same rules that apply to primer design also apply to oligo design for a microarray. It's just much harder, because instead of having two primers that need to function the same, you're now designing 20,000 or 40,000 oligos, and they all have to follow the same rules. They all need the same hybridization temperature, they are not allowed to form hairpins, and these kinds of things. So look up the primer design lecture if you're interested in how you design a microarray, because the same rules apply. There are two different types of microarrays. We have one-channel arrays, which measure a single sample: you put, for example, human DNA on, and the array shows you the intensity of the different genes.
There are also two-channel arrays. Two-channel arrays are a little different because you get a relative expression, and this is often used in cancer research. When you have normal lung tissue and you have tissue from a lung cancer, you can color the healthy tissue green and the diseased tissue, for example, red. The dyes are named Cy3 for the green color and Cy5 for the red color, and this is what it looks like. This is a photo of a microarray slide; there are nine different microarrays on this little glass plate. On each of these you pipette a disease sample and a healthy sample, and when the microarray is scanned, you get dots. A yellow dot just means that this gene was equally expressed in the one sample as in the other. A red dot means that this gene is highly expressed in the cancer tissue but lowly expressed in normal tissue, and a green dot means the opposite: the gene or sequence you were targeting was highly expressed in normal tissue but very lowly expressed in cancer tissue. So why do we use microarrays? Well, for example, for comparative hybridization. Comparative hybridization is when you compare genomes, so this has nothing to do with genes or gene expression. You just take DNA, chop it up into little pieces, and hybridize the genomic DNA to a microarray. This can be used for things like genome comparison. For example, if I have a microarray which was designed for humans and I take the genome of a monkey, I can chop up the monkey genome, put it on the human array, and see which parts of the human genome also occur in monkeys.
They are mostly used for expression profiling, which is what we're talking about. We take whole RNA, we reverse transcribe it, and we can do this in two different ways. We can run the reverse transcriptase with poly-T primers, which means we only pick up messenger RNA, or we can use random primers. If we use more or less random primers in the reverse transcriptase step, we pick up all RNA, and most of that will be ribosomal RNA, because ribosomal RNA is the most abundant type of RNA in a sample. Microarrays are nowadays also used to genotype SNPs. Then we have two probes on the array: one which targets, for example, the A allele, and the same probe again but now targeting the T allele. When we hybridize our sample to the microarray, if the sample had an A, the T probe will start lighting up, and if the sample had a T, the A probe will start lighting up, because the complementary strand is bound. This is how you can do SNP genotyping. So you can use microarrays to genotype individuals and see, at every position in the genome, whether this animal has an A, a T, a C, or a G. Newer is chromatin immunoprecipitation on a microarray, which is called ChIP-on-chip. This can be used for epigenetic or regulation studies. Here we have DNA or RNA which is bound to a protein. So for example, we're interested in a DNA-binding protein, and we want to figure out where this protein binds. What we can do is make an antibody against the protein, pull out the protein together with the DNA it is bound to, and then instead of using the whole fraction, we only use the fraction we pulled out with our antibody. That means we can look at where this protein was binding in the genome, and we can use microarrays for that as well.
But we'll focus on expression profiling, and mostly on expression profiling of messenger RNA. When we do a microarray, the first step in the bioinformatics pipeline is to create these oligo arrays. For many species, microarrays are available. If I'm working on mouse, human, or pig, there are more or less off-the-shelf microarrays that I can buy. But if you're working on a species which does not have a microarray available, or on a species that has microarrays available but you want to look very carefully into a very small region, you can design a microarray specifically for that. So instead of having probes spread all over the genome, targeting all of the genes, if I'm interested in the mouse genome but only in chromosomes one and two, I can make a custom microarray which targets chromosomes one and two and has many, many more probes there than a standard array. You only have 20,000 or 30,000 or 40,000 spots on the array, and which regions you target is up to you as the bioinformatician. When you create these oligo arrays, the design goes into a TDT file. This is a proprietary file format which describes microarrays. It has the probes listed in it, and these TDT files can be used by microarray spotters to make microarrays. You just put in a little glass plate, and the machine will synthesize all of the little probes you want from the TDT file and print them on the microarray. The next three steps happen in the lab. We acquire our sample, we extract the RNA, and then we do an RNA-to-DNA reverse transcription, and here we have an optional PCR step.
In many cases, you will use a PCR step just to focus on the messenger RNA. Then you do the labeling using either the green or the red color, or, if you do a two-color microarray, you label two samples. The next step is hybridization, and then you scan it. When you scan it, the machine gives you a TIFF file, and this TIFF file is nothing more than an image file, so you can just look at it. These TIFF files generally look like this, and these are the files that you get from the company. That's the raw, raw data, where it's something like four microarrays by 10 microarrays, so there are 40 microarrays on this single glass plate, and each one of those has these spots on it. This TIFF file is then used to call the intensities. The machine scans all of these dots, and the dots are then transformed into a CEL file, at least when you use Affymetrix arrays; Affymetrix arrays are stored in CEL files. The CEL file is again a proprietary format, and it's there to compress things down. These TIFF files are relatively big: this is a file which is generally around 500 megabytes, while the CEL file is only 10 megabytes. If you're running hundreds and hundreds of microarrays, you're not going to store 500 MB per array when you can store more or less the same information in a compressed binary format. From these CEL files, we then extract the expression levels, so the intensities of the different spots on the arrays, and that generally goes into a text format. We do data normalization, again stored in standard text files, then things like clustering, which we also store in text files, and then data interpretation. The data interpretation is generally not a text file; it will be something like an image file, such as a histogram or a heat map. So bioinformatics is involved.
Like I told you, it's involved when you design the oligomers which go on the microarray, and here, again, look at the primer design lecture, because the same rules for primer design hold for the oligo design for microarrays. We want highly specific probes to prevent secondary targets. These probes are not allowed to form hairpins, and since there are 100,000 or more probes on an array, they should all work under the same conditions. So it's a much, much harder task than just designing two primers for a PCR reaction. Bioinformatics is also involved in the image processing software. There was a bioinformatician who wrote the spot identification code, because you need to distinguish one spot from another, and there's an intensity calculation, which is also part of bioinformatics. And the main part of bioinformatics is the statistical analysis of microarray data. So, normalization. We're going to talk a lot about normalization, because it is very important to normalize your microarrays. That is because a lot of variation occurs during sample hybridization and scanning. This has to do with temperature, and not so much the temperature in the machine itself, but the ambient temperature, the ambient air pressure, and things like the humidity of the air. All of these have an effect on how well your DNA binds to the microarray. So if I run a microarray on Monday at one temperature and 80% humidity, the intensity of this array will be completely different from an array that I run two weeks later when the temperature is 25 degrees Celsius and the air humidity is much lower. A lot of variation also comes from the DNA quality, because we take RNA and run reverse transcriptase on it, but RNA is very unstable.
So if I prepare the same sample 10 times, extracting RNA and doing the reverse transcriptase step each time, there will be a lot of variation coming from that as well. Besides that, there's variation which comes from the manufacturing of the microarrays. Not all microarrays are equal. Within a batch, microarrays are generally very similar. But if you buy a microarray in 2017 and you buy the same microarray in 2019, there will be manufacturing differences affecting the performance of the arrays. Comparing and finding groups in the data with statistical methods is also part of the microarray workflow. And then we do clustering, because we want to see similarity in expression profiles. In general, when you do microarray analysis, you're not interested in a single gene. If you were interested in a single gene, you would have measured it using qPCR, which is much cheaper and much faster than doing a whole microarray. Microarrays are generally for when you have no real idea what's going on but you want to know which genes might be involved. It's a fishing expedition. You suspect that some genes will be upregulated, but a priori you have no idea which genes, and that's why you can't do qPCR. You just measure all of the genes in the human genome or in the mouse genome. A lot of things can go wrong when you do microarrays. They are little glass plates, so things happen, like little fibers landing on your microarray. Here you see spots, and these spots are microscopic, so a little hair or a little eyelash that falls onto the array has a massive influence on the probes. It obscures a whole bunch of these little spots. Besides that, because you are pipetting, air bubbles also occur.
So you sometimes see, as here at the edge of this microarray, that there are 20 to 40 probes that did not work at all. This is just because you pipette stuff onto the array, and sometimes a little air bubble comes in. You need to get rid of those, or rather remove those probes from further analysis. There are also things like edge effects. Microarray slides generally look like this: you have four arrays on the y-axis and something like 10 arrays on the x-axis. There is a big issue with the edge effect. There's a lot of spatial bias, which means that close to the edge you see that DNA more or less starts clumping up against the edge of the microarray. Often you can't really use these probes, though sometimes you can recover them. There's also spatial bias from pipetting. If you put your drop exactly in the middle, the drop will spread out across the array, but if you put it on the side, it will flow across the array. So the intensity will be highest in the area where you put your drop, and in the middle of the drop it will be most intense. Besides that, we also have things like background haze, which just means that the hybridization step took too long. Instead of hybridizing to the spots, the DNA starts hybridizing to the glass plate as well, so you get this haze, this big green color across the whole array. Some of these can be fixed. The background haze can be fixed using background correction, and the edge effects, the spatial bias that we have on the array, can be fixed using spatial normalization. There are very dedicated packages in Bioconductor to fix these kinds of issues with your microarray. The fibers and scratches just mean that you lose some of the probes on the array, and the same holds for air bubbles.
It just means that you can't use those probes. So background correction is there to adjust for non-specific hybridization, generally hybridization to the array itself instead of to the spots, and also for hybridization of transcripts whose sequence does not perfectly match the probes on the array. In the old days we had macroarrays, so the spots were visible by eye. You could look around the spot to see how much non-specific hybridization there was. You had the intensity of the spot, and around the spot you would see a little bit of haze. You would say: the spot itself has an intensity of 10,000, the haze around the spot has an intensity of 1,000, so I'm just going to subtract that. But now, because these microarrays are so small and the spots are microscopically small, we have to use negative control spots, which just means that when we make an array for humans, we also put some probes on there for plants, so sequences that do not occur in humans. That allows us to see how much non-specific hybridization there was. We look at a control spot, and if the control spot has a certain intensity, that intensity just gets subtracted from the real spots on the array. If you're using Affymetrix arrays, they actually put every probe on the array two times. They put the real probe on there, targeting the exact sequence, and then they have something which is called a mismatch probe. That is a probe with a sequence almost identical to the sequence you're interested in, but with two or three mismatches in it. So you can compare the real probe, the matching probe, to the mismatch probe and get an idea of how much non-specific hybridization there is. And this allows you to do the background correction.
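The subtraction described above can be sketched in a few lines. This is only an illustration, not what any particular Bioconductor package does: the function name is hypothetical, taking the median of the control spots as the background estimate is an assumption, and so is the floor of 1.0 that keeps corrected values positive for a later log transformation.

```python
import numpy as np


def background_correct(spot_intensities, control_intensities, floor=1.0):
    """Subtract a background estimate taken from negative control spots.

    Assumptions for this sketch: the background is the median of the
    control spots, and corrected values are clipped at `floor`.
    """
    background = np.median(control_intensities)
    corrected = np.asarray(spot_intensities, dtype=float) - background
    return np.maximum(corrected, floor)


# Hypothetical intensities: three real spots and three plant control spots.
spots = [10000.0, 1500.0, 800.0]
controls = [900.0, 1000.0, 1100.0]
corrected = background_correct(spots, controls)
print(corrected)
```

The last spot is dimmer than the controls, so after subtraction it would be negative; clipping it to the floor is one simple way to handle that, and real tools use more careful models.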
Spatial normalization is there to get rid of these edge effects, or other effects which act more or less globally on the array. Here we see a slide which has four arrays on it, and we see that there's a big effect in the middle: in the middle, all the spots have a much lower intensity than the spots at the top of the array. Spatial normalization can be applied when there's a strong dependence between the intensity of the probes and their spatial location, where spatial location means the physical position of the probe on the microarray. Generally, a probe will not be on the microarray just once; they will put it on three or four times. So you have probe A here, probe A here, and probe A here, and when you see a big difference in intensity between these three identical probes, then you know you should spatially normalize. What the normalization does is take the array and start raising the intensity of the probes in the affected area, here in the middle, to make sure that you don't lose a lot of information on your arrays. So spatial normalization is there to get rid of things like edge effects, while background correction is there to deal with non-specific hybridization. All of these things are available in Bioconductor. So if you ever get some microarray data and, looking at the pictures, you think these microarrays look a little bit crap, with some spatial effects and some background effect in there, then there are many, many different algorithms that you can use to pre-process microarray data. For single-channel arrays, people use MAS5 or RMA or GC-RMA. RMA is the robust multi-array average, and GC-RMA also corrects for the GC content of the probe. When you use two-channel arrays, people almost always use LOESS.
LOESS is just a normalization technique to get rid of effects like background and spatial bias and other things. All of these tools are available; you can just download them from Bioconductor. Most of these tools take a CEL file as input, so you just say: I have 10 of these CEL files, now background normalize all of them. After you've done this, so after you've got your data back from the company and done your normalization to get rid of all of these background effects, what you generally end up with is a big matrix. In the columns you have the different samples, and in the rows you have your probes. The samples have some sample annotation saying this is sample one, it's this tissue, and it had this treatment. And then you have all of your probes, so this is a massive matrix, generally containing 100,000 or 200,000 rows. Each of these little cells contains the raw intensity value, telling you that this was the intensity for that spot on the microarray. In two-color microarrays there's still something we need to do. If we have a single-channel microarray, the raw intensity values are the thing we're working with. But if we have a two-color microarray, we have a green and a red color at the same spot, because the DNA which is labeled green and the DNA which is labeled red compete with each other to bind to this spot. The big issue here is the dynamic range. When we look at the red color, Cy5, it has a dynamic range which runs from around 5,000 all the way up to 20,000. This is just the intensity that it can produce when you hit it with a laser. Cy3 is much better. It has a much larger dynamic range.
Its dynamic range runs from around 2,500 all the way up to 40,000. So to deal with the fact that Cy5 has a much narrower dynamic range than Cy3, we generally look at the ratio. The ratio means that we take the mean intensity of channel one, so the red channel, and divide it by the mean intensity of the green channel. When we look at these ratios, they kind of represent the colors: the red color here has eight over one, the green color here has one over eight. So the ratio is a good representation of the color. The big issue is that the steps are not the same size. Start from a ratio of one, which means both DNAs hybridized equally to this probe. If I go from one to two over one, this is a step with a size of one. But if I go the other way, from one to one over two, this is a step of only a half. Going on from one over two to one over four is a step of a quarter, while going from two to four is a step of two. So we want to linearize these step sizes. We want an intensity change towards the red to be equal to an intensity change towards the green. To do that, we take the log two of this ratio, so of the mean intensity of channel one divided by the mean intensity of channel two. Then you see that the log two of two over one is one, while the log two of one over two is minus one. Now stepping to the left or stepping to the right gives you an equal step size: this is plus one, and this is minus one, and the next step is again minus one on the one side and nicely plus one on the other.
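To make the step sizes concrete, here is a small sketch of the log two ratio. The function name and the intensity values are made up for illustration; red and green stand for the mean intensities of the two channels at one spot.

```python
import math


def log2_ratio(red: float, green: float) -> float:
    """Log two of the red over green intensity ratio for one spot."""
    return math.log2(red / green)


# Hypothetical (red, green) mean intensities for five spots.
spots = [(8000.0, 1000.0), (2000.0, 1000.0), (1000.0, 1000.0),
         (1000.0, 2000.0), (1000.0, 8000.0)]

for red, green in spots:
    print(f"ratio = {red / green:6.3f}   log2 ratio = {log2_ratio(red, green):+.1f}")
```

The raw ratios run 8, 2, 1, 0.5, 0.125, so the red side is stretched out and the green side is squeezed between zero and one. The log two ratios are symmetric around zero: an eightfold change is three steps in either direction.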
So it allows you to get rid of this ratio effect. When we use two-color microarrays, remember to always express the intensities this way: you get two intensity values, one for red and one for green, you divide them by each other, and then you take the log two. So why do we do this? The dyes have different intensity ranges, and this is called dye bias. Dye bias is the imbalance between the dynamic ranges of the red and the green channel, because the red dye doesn't work as well as the green one. Not only that, but by log transforming we improve the characteristics of the data distribution. It gives us linear steps: going from zero to one is the same step as going from zero to minus one, which is not the case when we're looking at ratios which are not log transformed. And often a log transformation makes the distribution more normal, more Gaussian. Intensity values are on a massive scale, going from zero to something like 40,000 intensity points. But after taking the log two, we generally see something like a normal distribution, with a mean of around seven and a half, the lowest intensity around zero, and the highest intensity around 15. So it becomes a normal distribution, which allows us to use parametric statistics instead of non-parametric statistics. That is the reason why we do a log two transformation of the ratio. Then we get to the next normalization step. The definition of normalization is the creation of shifted or scaled versions of a statistic or of a value. The intention is that normalized values allow the comparison of different samples or data sets in a way that eliminates certain gross influences.
Because besides the fact that microarrays perform differently under different temperatures and humidities, we also have the simple fact that when I extract RNA from one cell and from another cell, the amount of RNA that I extract will not be exactly identical. If I have 10 units of RNA from sample one and 9.8 units of RNA from sample two, then the whole intensity of array one will be higher than the intensity of array two, just because there is less input material on array two. When we talk about normalization, I want to mention that there are two different types. The first type is the normalization of ratings, which means that you adjust values measured on different scales to a common scale. The intention here is to bring the probability distributions into alignment with each other. You can think about this in terms of a company. A company has a certain amount of profit, which can range from zero to something like 10 billion. A company also has a number of employees, but no company will have 10 billion employees. So when we want to compute a single value of merit for a company, one that combines things like profit, number of employees, and economic impact, then we have to make sure that all of the things we are combining have the same range. Otherwise, the profit would have a much higher impact than the number of employees, just because its range is bigger. Besides that, we also have normalization of scores. Normalization of scores is different: the goal is to take a distribution which is not normal and make it a normal distribution.
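To make the company example concrete, here is a minimal sketch in Python. The lecture does not name a specific rescaling method, so min-max rescaling to the range 0 to 1 is an assumption made here, and the companies and their numbers are hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("cannot rescale a list in which all values are equal")
    return [(value - lowest) / (highest - lowest) for value in values]


# Three hypothetical companies: yearly profit and number of employees.
profit = [0, 2_500_000_000, 10_000_000_000]
employees = [50, 20_000, 100_000]

scaled_profit = min_max_scale(profit)
scaled_employees = min_max_scale(employees)

# Both measures now live on the same 0..1 scale, so averaging them no longer
# lets profit dominate just because its raw numbers are much larger.
merit = [(p + e) / 2 for p, e in zip(scaled_profit, scaled_employees)]
print(merit)
```

Without the rescaling step, a plain average of profit and employees would be almost entirely determined by the profit column.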
So normalization of ratings, if you think about the company, is when you have different things measured and all of these things are on a different scale. Normalization of scores is taking a distribution which is not normal, like the intensities of a microarray, which generally don't follow a normal distribution, and making it normal. That is why we do the log2 transformation: a log2 transformation is a normalization of scores and not a normalization of ratings. So how do we do normalization of microarrays in R? Nowadays, I think almost everyone uses quantile normalization. Quantile normalization is a technique which looks at the individual microarrays, determines what the mean distribution across the arrays should be, and then shifts each microarray towards it, so that in the end every microarray has the exact same mean, the same maximum value and the same minimum value, and everything follows more or less a nice normal distribution, although this normal distribution has a long tail. In this case we normalize the matrix of expression values that we get. We normalize across samples, and the samples should be in the columns. preprocessCore is a library that can do this for us. It's available on Bioconductor, and the function is called normalize.quantiles. So if I have my expression matrix, this big matrix with not the raw intensity values but the log2 transformed intensity values, then I can call normalize.quantiles on it. What will happen then? Here we see the microarrays before quantile normalization, and the function normalizes the values so that every array has the same mean. You have to be aware that this might remove real biological variation, which you can see if you look very closely at the axis here.
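The idea behind quantile normalization can be sketched in a few lines of plain Python. This is only an illustration of the principle and not the preprocessCore implementation: it assumes there are no tied values within an array, and the small expression matrix is made up.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a matrix given as a list of rows.

    Rows are probes and columns are arrays (samples). Ties within an
    array are not treated specially in this sketch.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]

    # Reference distribution: the mean of the k-th smallest value over all arrays.
    sorted_columns = [sorted(column) for column in columns]
    reference = [
        sum(column[k] for column in sorted_columns) / n_cols for k in range(n_rows)
    ]

    normalized = [[0.0] * n_cols for _ in range(n_rows)]
    for j, column in enumerate(columns):
        # Row indices of this array, ordered from lowest to highest value.
        order = sorted(range(n_rows), key=lambda i: column[i])
        for rank, i in enumerate(order):
            normalized[i][j] = reference[rank]
    return normalized


# Hypothetical log2 intensities: 4 probes (rows) measured on 3 arrays (columns).
expr = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.5, 6.0],
    [4.0, 2.0, 8.0],
]

normalized = quantile_normalize(expr)
for row in normalized:
    print(row)
```

After this step every column contains exactly the same set of values, so the arrays share the same mean, minimum and maximum, while the ranking of the probes within each array is unchanged.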
This is data that we generated in our own lab. We have HT and GF, then HT again, and then GF again. HT stands for hypothalamus, a part of the brain. GF stands for gonadal fat, the fat which is near your gonads. We can see here that all of the HT arrays have a higher intensity on average, and all of the gonadal fat arrays have a lower intensity. This is very logical, because there are more genes expressed in brain than in fat: brain is just a much more active tissue. So by doing a quantile normalization across the arrays and saying that every array should have the same mean, we are removing some real biological variation, namely that in brain more genes are active than in gonadal fat. Normally we should take that into account. But if we want to compare across brain and gonadal fat, if we want to see which genes are active in brain and whether these genes are also active in fat, then we cannot do anything else than remove this global biological effect. So normalization is there to get rid of unwanted effects, but normalization can also remove real biological effects. That's something that you have to be aware of. All right. The next step after normalization is the statistical analysis. The goal is to identify genes which are differentially expressed between groups. For example, in our analysis here, we might want to look at which genes are upregulated in brain and which genes are downregulated in brain. We can use many different statistical methods to find differentially expressed genes. If we have a very basic setup with two groups, for example treated brain and untreated brain, then we can just do a t-test.
If our experiment is more complex, for example if we have brain tissue treated at dosage level A and brain tissue treated at a higher dosage level, so if we have more than two groups, then we have to use something like ANOVA or Rank Products. ANOVA is a parametric analysis tool, which means that it assumes that your data follows a normal distribution, while Rank Products is a non-parametric analysis, so it doesn't assume that your data follows a normal distribution. Each statistical test has its own advantages and disadvantages. An ANOVA is much more powerful than a Rank Product because it can assume a normal distribution. But a Rank Product is much less sensitive to outliers. So if there are outliers, you would not want to use an ANOVA, you would want to use a Rank Product. There are literally hundreds of different statistical tests, and people are more or less inventing new ways of doing statistical tests on microarray data every week, so I can't show you all of them. I just want to show you a couple and show you how you do them in R, because you will be doing this for the assignments. The first one is the Student's t-test. I think that everyone knows the Student's t-test. When I was in high school, I had to do Student's t-tests by hand. You have two groups. From the first group you calculate the mean and the standard deviation, and from the second group you calculate the mean and the standard deviation. Then you throw these into the formula and you get a t-statistic. And then you had these little books with tables where you could look up: I had this many samples and I want this level of significance. The table would then show you the critical t-value. If the t-statistic that you computed was higher than the one listed in the table, then you knew that your effect was significant.
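The "by hand" calculation can be sketched as follows. This uses the classic pooled-variance form of the Student's t-test, which is an assumption about which formula is meant (R's t.test uses the Welch variant unless told otherwise), and the expression values are made up.

```python
import math
import statistics


def student_t(group_a, group_b):
    """Return the pooled-variance t-statistic and its degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)

    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error, df


# Hypothetical log2 expression values of one gene in two groups.
treated = [8.0, 9.0, 10.0, 11.0, 12.0]
untreated = [6.0, 7.0, 8.0, 9.0, 10.0]

t_stat, df = student_t(treated, untreated)

# Critical value for a two-sided test at the 5% level with 8 degrees of
# freedom, copied from a standard t table.
CRITICAL_T = 2.306
print(f"t = {t_stat:.2f}, df = {df}, significant: {abs(t_stat) > CRITICAL_T}")
```

Here the group means differ by 2 and the standard error is 1, so the t-statistic is 2 with 8 degrees of freedom, which stays just below the tabulated value of 2.306.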
Nowadays, we don't really do that anymore, but anyway. The t-test was developed by William Sealy Gosset, seen here, who actually always published under the pseudonym Student. That's one of the nice things in science: if you have not written any papers yet, you can publish under any name that you want. You are free to choose your name in science just like you are in music. Puff Daddy was not called Puff Daddy by his parents. He just decided, I want to be a musician and people should call me Puff Daddy. The same thing holds in science, because you can decide what your name is. So before you write your first publication, think really, really hard about how you want to be known in science. William Gosset decided that he wanted to be called Student, so on all of his papers the author name is Student. In R, it's the t.test function, which just does a t-test. You can do many more types of t-tests, but the two most important ones are the single-sided and the two-sided t-test. In a single-sided t-test you have the hypothesis that one of the groups is lower than the other group, and this is a much more powerful test than a two-sided t-test. In a two-sided t-test you just say, I don't know what I expect, I just want to see if group A is different from group B. And of course this is not as strong a question as: I assume group A is lower than group B. So if you have a prior hypothesis, if you're saying, I'm interested in genes which are down-regulated and I don't care about the up-regulated genes, then you can do a single-sided t-test, and that gives you more power compared to a two-sided t-test, where you say, I might want to have up-regulated and down-regulated genes.
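The gain in power from a single-sided test can be illustrated with critical values from a standard t table. This is only a sketch: the t-statistic and degrees of freedom are hypothetical, the critical values are hard-coded for 8 degrees of freedom at the 5% level, and the labels "two.sided", "less" and "greater" simply mirror the choices of the alternative argument of R's t.test.

```python
# Critical t-values for 8 degrees of freedom at the 5% significance level,
# copied from a standard t table.
CRITICAL_TWO_SIDED = 2.306
CRITICAL_ONE_SIDED = 1.860


def is_significant(t_stat, alternative):
    """Compare a t-statistic (df = 8) against the tabulated critical value."""
    if alternative == "two.sided":
        return abs(t_stat) > CRITICAL_TWO_SIDED
    if alternative == "less":      # hypothesis: group A is lower than group B
        return t_stat < -CRITICAL_ONE_SIDED
    if alternative == "greater":   # hypothesis: group A is higher than group B
        return t_stat > CRITICAL_ONE_SIDED
    raise ValueError(f"unknown alternative: {alternative}")


# Hypothetical result for a gene that looks down-regulated in group A.
t_stat = -2.0

for alternative in ("two.sided", "less", "greater"):
    print(alternative, is_significant(t_stat, alternative))
```

The same t-statistic is not significant in the two-sided test but is significant in the single-sided test for "lower". The price is that an effect in the opposite direction can never be detected.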
If you want to do analysis of variance, which is the second way, then you can do this as well. The nice thing is that an ANOVA, or a linear model, allows you to adjust for different covariates. Covariates are factors or groups that affect the gene expression but which are not of direct interest to you. For example, I have 20 mice, 10 of them male and 10 of them female. Of these 10 males, five got treated and five didn't get treated, and the same for the females. So now we have four groups: treated males, untreated males, treated females, and untreated females. A t-test can only compare two groups, so we have to switch to using an ANOVA. And of course, the more groups you make, the more complex these models become. Not only do we have grouping effects, we might also have quantitative differences, like concentration differences. We might have 100 humans, where some of them got a dose of 10 nanograms per liter, others got a dose of 100 nanograms per liter, and others got a dose of a million nanograms per liter. In R, we have to do two things. We first have to create our linear model using the lm function. Once we've created our model, we can get the significance of the individual factors that we put into the model using the anova function. So first we need to build a linear model. A linear model is structured like this. You have your response, and the response that we are interested in in gene expression is of course the expression level of a gene. Then we have to say that the expression level of our gene is determined by some covariates. Covariates are things that we want to compensate for but are not directly interested in, things like sex, age, or weight of the animal. And then we have our predictor.
That is the thing that we are investigating. For example, I could have two different diets, or I could be interested in diseased versus healthy, or in treated versus non-treated, or in different treatment concentrations. So the linear model we build says that our response is determined by the covariates, then by the predictor, and then there is an error term, because of course our model will not fit the data 100%. Using R, if I have the expression of a single gene, I can model this and say, I have a factor called sex and I have a factor called diet. And I'm saying to R, I'm interested in the diet and not in the sex, because the sex is something that I know will cause differences, but I'm not really interested in the sex differences. Here I'm interested in whether my diet made a big difference in the expression of my gene. I save this in my model, and when I want to get the significance of the sex effect and of the diet effect, I can just do an ANOVA of my model, and this will give me the p-values. I built you a little example in R. Here we're looking at three different strains of mice: the standard laboratory strain B6, our Berlin Fat Mouse (BFMI), and an F1, which is a cross between a B6 mother and a BFMI father. We measure gene expression in two different tissues, so we can't use a standard t-test, because a standard t-test can only compare two things, and we have six: B6 tissue one, B6 tissue two, BFMI tissue one, BFMI tissue two, F1 tissue one, and F1 tissue two. So here we can create a linear model saying that the expression of my gene is determined by the tissue. I'm not really interested in the tissue, because I don't want to know where brain is different from fat.
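What such an ANOVA table contains can be sketched numerically for a tissue plus strain design. This is a plain Python stand-in and not the lecture's R code: it assumes a balanced design with two replicates per group, all expression values are made up, and it stops at the F-statistics because the standard library has no F distribution to turn them into p-values.

```python
from statistics import mean

# Hypothetical log2 expression of one gene: (tissue, strain, expression).
samples = [
    ("brain", "B6", 10.1), ("brain", "B6", 9.9),
    ("brain", "BFMI", 10.2), ("brain", "BFMI", 9.8),
    ("brain", "F1", 10.0), ("brain", "F1", 10.0),
    ("fat", "B6", 8.1), ("fat", "B6", 7.9),
    ("fat", "BFMI", 8.0), ("fat", "BFMI", 8.0),
    ("fat", "F1", 8.2), ("fat", "F1", 7.8),
]


def factor_sum_of_squares(samples, index, grand_mean):
    """Between-level sum of squares and degrees of freedom for one factor."""
    levels = sorted({sample[index] for sample in samples})
    ss = 0.0
    for level in levels:
        values = [sample[2] for sample in samples if sample[index] == level]
        ss += len(values) * (mean(values) - grand_mean) ** 2
    return ss, len(levels) - 1


def anova_table(samples):
    """Return {term: (df, sum of squares, F)} for expression ~ tissue + strain."""
    values = [sample[2] for sample in samples]
    grand_mean = mean(values)
    ss_total = sum((value - grand_mean) ** 2 for value in values)
    ss_tissue, df_tissue = factor_sum_of_squares(samples, 0, grand_mean)
    ss_strain, df_strain = factor_sum_of_squares(samples, 1, grand_mean)

    # In a balanced design the two factors are orthogonal, so the residual
    # sum of squares of the additive model is simply what is left over.
    ss_resid = ss_total - ss_tissue - ss_strain
    df_resid = len(values) - 1 - df_tissue - df_strain
    ms_resid = ss_resid / df_resid
    return {
        "tissue": (df_tissue, ss_tissue, (ss_tissue / df_tissue) / ms_resid),
        "strain": (df_strain, ss_strain, (ss_strain / df_strain) / ms_resid),
        "residuals": (df_resid, ss_resid, None),
    }


table = anova_table(samples)
for term, (df, ss, f_value) in table.items():
    print(term, df, round(ss, 3), f_value if f_value is None else round(f_value, 2))
```

In this made-up data set the tissue term has a very large F-statistic, while the strain term has an F-statistic close to zero, so once tissue is accounted for there is nothing left for strain to explain.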
I want to know what is different between my mouse strains. So I'm saying that my expression is determined by tissue, which is more or less the covariate here, and then I look at the strain to see if there's a strain effect. When I do the ANOVA of this model, and I just took one of these ANOVA tables, I can see that in this case the expression of my gene was significantly affected by the tissue, but it was not different between the strains. So after correcting for whether I'm looking in brain or in fat, there is no difference anymore between the B6, the BFMI, and the F1. This is what it looks like. Here you see the degrees of freedom, which has to do with how many parameters each factor takes, and here you see the residuals, which is kind of a measure of how much power you have. Good, then we take a short break. If you're watching this on YouTube, I will see you in the next video, and if you're watching this on Twitch, then just stick around, and after the break I will be right back.