Welcome, everyone, also on Moodle, if you're watching this later. Today we will do gene expression analysis: the analysis of genes and their expression, and figuring out how active the genome is at certain points.

First, don't forget to register for the exam. I put this slide in every time. The exam will be on March 4th, so make sure that you are registered two weeks before, otherwise I cannot let you participate, which is always a shame. There's always this one guy who forgot to register, really wants to do the exam, and then has to wait for the next one, so don't be that guy. Register now if you can. If you have any problems with registration, let me know as soon as possible, and then I can argue with the Prüfungsbüro (the examination office).

All right, this is the overview for today. We will be talking about experimental design and some common questions that people want answered when they do gene expression experiments. We will talk about normalization and the difference between different types of normalization. We will talk about statistical analysis, of course, like gene expression profiling and the multiple testing issues that come with measuring 50,000 genes. Hey, Commando, welcome to the stream. I will talk a little bit about gene ontology and pathway analysis, and I want to talk a little bit more about visualizations.

But of course, before we start the lecture, we have to go through the assignments from last week. These were all about sequence analysis. I hope they weren't too hard. It was mostly copy-pasting the sequences in and figuring out what the individual buttons do in Clustal W. Hey, Alexander, welcome to the stream as well.
So yeah, turning the knobs on the different parameters has a big effect in the end on the alignments that you get between sequences. The thing here is that I asked you to use Clustal W, and the assignments had an old link. So I hope that everyone was able to find a website that still offers Clustal W, especially since it's a little bit older, right? Nowadays, everyone uses Clustal Omega, which uses smarter techniques (profile hidden Markov models) to build the alignment instead of relying so much on hand-tuned parameters.

But let's just start. In the assignments, we had several sequences that we wanted to analyze: one query sequence and four different sequences of beta-lactamase, which is an enzyme that breaks down beta-lactam antibiotics such as penicillin. There was also a dehydrogenase in there. The idea was that you would just take the sequences and throw them into Clustal W.

So let me show you my Firefox window. I was using the Clustal W tool on genome.jp, because the one that I gave you a link to was not available. And I already copy-pasted the sequences in, because it can take a little while for these things to run, and I did not want to waste any time today waiting for websites to load.

You can see here that it received five different sequences, and it lists the length of each sequence. All of these are protein sequences. It started with the pairwise alignments, so it aligned all of the sequences to each other: a matrix of one versus two, one versus three, and so on. And then at the end, it shows you the alignment file.

The first question was: to which sequence is our query sequence most similar? That is really hard to see here in the alignment itself.
This is more or less the graphical overview of the tree that is produced when it aligns these sequences. You can select different clustering options in the tree menu here; I just took the standard one. When you do that, it shows you this picture, and this picture is more or less what you need to answer the first question.

The answer is that the query sequence is around 96% similar, so it has a very small distance, to the beta-lactamase 2 found in the Bacillus sp1 species. So the query is most closely related to that beta-lactamase 2, and the next closest is the beta-lactamase 2 of the Bacteroides F. And then here you have the other branch of the tree, with the beta-lactamase precursor and the dehydrogenase.

All right, the second question: how many amino acids are identical between all sequences? If we look at the alignment, every time we see a star it means that all of the sequences have the same amino acid at that position. So in this case, it's just counting up all the stars. I didn't count them all, that's a job for you guys, but it's like one, two, three, three or something.

You can see that these sequences are relatively diverse. There are large regions where there is no real overlap. The first two sequences are very similar when you look at them. But the third sequence is rather different: there are only little blocks where it is similar to the first two. It's almost a little bit like trash in, trash out, because the overall alignment is not that good. Of course, the alignment between the first two is relatively good. The third one, the Bacteroides one, also seems to fit.
But the precursor itself does not look very similar.

So question 3A: how many amino acids are highly conserved between all of the sequences? Here we have to count all the colons. A colon means that it is not the exact same amino acid in every sequence, but the amino acids at that position are biochemically very similar. This, of course, depends on which matrix you use to define how amino acids are related to each other. Whether you take a BLOSUM matrix or a PAM matrix has an influence on what the algorithm considers to be similar and what it considers to be different. So just add up all the colons, and those are the positions that are highly conserved. We have weakly conserved positions as well; there wasn't a question about those, but they are the dots in the consensus line.

You can also see that it introduces quite large gaps at some positions, because this beta-lactamase precursor sequence is quite different from the other ones. It still has some similarity, but there are really big gaps in the beta-lactamase 2 compared to the beta-lactamase precursor protein.

All right, the next question was to change all the parameters one by one: put them extremely low, then put them extremely high. That's what you normally do when you want to figure out what something is doing. You take the first parameter, put in a very low value compared to the default, and look at what the difference is, and then you put in a really high value. The question was: can you find parameters at which the alignment is completely nonsensical? And of course, you can.
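The counting for questions 2 and 3A can be sketched in a few lines of Python. This is a minimal sketch that assumes you have copied the Clustal consensus line (the row of `*`, `:` and `.` printed under each alignment block) into a string. The `example` string and the function name are made up for illustration and are not the output from the assignment.

```python
from collections import Counter


def count_conservation(consensus: str) -> dict:
    """Count the Clustal consensus symbols in one consensus line.

    '*' = identical residue in all sequences
    ':' = strongly similar (highly conserved) residues
    '.' = weakly similar residues
    Spaces (no conservation) are ignored.
    """
    counts = Counter(consensus)
    return {
        "identical": counts["*"],
        "highly_conserved": counts[":"],
        "weakly_conserved": counts["."],
    }


# Hypothetical consensus line, NOT taken from the real alignment.
example = "**:. *:  .**"
print(count_conservation(example))
# {'identical': 5, 'highly_conserved': 2, 'weakly_conserved': 2}
```

For a real alignment you would concatenate the consensus lines of all blocks first and then call the function once.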
So I hope that everyone found a parameter set for which the alignment did not make sense. The idea was that, in the process of changing all of these parameters, you see more or less what they do: use a different matrix for the substitution probabilities, and use extreme settings for the gap opening and gap extension penalties. If you set a very high gap opening penalty, then the gap that is introduced at this point can no longer be introduced. The algorithm would rather accept all of the mismatches than open a gap there.

All right, question number four was more about you having to deal with Ensembl, because I think Ensembl is one of the websites that is quite fundamental in bioinformatics. It's the core where everyone gets their information from, one of these central starting points, so it is good to practice a lot with it.

The first step was to download the myostatin gene, both the DNA and the protein sequence. When I say the DNA sequence, I of course mean the coding sequence, so the cDNA sequence. You had to get it from Ensembl for human, gorilla, mouse, pig, chicken, and a species of your own interest. I bet many of you went for a fish, and I don't know, do fish actually have a myostatin gene, or do they regulate muscle growth differently from other animals? Anyway, you could add a fish, or you could say, well, I take a dolphin, because dolphins are still mammals that are relatively close to humans, gorillas, and mice.
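Coming back to the gap penalty: to see why an extreme penalty makes the aligner accept mismatches rather than open a gap, here is a toy global aligner (Needleman-Wunsch). It is a simplified sketch, not what Clustal W actually does: it aligns only two sequences, uses a single linear gap penalty instead of separate opening and extension penalties, and the sequences and scores are made-up values.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for a[i-1] aligned against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic programming matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


# Two made-up sequences that are identical except for a shift of one base.
low = global_align("ACGTACGT", "CGTACGTA", gap=-1)
high = global_align("ACGTACGT", "CGTACGTA", gap=-1500)
for label, (x, y, s) in (("gap=-1", low), ("gap=-1500", high)):
    print(label, x, y, s)
```

With the mild penalty the aligner opens one gap at each end and lines up seven matching bases (score 5). With the penalty of 1500 it refuses to open any gap and accepts eight mismatches instead (score -8), which is the "rather go with all of the mismatches" behaviour.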
So the first thing that you need to do is search for the myostatin gene, so let me just copy it and search.

I have been having some issues with Ensembl; last week it was also really slow. I figured out that when I just type in Ensembl, for some reason it redirects me to the unencrypted HTTP site, and that site seems to be really slow compared to the HTTPS site. I don't know why that is. It should actually force you to upgrade to HTTPS, so it's probably a misconfiguration on their end that they serve the webpage over an insecure connection. In practice that is okay, because who is going to man-in-the-middle a query to Ensembl, but in theory someone could. Anyway, after changing it to HTTPS, searching and everything else was much, much faster.

All right, so go to the myostatin gene. It's called MSTN, and it has GDF8, for growth differentiation factor 8, as a synonym. If you want to export data, use the export data button. In this case we want FASTA sequences, and we want the feature strand; it doesn't make sense to go the other way around. I would just leave it unmasked. We're interested in the cDNA, not so much in the coding sequence or the exons as separate options, but we want to have the peptide sequence as well, so the protein sequence. They call it a peptide here, which is more accurate in a way. We had the lecture about proteins, amino acids, and peptides and the differences between them: if you talk about a protein without the cofactors, then it's called a peptide.
All right, then we click next, and we say that we want the output as text. Here we get the coding sequence, so the cDNA, then we have the peptide, and then here we have the chromosomal DNA. The chromosomal DNA is not that interesting for us, because it also contains the introns, and we're not really interested in the introns. So you select the first two parts and copy-paste them into a text document. Of course, you want to keep the DNA sequences separate from the protein sequences, because Clustal and other alignment tools cannot deal with two different sequence types in one input. And you repeat that for every species.

After you've built up your file, which has the six protein sequences for human, gorilla, mouse, pig, chicken, and your species of interest, so that's actually seven, the idea was to use Clustal W to find the overlap in DNA sequence similarity and the overlap in protein sequence similarity. The question here is: is there more selective pressure to conserve a DNA sequence or a protein sequence between different species?

I'm not going to download all of them and run it for you, but the idea is that the DNA is able to change more than the protein is. The protein is the thing that has an effect or function, so although certain amino acids can change, if there's too much change, it no longer works. But every amino acid is coded by three bases, and the third base is the wobble base, which is allowed to vary, often without changing the amino acid. So the DNA sequence is not as conserved as the protein sequence. That's the answer to the fifth question: there is more selective pressure to preserve the protein sequence.

All right, if there are any questions, let me know.
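The wobble argument can be made concrete with a small Python sketch. The two coding sequences below are invented for illustration (they are not myostatin); they differ only at third codon positions. The codon table is the standard genetic code, and the function names are my own.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence (length divisible by 3) into protein."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical sequences that differ only at wobble (third) positions.
species_1 = "ATGCTGGCAGGTAAA"
species_2 = "ATGCTCGCTGGCAAG"

print(translate(species_1), translate(species_2))  # MLAGK MLAGK
print(f"DNA identity:     {identity(species_1, species_2):.2f}")
print(f"protein identity: {identity(translate(species_1), translate(species_2)):.2f}")
```

Here four of the fifteen bases differ, but the translated protein is identical, which is the pattern you would expect to see on a larger scale when comparing the cDNA and peptide alignments.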
I think the assignments were relatively easy, because it's mostly copy-paste, looking at what happens, and changing some parameters. So I hope that everyone was able to do the assignments and got more or less an idea of what you were supposed to learn: a bit more practice downloading data from Ensembl, and a bit more practice looking at how certain parameters influence the eventual alignment.

So I could make an exam question saying: here is an alignment, what would happen to it if I set the gap opening penalty to 1500? Then you should be able to answer that with a gap opening penalty of 1500 the algorithm would effectively not open any gaps, so the sequences would start misaligning after the position of the first gap that we see in the original alignment.

All right, I don't see any questions in chat, so we will continue with the lecture for today. Let me switch to the slides. The overview we already discussed, so let's just start. I think it's going to be a short lecture today. I do have a lot of slides, 51, but I think we can be relatively quick.

A question from chat: "Would you like to do this in practice with BioMart or online?" How do you mean? "Exporting the sequences of myostatin." I would do it using BioMart. That would be my preferred way, because I like programming in R, so I do everything in R. I would also do the alignments in R, because there's a package called msa for multiple sequence alignments in R. If anyone is interested, I can show you the little script that I made to do multiple sequence alignments of the SARS-CoV-2 protein.
At the beginning, everyone was interested in SARS and where it came from, so I made a little script which downloads all of these sequences. All right, good, so then we will look at it after the lecture. I don't think that the people watching on Moodle are that interested in switching now, but if we keep the lecture short, then we probably have 45 minutes at the end where we can do some R coding and look at SARS-CoV-2 and some particularities of the virus.

A warning in advance, and it is real: the assignments are tough, because I want you to do a bit in R. We had the R introduction lecture, so the assignments for today have a lot of R. I gave you a subset of one of the microarray data sets that we did, and I want you to struggle a bit and see if you can figure it out for yourself. The assignments are difficult, I know that from previous years. Of course, if you have any questions and you get stuck, just ask me by email and I can help you figure it out. But do try it yourself first. Don't type a little bit of code and after two minutes say, ah, I don't know what to do. Try for half an hour, and if after half an hour you're still stuck, send me an email. I will be monitoring my email during the weekend, so I really hope that people send in their questions. If I see people getting stuck at the same point, if I get the same question three times, then I will just send the answer to everyone via the Moodle mailing list.

All right, so the first question, about experimental design: why would you want to do a microarray experiment? Why would you want to measure the expression level of genes?
Because in the end it's not about the gene expression itself. It's not about how much mRNA is produced, it's about how many proteins are produced, right? So the questions here are, first: which genes are differentially expressed between healthy and diseased tissue? This is something that we work with a lot. If you have a certain disease, for example a certain cancer, then we want to know which genes are different in the cancer tissue compared to healthy tissue. If we know what is different, we can design therapeutic targets that bring down the expression of a certain gene, or we can develop therapeutics that target cells with a certain level of expression of a certain gene and, for example, force them into apoptosis.

There are also more fundamental things that you can learn. For example, one of the major findings in cancer research, when you compare cancer tissue with other tissues, is that the cell cycle control genes are usually broken. There are normally three or four checkpoints when a cell replicates and divides. A lot of the genes that would normally become active at these checkpoints, to make sure for example that the DNA is properly duplicated, do not become active in cancer tissue, or they become overactive, allowing the cancer to blast through the cell cycle and multiply like mad.

Another question is: which genes are expressed in which tissue? That is very interesting, especially for us, because we do a lot of research here on the different tissues that are involved in obesity and the metabolic syndrome. So we look a lot at what is happening in the fat, in the muscles, in the liver, in the pancreas, and so on.
And then, if we have different mouse strains, we look at which genes are differentially expressed between the strains, but also between the tissues, in the hope that we learn something about which genetic factors underlie obesity and the metabolic syndrome. Because it's nice that when you suffer from obesity you go to a doctor and the doctor says, well, you should eat less or exercise more, but for some people that doesn't work, because they just have a genetic predisposition to become fat. And if you have a genetic predisposition, then you want to know on a molecular level what's going wrong, to see if you can help these people that way.

Another important question is: how does gene expression change after the administration of a drug or another treatment? This is of course a major field of research, so we can look into that as well.

So, microarrays, a very short history overview. They were developed in the early 1980s. Of course, they weren't called microarrays back then; they were more or less macroarrays. They were glass plates of 9 by 12 centimeters, and you would have something like a very early 3D printer, except that it did not move in the Z direction, so really a 2D printer. This printer would be loaded with all kinds of different probes. We still use these probes, or oligos. The printer would deposit little dots containing very accurately measured amounts of these probes onto the glass slide. On these 9 by 12 centimeter slides you would have a little dot every half centimeter or so. That would mean 18 dots by 24 dots, so you would be able to measure around 100 to 200 genes.
A microarray in itself is nothing more than a collection of microscopic, or macroscopic, DNA dots attached to a solid surface. Each spot on the microarrays that you can buy currently contains around 10 to the minus 12 moles of a specific DNA sequence. This DNA sequence is called a probe, although some people also call them reporters or oligos. Different scientists use different names, partly because the word probe is sometimes also used for markers in the genome. If you really want to be accurate, they should be called oligos, because they are oligonucleotides: on average they are not longer than about 30 bases, so that still classifies as an oligonucleotide.

The idea is that they target labeled cDNA. The labeling of the cDNA can be done using a fluorophore, which is the most common method nowadays. In the old days, people also used silver labeling or chemiluminescence to make the probes light up. You then add the labeled DNA of your sample to the microarray and the DNA swims over it. If there's a match between the oligo on the array and a sequence in the labeled DNA, they will bind together, and from the intensity that you see, you know how much binding there was. So hybridization, the coupling of the target DNA with the oligo, is detected as an intensity, and it gives you a relative abundance of the target sequence. The word relative is important here, and we will get back to that, because normally you would want an absolute quantification, but microarrays generally don't give you that.

So, a very short workflow overview. How would you do this? You have your sample, for example a certain bacterial culture. So I pick some of these colonies and I do some purification, right?
I have to centrifuge them down, or I separate the components using water and phenol. If you mix water with phenol, the two will form layers on top of each other, which is a very common technique. The mRNA will be in the water phase, while the proteins and the DNA are left in the phenol phase. You then take the water phase with the mRNA and add a reverse transcriptase, which makes cDNA, because you need DNA to bind to the DNA on the array. This is called the reverse transcription step, or the RT step.

In the next step, you have the coupling, which just means that you add a dye to your cDNA as a label. Normally we use Cy3 or Cy5: Cy5 is a red fluorophore and Cy3 is a green fluorophore. So then your DNA is labeled.

The next step is that you hybridize it to the slide. You pipette your sample onto this little glass plate, you give it some time to hybridize, and after that you wash away the DNA that has not hybridized to the array. You put the slide in a scanner, which has a laser that excites the fluorophore, and it reads back the intensity of the fluorescence. In the end, you get an intensity ratio plot, and you have to do normalization and analysis, which of course happens on a computer.

Hybridization is the crucial step. Hybridization is based on the fact that complementary DNA sequences bind together in a way that you can undo. There's a nice word for that. Anyway, they bind together through hydrogen bonds, so it's not a permanent bond, it's a reversible one. Depending on the temperature, you can reverse the binding, and that is because it's hydrogen bonding that makes the strands fit together.
And of course, the more complementary the two sequences are, the tighter the binding between the target DNA and the oligo on the array. If you have a lot of mismatches, then the binding won't be as strong, and that is why you have the washing step: you want to get rid of all of the DNA that has not bound properly. That, of course, has an influence on how accurate the microarray is.

But remember, hybridization is always based on complementary binding through hydrogen bonds. And here you have the issue that a CG pair has three hydrogen bonds and an AT pair has only two. So an AT base pair binds more weakly than a CG base pair, and you have to take that into account. Especially when you do the analysis and start looking at the intensities, the intensity is also affected by the number of A's and T's versus the number of C's and G's in the oligo that you are using on the array.

So this is more or less how it looks. There are two types of microarray. You have single-channel, or one-channel, microarrays, which provide intensity data for each probe or probe set, indicating a relative level of hybridization with the labeled target. And then you have two-channel microarrays, where you can take two different samples, for example cancer cells on the one hand and normal cells on the other. You label the one with the red dye and the other with the green dye, you combine them, and then you hybridize the combined sample. When you combine them, you want to do that at matched concentrations, so that the amounts of the two samples are more or less equal. The two-channel microarray is often used when you compare diseased tissue versus healthy tissue.
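To put numbers on the point about AT versus CG pairs, here is a small Python sketch that counts the hydrogen bonds a perfectly matched oligo would form (two per A or T, three per C or G) together with its GC content. The function names and the three 10-base oligos are made up for illustration, and the bond count is only a crude proxy: real binding strength also depends on things like the neighbouring bases and the hybridization conditions.

```python
def hydrogen_bonds(oligo: str) -> int:
    """Hydrogen bonds in a perfectly matched duplex: 2 per A/T, 3 per G/C."""
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 3 * gc


def gc_content(oligo: str) -> float:
    """Fraction of bases in the oligo that are G or C."""
    oligo = oligo.upper()
    return (oligo.count("G") + oligo.count("C")) / len(oligo)


# Three hypothetical 10-base oligos of equal length.
for oligo in ("ATATATATAT", "ATGCATGCAT", "GCGCGCGCGC"):
    print(oligo, hydrogen_bonds(oligo), gc_content(oligo))
# ATATATATAT 20 0.0
# ATGCATGCAT 24 0.4
# GCGCGCGCGC 30 1.0
```

Oligos of the same length can therefore differ by up to 50% in the number of hydrogen bonds, which is one reason why probe intensities are not directly comparable between probes.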
That is also why we have these two complementary colors, Cy3 and Cy5, because that allows us to see whether a gene is more highly expressed in the cancer cells or in the normal cells.

This is more or less how a scanned microarray looks. It is just one of these TIFF files as an example. When you zoom in, you see one, two, three, four, five, six, seven, eight, nine, 10, 11, 12, so this is 12 by 4. This is a 48-array slide: there are 48 little microarrays on a single glass plate, and each of these arrays targets some number of genes. This is just a cutout, because normally a single microarray contains 20,000 or 200,000 little dots.

All microarrays come on slides, and a slide has multiple microarrays on it. That is important when you plan your experiment, because you have to keep in the back of your mind that you always pay per slide and not per microarray. If you have a type of microarray which comes on a 96-array slide, you have to pay for all 96. So if you only have 30 samples, then 66 positions on the slide are not used. Generally, when people talk about a microarray, they mean the single array, so the single little matrix, and when they talk about a slide, they mean the whole thing. But there's always a fixed ratio between the two.

That's why you see in a lot of gene expression papers that people do 96 samples and not 100. People always wonder when they read these papers: why didn't you do 100 samples, or why did you do 16 and not 20? This has to do with the fact that microarrays come in a rectangular layout, often four microarrays in the rows and some number in the columns. The most common formats are 16, 32, 64, and 96.
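The pay-per-slide arithmetic is easy to check when planning an experiment. A minimal Python sketch, where `plan_slides` is a made-up helper name and the slide formats are the ones mentioned above:

```python
import math


def plan_slides(n_samples: int, arrays_per_slide: int) -> tuple[int, int]:
    """Return (slides to buy, unused array positions) for a given design."""
    slides = math.ceil(n_samples / arrays_per_slide)
    unused = slides * arrays_per_slide - n_samples
    return slides, unused


print(plan_slides(30, 96))    # (1, 66): one slide, 66 positions wasted
print(plan_slides(96, 96))    # (1, 0):  one slide, fully used
print(plan_slides(100, 96))   # (2, 92): four extra samples cost a whole second slide
```

The last line shows why designs tend to stop at exactly 96 (or 16, 32, 64) samples rather than at a round number like 100.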
All right, so the applications. We already talked a little bit about the questions that you can answer when you're investigating microarray data. One of the applications is comparative hybridization, to compare the DNA in one sample with the DNA in another sample. You can also use it when you are comparing a new, unknown species to a known species, so as a kind of DNA profiling. If you're in the rainforest and you find this little bug, and you don't know what it is or in which clade it belongs, then you can hybridize it against different known samples, and from that you can learn in which clade of animals you need to look.

More generally, microarrays are used for expression profiling. The total RNA or the messenger RNA is extracted from a sample and reverse transcribed. A poly-T primer is used if you only want to amplify the mRNA, because mRNA ends in a poly-A tail. You can also use random primers, random hexamers, which amplify all of the RNA. But then you're also amplifying things like ribosomal RNA, which you're generally not that interested in. If you are interested in microRNAs, short non-coding RNAs, or any of the other types of RNA that are not messenger RNA, then of course you want to use these random primers.

(To chat: "smiling" is not a valid one, unfortunately.)

So do be aware, if you ever want to do microRNA experiments, that there's a big difference in efficiency. With poly-T primers you only get mRNA, while random primers amplify all of the RNA, and then most of it will be rRNA, so ribosomal RNA, which is present in high amounts in all samples.
Another application is SNP genotyping. You have two probes which differ by one base. You hybridize a sample to them, and then you get two intensities: one for the variant which has, for example, the A allele, and one for the variant which has the G allele. Depending on the ratio between the A and the G intensities, you can say whether a sample is homozygous AA, heterozygous AG, or homozygous GG. This is one of the things that is done a lot. Expression profiling and SNP genotyping are the two main things that people do with microarrays. SNP genotyping will cost you around 80 to 200 euros per sample, and for that you get an idea of the genetic makeup, so which single nucleotide polymorphisms there are in the genome, and you can build up a genetic map.

(To chat: "smile" is also not one. Let me see what the closest one is, let me open up the engine. All right, so it's not called smiling, it's called grinning, if you want to change yours. I'm sorry, I should have put in a lot more.)

So those are the two major things that people use microarrays for: expression profiling and SNP genotyping. SNP genotyping is also what happens when you buy one of these kits from 23andMe or Hereditary or these kinds of companies. They don't sequence your genome. Generally they do SNP genotyping with a high-density array, so they measure your DNA at 200,000 little points. Then, based on the known genome sequence, they impute the missing data. Say that in the database we have an individual which has an A at a certain position and then a T at a later position.
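As a brief aside before continuing with imputation, the genotype call from the two probe intensities described earlier can be sketched in Python. This is an illustration only: the function name, the fixed cut-offs of 0.2 and 0.8, and the intensity values are assumptions of mine, and in practice genotype callers typically cluster many samples rather than rely on fixed thresholds.

```python
def call_genotype(intensity_a: float, intensity_g: float,
                  low: float = 0.2, high: float = 0.8) -> str:
    """Call AA / AG / GG from the intensities of the A and G probes.

    Uses the fraction of the total signal coming from the A probe.
    The thresholds `low` and `high` are illustrative assumptions.
    """
    total = intensity_a + intensity_g
    if total <= 0:
        return "no call"          # no usable signal on either probe
    fraction_a = intensity_a / total
    if fraction_a >= high:
        return "AA"               # almost all signal from the A probe
    if fraction_a <= low:
        return "GG"               # almost all signal from the G probe
    return "AG"                   # both probes light up


# Hypothetical (A intensity, G intensity) pairs for three samples.
for a, g in ((950, 40), (480, 510), (30, 900)):
    print(a, g, call_genotype(a, g))
# 950 40 AA
# 480 510 AG
# 30 900 GG
```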
And then you can of course infer what the bases should have been between the A and the T, if you have a large amount of data from random people.
One of the newer applications, which is not used that much, is looking at epigenetic effects, so modifications which are made on the DNA. For that, people nowadays generally go to DNA sequencing or RNA sequencing. But you can look at whether a region in the genome is methylated or not. If you have a C base in a CG, then the C can carry a methyl group. You can use microarrays to get a genome-wide overview of which areas of the genome are methylated and not accessible, and which regions of the genome are not methylated and are accessible for expression. So that's a slightly different step. You do that using ChIP-on-chip kind of methods, where you first use a particular protein: DNA bound to that protein is immunoprecipitated, and then you do an epigenetic or a regulation study.
So all of these things can be done using microarrays. The two most common ones are expression profiling and SNP genotyping, and comparative hybridization is still used a lot when you just wanna compare two tissues or two other things to each other.
All right, so the common microarray workflow. I think I showed you guys this slide already at the beginning, but I'm just gonna go through it, because this is more or less how you do it. First you have to create your microarray, if there's no microarray available. For a lot of species, like humans and mouse and cows and these kinds of things, you can just buy a microarray. But if you're working on some kind of tropical fish that you are studying plus five other people in the world, then of course there won't be a microarray specific for your type of fish or your type of lizard.
And so then you need to create your own oligo arrays, and creating oligo arrays is very similar to doing primer design. We had our primer design lecture, and of course for oligo arrays the same kind of parameters are important as when you are doing primer design. So hybridization temperatures and all of these things you have to take care of. Fortunately, many of the big microarray manufacturers have a nice web interface where you can submit the probes that you want to put on the array. They then do kind of a Primer3 check of all of your probes to see if they are compatible with each other, and they will suggest alternatives.
But it is one of the fields of bioinformatics. Currently, in our group, we have designed a new microarray which does SNP genotyping for a very specific type of cattle called Deutsches Schwarzbuntes Niederungsrind, which is a very small breed. There are something like 3,000 to 4,000 animals here in Germany and a couple more outside of Germany. For this breed there is no breed-specific array. There are cow arrays, but if you just use the cow arrays you miss the thing that you want to see, because these cattle are interesting precisely because they have genetic variants which do not occur in the standard Holstein-Friesian cow. So that's why we designed a microarray. That's science where you can write a paper on how you designed the array.
In the end you get a file which is called a TDT file, which is more or less a description of what the probes on the array should look like, and a company can use this TDT file to instruct the spotter to make these microarrays. And then you have the biological part. The biological part is to acquire your samples and extract the nucleic acids. So you just get a PhD student that does that for you in the lab.
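To make the probe checks mentioned above a bit more concrete, here is a minimal Python sketch of the kind of screening a design tool might do. It is not what Primer3 or any manufacturer actually runs. The helper names, the example probes, and the stem length of 4 are assumptions for illustration, and the melting temperature uses a common rule-of-thumb formula for oligos longer than about 13 bases, which is only a rough estimate.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (upper-case ACGT only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def rough_tm(seq):
    """Rule-of-thumb melting temperature in degrees Celsius (rough estimate only)."""
    gc_count = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc_count - 16.4) / len(seq)


def has_self_complement(seq, stem=4):
    """True if a stretch of `stem` bases could pair with another part of the
    same probe, which hints at hairpins or self-dimers."""
    kmers = {seq[i:i + stem] for i in range(len(seq) - stem + 1)}
    rc = reverse_complement(seq)
    return any(rc[i:i + stem] in kmers for i in range(len(rc) - stem + 1))


# Hypothetical candidate probes, far shorter than real array probes.
probes = {
    "probe_1": "ACACACACCACACCAC",
    "probe_2": "GGGGAAAATTTTCCCC",
}
report = {
    name: {
        "gc": gc_content(seq),
        "tm": rough_tm(seq),
        "self_complement": has_self_complement(seq),
    }
    for name, seq in probes.items()
}
for name, values in report.items():
    print(name, values)
```

A real design also has to check every probe against the rest of the genome for specificity and against the other probes on the array, which this sketch leaves out.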
And then, of course, the hybridization and scanning is something which bioinformatics is involved with again, because you have to have software which starts from a TIFF file and reads out the laser intensities. So that's also a field. There are people working at the companies that make these machines who are involved in image processing and image matching. There's a lot that you can do there, because you need to do spot detection on these arrays. So there's a whole bunch of additional steps where still a lot of optimization can take place.
Then the data storage is done in a CEL file, and CEL files are the files that you usually distribute. So if you have done an experiment and you want to share it, or you wrote a paper about your experiment and the journal says you need to make your data publicly available, then generally you are sharing the CEL files. The CEL files contain the information on the probes, like what the composition of the probes is and what the intensities were, but they also have a whole bunch of other metrics that are normally lost when you go to a TXT file in the next step.
What you then do as a bioinformatician is take several of these CEL files and extract the expression levels from them. You then have to do data normalization, and the output of that is also just a standard text file. The next step is gene expression clustering, where you cluster samples together or you cluster genes together to see if there are patterns in your data: are the immune genes upregulated, or is it all of the genes involved in muscle development which are up- or down-regulated? And then of course the next step is data interpretation, which doesn't really have a file type, so I just call it a TXT file.
Data interpretation is normally done in something like a Word document, where you just write down: okay, this is what I see, these are the patterns, and this is what I want to do with it.
All right, so where is bioinformatics involved? Like I told you, the design of the oligomers is very similar to primer design. It's highly specific. The oligos should target only the sequence that you're interested in. They are not allowed to have any interactions with themselves, so they cannot form hairpins. If you have to design a microarray in the future, then of course the primer design lecture will help you to have an overview of what you should look out for and which mistakes you shouldn't make. Then there's the image processing software, like spot identification and the calculation of the different intensities, where bioinformatics is also involved. And the analysis part and the statistics are of course also the bioinformatic part and the bioinformatic responsibility when you do a microarray experiment.
All right, so the analysis of microarrays is more or less three steps. The first step is the normalization. Normalization is done because you need to compensate for a lot of artifacts which are not due to biology but which are due to the way that microarrays work. One of the most annoying things, well, I wouldn't say annoying, but one of the things which is common in microarrays is that Cy3 and Cy5 do not behave the same. The intensities that you get from a Cy3 probe are different from those of a Cy5 probe. The dynamic ranges of the two dyes are very different, so you have to compensate for that. There's also a lot of variation that can happen during the hybridization phase.
For example, say you do your hybridization and the temperature outside is five degrees warmer. Inside the lab you can have air conditioning and try to keep it exactly at 20 degrees, but there will always be little temperature variations and little variations in air humidity. And of course, that makes a microarray scanned on day one slightly different from a microarray scanned on day three.
One of the other variations which is very common is in the quality and quantity of the target DNA. If you have a very high quantity of target DNA, then of course you should put less on the array, but again, you cannot be 100% accurate with these things. The quality also matters a lot, like if your DNA is a little bit fragmented, or if your reverse transcriptase step went wrong, so it didn't amplify the sequences properly because it had a bias, or in one sample you have a massive amount of these ribosomal genes. Of course, these things have to be compensated for.
Another thing is that microarrays are manufactured in batches. The best thing would be to get your microarrays all from the same batch, but that doesn't always happen. It sometimes happens that one of your microarrays was produced a year ago and the other microarrays were produced yesterday or 10 days ago, right? So this also makes a difference. Every batch of microarrays has its own unique qualities. This again depends on things like the temperature when they were manufactured, but also on the quality and accuracy with which they can be manufactured. So that's why we definitely need to normalize.
Furthermore, you want to compare groups of data, because generally you have diseased tissue versus healthy tissue, or you're comparing fat cells with liver cells with brain cells and these kinds of things. So you want to compare data of different groups, different biological groups.
And of course, you have to use statistical methods to do that. Then, generally, in the last step you want to cluster the data and look for similar expression profiles. You wanna see that, oh, in the cells that were infected by a certain virus these genes go up, while in normal cells this doesn't happen. That would make these genes important for figuring out how the virus is entering the cell or how it is replicating. So clustering is one of these things that we do a lot, because you can't look at 50,000 genes at the same time, but you can look at a tree which has 50,000 leaf nodes and still see patterns there. In a matrix filled with numbers, seeing patterns is really hard.
All right, how long have I been talking? I've been talking for like 46 minutes. So I think I'm gonna take a short break, and then we come back and talk a little bit more in detail about normalization. There are two different types, but I will stop the recording now, then I can have a cigarette and go to the toilet, and I will be back in like 10 minutes. For the breaks I have prepared really beautiful GIFs for you. The first break is cows... no, pigs. First break is pigs, second break is koalas. So in the meantime, enjoy the sweet and loving cuddling pigs during the break. All right, then see you in five to 10 minutes.
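For anyone reading along afterwards, the first two analysis steps described above (normalize, then compare groups) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the method covered after the break. The intensities are invented, median centering of log2 values stands in for a real normalization method, and the function names and the Welch t statistic are choices made for this example only.

```python
import math
import statistics

# Invented raw intensities for 5 genes on 6 arrays:
# columns 0-2 are healthy samples, columns 3-5 are infected samples.
# Array 1 was scanned about twice as bright and array 5 about half as bright.
raw = {
    "geneA": [100, 220, 95, 410, 390, 200],
    "geneB": [200, 400, 210, 190, 205, 100],
    "geneC": [400, 800, 400, 400, 400, 200],
    "geneD": [800, 1650, 790, 820, 800, 410],
    "geneE": [1600, 3100, 1650, 390, 420, 205],
}
HEALTHY, INFECTED = [0, 1, 2], [3, 4, 5]


def median_center(raw_values):
    """log2-transform, then subtract each array's median so arrays are comparable."""
    logged = {g: [math.log2(v) for v in vals] for g, vals in raw_values.items()}
    n_arrays = len(next(iter(logged.values())))
    medians = [
        statistics.median([vals[j] for vals in logged.values()])
        for j in range(n_arrays)
    ]
    return {g: [v - m for v, m in zip(vals, medians)] for g, vals in logged.items()}


def welch_t(x, y):
    """Welch t statistic for mean(y) - mean(x); returns 0.0 if there is no variation."""
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    if se == 0:
        return 0.0
    return (statistics.mean(y) - statistics.mean(x)) / se


normalized = median_center(raw)
log_fold_change = {}
t_stat = {}
for gene, values in normalized.items():
    healthy = [values[j] for j in HEALTHY]
    infected = [values[j] for j in INFECTED]
    log_fold_change[gene] = statistics.mean(infected) - statistics.mean(healthy)
    t_stat[gene] = welch_t(healthy, infected)

for gene in raw:
    print(gene, round(log_fold_change[gene], 2), round(t_stat[gene], 1))
```

Median centering only works when most genes do not change between groups, and with 50,000 genes the resulting statistics would still need a multiple testing correction before any gene is called significant.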