Welcome back to the third part of the lecture. We have around 40 minutes left, and I think that I can finish. I still have about one third of the slides, so that actually fits very well. OK, I get a recording from Florian. So actually, you can already hear me and see me, since I'm still watching the penguins go by on the coffee break screen. It's interesting that I'm so far behind myself, but all right. So thanks for reminding me. Who's still here, by the way? I'm seeing the viewer count drop a little bit, so it's getting late, I know. And it's a difficult lecture. There are a lot of different RNAs, and it's not really a lot of bioinformatics. It's more like biochemistry, biomolecular sciences, molecular biology. But just type something in chat if you're still here, then I know how many people are still actively watching, and how many people are already falling asleep. All right, so at least that's two or three that are still here. I think my moderator is also still here. Very, very good. All right, so a short, OK, yeah. Yeah, so there's a delay of about two minutes, I think, something like that. Good, so short story about... Not sleeping. Very good, very good. Short story about the Berlin muscle mouse. Just some slides to show you why RNA sequencing can be really useful. So we have these four mouse strains here in our mouse house: the 866, which is the really, really fat one, and then we have the 816 and the 806 as well. Well, not fat, they're muscular, right? They look fat, but it's all muscle. It's not big bones, it's all muscle. So these are three different mouse strains that we did RNA sequencing on, and the Black 6 (B6) mouse is the reference mouse, so you don't have to sequence it. So we didn't do that, but we use this Black 6 mouse as our reference mouse, so we use it to align all the reads against. All right, so here we have more or less the results. So we extracted the RNA and all of these things.
So when you look into the IGV, after you've done all the steps and have aligned all of the data, then what you can see very clearly is the reads. So the reads are here, you have these bars, and these bars are the reads. Then you have these long stretches, which you would normally, when you do DNA sequencing, call deletions, right? Because a read more or less comes from here, then the read ends and continues further on. But since we are doing RNA sequencing, this is very common, and this has to do with the intron-exon structure. Because we are only sequencing mature mRNA, we are not seeing any of the introns. So what you see here on the bottom is the myostatin gene. The myostatin gene is one of the genes which regulate muscle growth, and it's a negative regulator, meaning that when you knock out the gene, muscles grow uncontrollably. So this is kind of a brake on muscle growth. You very clearly see that most of the reads that we have, or almost all of the reads that we have, fall exactly on top of the intron-exon structure that you see. So reads are coming from the exons, and they are not coming from the introns, which means that we did a good job, right? In the lab, the people did a really good job removing all of the pre-mRNA. The DNA was not in there anymore, so the sample preparation was really well done. And then here you can see the depth. So these little wavy things are how many reads we have when you add them all up. And then here on the top you see the variants that we call. So you see the 806 at the top, then the 816 in the middle, and then you see the 866. And why are we looking at myostatin? Well, it's the main regulator. But what we see here is very interesting, because what we can clearly see is that the 806 and the 816 have mutations at the end of the gene.
So this is here, if you look at the structure below: the big parts are the parts which are coding for the amino acids, and these are the untranslated regions. So they are the control sections of the messenger RNA. So you see that the 806 and the 816, the ones which are kind of medium muscled, have mutations in this control region. That is probably the reason why they have more muscle than a standard mouse: the expression level of this gene is changed, so that these mice have slightly lower amounts of myostatin, meaning that muscle growth is not inhibited as much. So they just become a little bit more muscled. But for the 866, you see that there are no mutations at the end, but there's a single mutation here. And this mutation is actually a little deletion, which deletes part of the myostatin gene. This is causing these mice to become extremely muscular, because this little deletion makes it so that the gene is completely inactivated. And because this gene is completely inactivated, muscles can grow uncontrollably. So you see that the expression level of the gene is more or less similar in all three mouse strains. You can see that the gene is expressed, because we did find the mRNA. But the protein which is being produced stops here: because of this deletion, an early stop codon is introduced. So the protein is only made up until the second exon, and the last part, although it is on the mRNA, is not being translated into protein. So very clear results, and this is something that you cannot find out using, for example, microarrays. If you would do microarrays, then the microarray will tell you that the myostatin gene is expressed in all three lines. It's probably expressed at a level very similar to the B6, although the first two might be somewhat lower than the B6, which might explain why they are a bit more muscled.
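To make the early stop codon idea concrete, here is a minimal Python sketch. It assumes a made-up toy coding sequence (not the real myostatin gene) and a hypothetical single-base deletion; the only real-world ingredient is the standard genetic code. It shows how a deletion whose length is not a multiple of three shifts the reading frame and can bring a stop codon forward, so the mRNA is still there but the protein is cut short.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence and stop at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":  # stop codon: the ribosome releases the protein here
            break
        protein.append(aa)
    return "".join(protein)


def delete(seq, start, length):
    """Remove `length` bases starting at 0-based position `start`."""
    return seq[:start] + seq[start + length:]


# Toy sequence for illustration only, not a real gene.
wild_type = "ATGGCTGAATGCCTGAAAGGTTAA"
mutant = delete(wild_type, start=7, length=1)  # hypothetical 1 bp deletion

wild_protein = translate(wild_type)
mutant_protein = translate(mutant)
print(wild_protein, mutant_protein)
```

In this toy case the wild-type sequence gives a seven amino acid peptide, while the deletion moves a TGA into frame after four amino acids. A microarray probe would still light up for both transcripts, which is the point made above.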
But in this case, this single mutation, this little deletion that is found within the myostatin gene, is causing the gene to be broken and is causing this mouse to develop an extreme muscular phenotype. So if we then look into that a little bit closer, here we zoom into exon 2, and you can indeed see that there's a relatively big deletion going on. This relatively big deletion is the one that is really causing this protein to be non-functional, causing these mice to be extremely muscular. All right, so we talked about microarrays already a lot. And I think we already told you that there are one-color microarrays, which are microarrays which have just a single color instead of two colors. One of the advantages there is that it's easy for you to compare between different studies and to do things like compensating for a batch effect. And the big drawback of using one-color microarrays is that you need two times the amount of microarrays to measure all of your samples, right? Because if you're using two-color microarrays, then you're measuring the relative abundance, so you're comparing one sample versus another sample. But then of course the big drawback is that if one of these samples is relatively different from the other samples, this would be a big issue, because all of a sudden a whole microarray will show you that all of the genes are downregulated, just because one of the samples wasn't processed very well. I don't want to talk too much about microarrays. I just want to tell you that there are nowadays different types of microarrays called tiling arrays. These tiling arrays are there to increase the resolution and also, when you use microarrays, to be able to follow the different transcripts of a gene. So how does this work? Well, instead of having probes which are targeting genes very specifically, you can have probes which tile across the whole genome.
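As a rough illustration of what "tiling" means, here is a small Python sketch. The region, the probe length and the step size are made-up toy values (real arrays use much longer probes and whole genomes), and the function names are just for this example. A step smaller than the probe length gives partially overlapping probes, a step equal to the probe length gives non-overlapping ones, and the reverse complement covers the other strand.

```python
def tile_probes(sequence, probe_length, step):
    """Return (start, probe) pairs tiled along one strand of `sequence`."""
    last_start = len(sequence) - probe_length
    return [
        (start, sequence[start:start + probe_length])
        for start in range(0, last_start + 1, step)
    ]


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


region = "ACGTACGTTAGC"  # toy stretch of genome, illustration only

overlapping = tile_probes(region, probe_length=6, step=3)
non_overlapping = tile_probes(region, probe_length=6, step=6)
minus_strand = tile_probes(reverse_complement(region), probe_length=6, step=6)

print(overlapping)
print(non_overlapping)
print(minus_strand)
```

Note that the probes are generated without looking at any gene annotation at all, which is exactly why many of them end up in regions that never show up in mature mRNA.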
So these microarrays have millions and millions of probes on there. And these probes can be either partially overlapping or not overlapping. But what happens is that every section of the genome has a microarray probe targeted towards it. So there are probes that are also in the introns, and usually you also tile both strands of the DNA, to make sure that you also catch genes which are on the negative strand. The big drawback of these tiling arrays is that a lot of the probes on these arrays are non-functional, because microarrays measure RNA, and of course not all of the genome is being transcribed into messenger RNA. You have introns which are not caught, and so all of the probes which are located in the introns will not work: there won't be any signal coming from them, because these parts are not in the mature mRNA. But this is one of the ways in which microarrays try to be as informative as sequencing. And tiling arrays are relatively expensive, but they are still a lot cheaper than doing a full RNA-seq experiment, which is much more expensive. So tiling arrays have the advantage of an increased resolution, because you have tiles across the entire genome. There are probes in the introns, which is nice, but it's directly also the thing that leads to a large number of non-functional probes, because introns, although they are transcribed into the pre-mRNA, are not what you are generally interested in. You are interested in the mature mRNA, because that is the thing that is producing the proteins. Again this same slide, I think we already had it. So when you do bioinformatics analysis on microarrays, these are the several steps that you go through, and it's just something to remind you guys that I do want you in the end to know which steps are there when you do a microarray experiment.
So you create the arrays, you acquire your samples, you extract RNA, you go from RNA to DNA using reverse transcription, then you do a PCR step, which is optional. You label your samples using Cy3 and Cy5 if you have a two-color microarray; if you have a one-color microarray you generally use Cy3. Then you do hybridization, you scan the arrays, and that produces a TIFF file, which is just an image file. This image file is then processed, and the data storage is generally done in the CEL format. Then you do data normalization, you extract the expression levels, you do some clustering on the gene expression, and then you try to interpret the data that you get from your microarray. During the assignments we will do the normalization and the extraction of expression levels using R. So again, a little bit of R. I think it's mostly copy-paste, but there are one or two steps where you have to add a little bit of code and you have to struggle with the data to see if you can get something meaningful out of there. So when we talk about microarrays I also have to mention that there is something called MIAME, and MIAME is the Minimum Information About a Microarray Experiment. So if you ever want to publish about your microarray experiment, then you have to realize that you have to write down a minimum amount of information: which sample you put on the microarray, how much RNA you extracted, how much reverse transcriptase you used, how much DNA you ended up with. It is a list of things which is intended to specify all of the information that is necessary to interpret the results of the experiment unambiguously and to potentially reproduce the experiment. But this MIAME, although it is a very good structure, the thing that I don't really like about it is that it does not specify the data format in which you have to store this information.
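Just to show what a standardized, machine-readable version of such a record could look like, here is a minimal Python sketch. To be very clear: the field names and the values below are hypothetical placeholders chosen for this example, they are not an official MIAME schema and not data from a real experiment. The sketch only illustrates the idea of one fixed structure per array that can be written out and read back without loss.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ArrayRecord:
    """One hypothetical per-array record; field names are illustrative only."""
    sample_id: str
    tissue: str
    rna_extracted_ug: float   # amount of RNA extracted, in micrograms
    cdna_yield_ug: float      # amount of DNA after reverse transcription
    label: str                # dye used, e.g. "Cy3" or "Cy5"
    scan_file: str            # image file produced by the scanner


# Placeholder values, not measurements from a real experiment.
record = ArrayRecord(
    sample_id="sample_001",
    tissue="skeletal muscle",
    rna_extracted_ug=2.5,
    cdna_yield_ug=1.8,
    label="Cy3",
    scan_file="sample_001.tif",
)

# Serialize to JSON text and read it back again.
as_json = json.dumps(asdict(record), indent=2, sort_keys=True)
restored = ArrayRecord(**json.loads(as_json))
print(as_json)
```

The design choice here is simply that every array gets the same set of named fields, so someone else (or a script) can check whether anything is missing.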
So this information can be put in an Excel file or in a text file or in a Word document. And generally you want to have this standardized as well. But realize that if in the future you are thinking about doing a microarray experiment, there is a standard that you have to adhere to: you have to write down all of the different steps of the experiment, and for every step that you do you have to record a piece of information, so that someone else can redo the experiment and so that the experiment can be interpreted. All right, and then we get to the part where it's going to be fun, or at least the thing that I think is really fun, and that is the free microarray data. So like I told you guys in the beginning, free microarray data is available, and there are two main sources of free microarray data. One of them is the Gene Expression Omnibus (GEO), which is run by the NCBI and which has around 25,000 microarray experiments stored in there, so around 600,000 arrays. So 600,000 free microarrays that you could just download and do analysis on. The Gene Expression Omnibus only provides storage and retrieval. So anyone can upload data there, and that is also one of the drawbacks, because they don't do any curation. The data is more or less unsorted, and they do follow this MIAME standard, but it is very difficult to see exactly how good an array is or what the exact conditions were under which they ran the experiment. Then there is also another database called ArrayExpress. This has slightly fewer experiments in there, but they have more arrays. So in total their archive has around 24,000 experiments, around 700,000 free microarrays that you can get. Of course there's some overlap between the two databases, but the nice thing about ArrayExpress is that it is curated.
So they have a Gene Expression Atlas set, which is around 5,000 to 6,000 experiments, around 130,000 arrays, which are manually curated and re-annotated from the archive data. So that means that this data is of really, really high quality. So if you are lucky and you're looking for, for example, this plant, and I want to have a certain leaf or I want to have the flower, and I want to have gene expression data done on this exact part of either my plant or my mouse or my cow that I'm interested in, then if they have this data you can get it for free. So you don't have to do the experiment yourself, you don't have to buy 100 microarrays, you can just download this data from them. So they provide storage, retrieval and a couple of analysis tools which you can use online, and they have different biological conditions and experiments, or at least in the Gene Expression Atlas. These are all curated, so that means that you can actually compare across different experiments, which makes it a really, really nice resource. And again, there's probably a couple of Nature papers still hidden inside of the amount of data that they have there. So just by downloading some of the data and asking the right questions you can probably get some really nice high-scoring publications out of these two data sets. So the Gene Expression Omnibus, you can find it here at the NCBI under GEO, and it looks like this. It's a public functional genomics data repository supporting MIAME-compliant data submissions; array- and sequence-based data are accepted. Tools are provided to help users query and download experiments and curated gene expression profiles. So they do have some curation, but not for the raw data that you get there. And the website looks very simple. It's just a very basic keyword or GEO accession search, so you can fill in your keyword here, press search, and then see if any of the data or any of the microarrays that you find are suitable for your experiment. So if you look at the Gene Expression Omnibus that lives, oh, wait a second.
Yeah, no, it's the same database. So this is still GEO. So if you search, then you get an overview like this. And you see that, for example, there's acute dengue from patients, so whole blood, and you have embryonic stem cells. So there's a lot of different data in there. So if you ever think, well, I have done an experiment and I think that this is the gene that might be responsible, then you can look in GEO to see if you can find more data or microarray data to back up your conclusions there. All right, did I not make a nice, I didn't do the one from ArrayExpress. I actually thought that I put a, this is GEO, this is GEO. Okay, I put in two slides of GEO. I wanted to have one slide of GEO and one slide of ArrayExpress, but ArrayExpress, you can just find it. All right, so then there's of course more free mRNA expression data. And one of the things that I think is a good database to have a little bit of a look in is BioGPS. Here they took a slightly different approach. Here you have expression across different cell types, so across different tissues. So if, for example, I find the myostatin gene in my experiment, and now I want to know in which tissues this gene is expressed, then you can query BioGPS, and they will show you in which tissue your gene is expressed and at which level. So sometimes it's very interesting to know that the gene that you are working on is also expressed in a different tissue, which might lead you to say, well, if my gene is expressed relatively highly in a different tissue, then I might want to extract RNA from that tissue instead of from the tissue that I originally started looking at. So for example, for myostatin, if you figure out that it's much more highly expressed in skeletal muscle than it is in heart, then of course you would do skeletal muscle instead of doing heart.
But BioGPS, they have a very good overview of which gene is expressed in which tissue, and of course you can also get access to the raw data and download it. If you want to get free sequencing data, so free DNA and RNA sequencing data, you can use the Sequence Read Archive (SRA) from NCBI, formerly known as the Short Read Archive, but they renamed themselves to Sequence Read Archive. And it lives there. They even have cloud availability, so that you can directly do the analysis on an Amazon instance instead of having to download it all to your own computer. And the nice thing here is that the download of the data can be done via the command line. So you can use the SRA Toolkit on macOS, on Windows, and on Linux to automatically download data sets from their repository. So they have a lot of sequencing data in there. This is how it looks when you go to the website. It looks very, well, Windows 95 in a way, it's not fully responsive web design or the most beautiful web design. But in this case, it's about the data which is in there. So for example, I did a search for how many SRA objects there are related to COVID-19. So you can see that there are almost 142,000 publicly accessible SRA experiments where they did sequencing on the coronavirus. And you can also see that there are 74 studies. So a study can have multiple experiments. And you see that in total, there are 139,000 samples which have been deposited in the SRA. So it's really a massive, massive database. And of course, they are also coupled to GEO. So you can see how many Gene Expression Omnibus data sets are related to this. So it combines, on the one hand, access to the sequencing data, and on the other hand, access to the microarray data, all in one go. So a really, really good archive to get some sequencing data if you have no sequencing data or you don't want to spend the money to generate a whole bunch of sequencing data.
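Searching these archives can also be scripted. Below is a minimal Python sketch, under a few assumptions: it uses the NCBI E-utilities `esearch` endpoint with the database names `sra` and `gds` (GEO DataSets), and it uses the SRA Toolkit programs `prefetch` and `fasterq-dump`. The search terms and the run accession `SRR0000000` are placeholders, not real identifiers from the slides. The sketch only builds the URLs and command lines, it does not contact NCBI or download anything.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(database, term, retmax=20):
    """Build (but do not send) an E-utilities search URL."""
    params = {"db": database, "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS_ESEARCH + "?" + urlencode(params)


def sra_toolkit_commands(run_accession):
    """Command lines to fetch one run and convert it to FASTQ.

    The commands are only constructed here; running them requires an
    installed SRA Toolkit and a real run accession.
    """
    return [["prefetch", run_accession], ["fasterq-dump", run_accession]]


sra_url = esearch_url("sra", "SARS-CoV-2")
geo_url = esearch_url("gds", "myostatin")
commands = sra_toolkit_commands("SRR0000000")  # placeholder, not a real run

print(sra_url)
print(geo_url)
for command in commands:
    print(" ".join(command))
```

Opening one of the printed URLs in a browser should return a list of matching record IDs, and the printed commands are what you would type in a terminal once you have picked an actual run.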
You can get free sequencing data from there and, for example, practice sequencing data alignment and these kinds of things. All right, so the last part, I have about 10 slides left, but I think we can go through it quite quickly and still finish at five, is the structure of RNA. So I told you that with RNA, when you write it down, people always have the idea that it's just a linear molecule, right? If you think about messenger RNA, you always see these linear pictures with introns and exons. But RNA is something which is biochemically active. It has a secondary and a tertiary structure. And the 3D structure, so the tertiary structure, determines how the RNA works. So if you look at this cloverleaf design of the tRNA, in the real world, if you would look at it in my 3D visualization app, it would look like this. The colors here are the same as the colors here. So you can see that, although we are always saying that you have the anticodon arm and so on, in 3D some parts are much closer together than you would expect just based on either the linear sequence or even based on the secondary structure. But you can see that the secondary structure is in a way closer to the 3D structure than what you get if you write it out on a linear scale. So the function of RNA, or the way that RNA works, is based on the 3D structure of the RNA. And you can see that the RNA itself is formed in a way which allows it to have this function, right? So it's not this helical thing here on the side, which people always imagine it to be. No, it's really a 3D working object. It's more like a crane. If you would draw a blueprint of a crane, then it would not look like the eventual machine. And RNA in that sense is kind of a machine. So one major task of bioinformaticians is to predict the structure of RNA.
So predicting the secondary and the tertiary structure of RNA. And there are a great many tools available that do predictions of the secondary structure of RNA. What they do is they take a sequence, RNA or DNA, and create a group of highly probable, annotated secondary structures. Then they order them by lowest free energy, and that way they predict a secondary structure. A couple of examples are things like RNAfold, ContextFold or RNAshapes. And I wanted to go through the analysis of how you go from a primary structure, so just the sequence, to a secondary structure, and tell you a little bit about how you can interpret these results. So the RNAfold web server looks like this. You go to this address, there's RNAfold running here, and here you can just paste or type your sequence in. And then you can select which algorithm you want to use. I always use the first one, the minimum free energy and partition function. And you can say that you want no GU pairs at the end. So you have a couple of options with which you can tweak the prediction, but the prediction itself is actually very simple. It uses a dynamic programming algorithm together with energy parameters that were measured on other RNA molecules to predict the structure of a new RNA molecule. So one part measured energy parameters, one part searching for the minimum free energy. And so it does an in silico prediction of how this molecule would fold when you put it into water. So one of the things that I wanted to show you guys is, well, if we think about, for example, SARS-CoV-2, one of the RNA viruses which is out there, then it has this envelope protein, right? So this envelope protein is on the outside, and the spikes are embedded next to it so that the virus can enter the cell. But of course, the RNA of the virus itself also has a certain structure. And this structure can also cause other side effects.
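To give a feeling for how such a prediction can be computed at all, here is a minimal Python sketch of a much simpler relative of these tools: the classic Nussinov-style dynamic programming that just maximizes the number of base pairs. This is explicitly not what RNAfold does (it does not use free energies at all), and the function names, the `min_loop` default of 3 unpaired bases in a hairpin and the test sequences are assumptions made for this example. It only illustrates the principle of filling a table over all subsequences.

```python
def can_pair(a, b, allow_gu=True):
    """Watson-Crick pairs, optionally plus G-U wobble pairs."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    if allow_gu:
        pairs |= {("G", "U"), ("U", "G")}
    return (a, b) in pairs


def max_base_pairs(rna, min_loop=3, allow_gu=True):
    """Maximum number of nested base pairs (Nussinov-style recursion).

    best[i][j] holds the best pair count for the subsequence rna[i..j].
    A pair (k, j) needs at least `min_loop` unpaired bases in between.
    """
    n = len(rna)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with k
                if can_pair(rna[k], rna[j], allow_gu):
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0


hairpin_pairs = max_base_pairs("GGGAAAUCCC")
print(hairpin_pairs)
```

Real tools replace "count the pairs" with measured energy contributions for stacks and loops, but the table-filling idea is the same, which is why the prediction is fast even for fairly long sequences.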
So one of the things that I wanted to show you is how you can do, for example, a prediction of how the RNA of the envelope gene looks. So the first thing is, of course, you have to get the RNA sequence, which you can get from NCBI. So you can go to NCBI, to the Gene database. And then the thing that I did is just search for SARS-CoV-2 envelope protein, just to get a sequence to work on. And then the next step is to take this sequence that you downloaded, go to RNAfold and throw it in there to do the prediction. So when you go to NCBI, you have to make sure that the search is set to Gene. You search for SARS-CoV-2 envelope protein, and then I just took the first one. There are literally hundreds of them in there, because people already did a lot of sequencing and there are different variations in there. But I just took the first one from the list. So you fill in your search term, you press enter, and then you just click on this little E, because the envelope protein is called E. It doesn't have a more extensive name. And then what you get is an overview. So you see here the genomic context, you see this part of the viral genome. You see the spike protein. Then you see that there's something called ORF3a. Then you have the envelope protein, so the protein which envelopes the virus particle. And then you have some other proteins in the back. If you want to get the sequence, you can just click the FASTA link here and download the FASTA sequence. In this case, you just get a text file which has the sequence. And then what I did is I went to the RNAfold web server and filled in the sequence that I just downloaded. And we can see here that this is severe acute respiratory syndrome coronavirus 2, and this is from Wuhan-Hu-1. So that's the first one, that's the reference genome. And this is the sequence, so the DNA sequence coding for the envelope protein. And then I just click proceed, so I just use the default options.
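If you would rather handle the downloaded file in a script than copy and paste it, a minimal Python sketch could look like this. The FASTA text below is a made-up toy record, not the real envelope gene sequence, and the helper names are just for this example. It reads a single-record FASTA string and converts the DNA letters to RNA letters, which is the form the folding is about.

```python
def read_fasta(text):
    """Return (header, sequence) for a FASTA string with a single record."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input does not look like FASTA")
    header = lines[0][1:]                    # drop the leading '>'
    sequence = "".join(lines[1:]).upper()    # join the wrapped sequence lines
    return header, sequence


def dna_to_rna(sequence):
    """Write a DNA sequence in RNA letters (T becomes U)."""
    return sequence.replace("T", "U")


# Toy record for illustration only, not a real gene.
example_fasta = """>toy_record placeholder sequence
ATGGCTGAATGC
CTGAAAGGTTAA
"""

header, dna = read_fasta(example_fasta)
rna = dna_to_rna(dna)
print(header)
print(rna)
```

For a real file you would read the text with `open(...).read()` first and pass it to `read_fasta` in the same way.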
And then what I got back was something which looks like this. So this is the structure of the RNA which is encoding the envelope protein of SARS-CoV-2. Here you can see that I have two different colorings. These are two slightly different prediction methods as well. So this is the MFE secondary structure, so the secondary structure based on the minimum free energy. And this is the secondary structure based on the centroid. The centroid is the structure that comes up most often, it's kind of the average structure. So it does the structure prediction, it predicts something like 10,000 different shapes, and then this is the shape which has the lowest free energy, and this is the shape which was predicted most of the time. You see the coloring. In this case, the coloring is based on the base pairing probability. So here you see that there's a high probability, because the scale is inverted here: one is high probability and zero is low probability. So you can see that there's a high probability that there are a couple of these spikes on the RNA molecule. And then on the other side, I colored it by the positional entropy. The positional entropy is how much wiggle room there is in the prediction. So you see that this part actually has no wiggle room. When you would look at it, these base pairs would probably really bind together, and then this loop would be there. But for this part, it is less clear. This part can be together, but sometimes it's also not together. That's why it has an average positional entropy. And then here you see that there's only one single little blue part. And this little blue part tells you that it's not clear where this is, because it could flex any way around. So there's a lot of flexibility in this RNA. So this is then how the secondary structure would look.
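One way to put a number on this "wiggle room" is a Shannon entropy per position, computed from the base pairing probabilities. The Python sketch below assumes exactly that definition (entropy in bits over "paired with partner j" and "unpaired"); the precise formula and scaling behind the web server's coloring may differ, and the probabilities in the example are made up for illustration.

```python
import math


def positional_entropy(pair_probs, length):
    """Shannon entropy (in bits) of the pairing state of each position.

    pair_probs maps (i, j) with i < j (0-based) to the probability that
    bases i and j are paired. Whatever probability is left over at a
    position is treated as the probability of being unpaired.
    """
    partners = [[] for _ in range(length)]
    for (i, j), p in pair_probs.items():
        partners[i].append(p)
        partners[j].append(p)

    entropies = []
    for probs in partners:
        unpaired = max(0.0, 1.0 - sum(probs))
        states = [p for p in probs + [unpaired] if p > 0]
        entropies.append(-sum(p * math.log2(p) for p in states))
    return entropies


# Made-up probabilities: bases 0 and 3 always pair, bases 1 and 2 pair
# in half of the predicted structures.
toy_probs = {(0, 3): 1.0, (1, 2): 0.5}
toy_entropy = positional_entropy(toy_probs, length=4)
print(toy_entropy)
```

A position that always does the same thing gets an entropy of zero, and a position that pairs only half of the time gets one bit, which matches the intuition of "certain" versus "could flex either way".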
And together with this secondary structure, you can go into a tertiary structure prediction. And then it will see how it can fold in 3D, and it will also do a prediction like: well, it might be that this part is, for example, very close to this part, because it could be wrapped together. For comparison, I showed you one of these tRNAs. So this is a standard example tRNA which I did. I think this is the one for valine. So you see that it gets predicted to be indeed in this four-leaf clover structure, or actually three clover leaves with a stem. And again, I colored it in the same way. And you see that this one is very certain to be attached together, because of the fact that the pairing here and the pairing there is almost perfect. So there are no mismatches. But you see that here, the pairing is less likely. It's still relatively high, there's no blue here. Blue would mean that the prediction is really bad. Green means that the prediction is relatively certain, and red means that it's very certain that this will occur. And here on the other side, you see the positional entropy again, and then you see indeed that every prediction that it makes puts these things together. And then you see also that here, these four are always very, very close together in a loop, but it's less clear here. There could be a little bit more wiggle room in how these things are folded. So it's a nice bioinformatics tool. They don't only provide these kinds of pictures. They also have dot plots and other metrics to tell you how good the structure prediction is. But in the end, it's the tertiary structure which makes the function of the RNA, and not so much the linear stretched-out sequence. All right, so the conclusion is that there might be some structure to the RNA encoding the envelope protein of the SARS-CoV-2 virus.
Of course, you have to be well aware that if you use these tools, they will look for structure, so they will find structure. That doesn't mean that this is really the structure, right? It doesn't really mean that it's like that. But you have to remember that, although when you look at images of RNA or of messenger RNA it is always drawn as a single long line, RNA is never a linear stretched-out fragment. It's never like half of this helix, which you see in many pictures. RNA functions because it has a certain structure. An RNA molecule can have a function and a catalytic activity, and generally it does have a very strong function and a very strong catalytic activity. And it's a 3D molecule, so it functions via a kind of lock-and-key system. So a certain molecule fits exactly in the hole that you see in the polymerase or in the ribosome. So that's why it is messenger RNA: messenger RNA works as it does because it has a certain structure. Catalytic RNAs or ribosomes have a certain function just because they are folded in a certain 3D structure. All right, so a word of advice: don't blindly trust any prediction without experimental confirmation. And in many of these tools, not in the RNAfold web server, but in many of the other prediction tools that are there for RNA structures, you can include some experimental constraints. So if you have measurements on which bases in the RNA molecule are relatively close together and which are relatively far apart, then you can have this flow into the prediction that you're going to do. And there's also a big database from the University of Leipzig, which has validated tRNA structures and other RNA structures. So they do tRNAs and microRNAs and long non-coding RNAs, and they have validated structures.
They also have some predictions in there, but more or less, if you want to know how your RNA molecule looks, how the 3D structure looks, then you can go to this web address, and the bioinformatics groups at the University of Leipzig have a really good database of validated tRNA structures and also other RNA structures. Some of them are predictions, but they mention it very clearly if it's a prediction or if it's a validated structure. All right, so a lot of things that we talked about today. I talked about the history, about DNA versus RNA, so what are the differences? The backbone is slightly different. You have a different base in there. RNA is often chemically modified, so the bases can be changed, like we saw with uracil and pseudouridine. We talked a little bit about mRNA expression, not a lot. We talked about RNA sequencing. I showed you the example of our Berlin muscle mouse, where it's very clear that this small deletion inside of the myostatin gene is the thing which is making these 866 mice so extremely muscular. We talked about the structure of RNA, about where you can get free mRNA data and free sequencing data, and about how to do a little bit of structure prediction using the RNAfold web server. So for me, that's all for today. Oh, I didn't include a question slide. So if you have any questions, just let me know, and it's time for you guys to do the assignments. I didn't put them on Moodle. I will do that directly. So if you want to start, let me directly do that actually. Moodle, and there, tick, turn editing on, add a topic, number four. Hello, I'm still doing the lecture show and I will upload the assignments for lecture four. So just a shout-out: who's still awake, who's sleeping? See, at the end I lost everyone. Everyone's sleeping already. All right, sleeping, a bit sleepy. Very good.
All right, so let me upload the assignments, and then I will upload the other stuff tomorrow, because I want to go home as well, because I'm also relatively tired. All right, lecture four. Okay, so the assignments are online. So anyone who's still fit and wants to do the assignments can start with the assignments. So, all right, Commando already has weekend. Very good, very good. All right, I will at least stop the re...