Everyone, if you're watching this on Moodle, or on YouTube in five years: like, subscribe, comment, and ask questions. There are probably dozens of questions, and RNA is a very difficult subject. We'll continue for another 35 to 40 minutes, and then we should be done and you will be experts in all the different types of RNA and how to measure them. I think many of you will have done a course where you extract RNA, convert it to cDNA, and then do a qPCR experiment. qPCR experiments are really small-scale experiments: you generally measure one or two genes, and you can measure them very accurately. If you look at a qPCR experiment, the output looks like this. You have cycles, which are the cycles of the PCR. In every round the available cDNA is amplified using primers, and at a certain point it becomes detectable. The earlier a curve comes up, the more RNA there was in the original sample. So you set a certain threshold, for example a threshold of 1.6, and as soon as the line crosses 1.6 you look at which cycle this happened in, and from that you can compute how much mRNA there was in your sample. However, this is always done in a relative way, because you always use a housekeeper. That is because when you are pipetting, you cannot pipette exact amounts. So in one well you pipette the gene of interest and in another well you put a housekeeper, and the housekeeper is used to standardize the measurement of the gene that you're interested in. So all values that you get from a qPCR experiment are relative expression values, which means that you know that your gene of interest was expressed at 50% of the housekeeper, or at only 25% of the housekeeper.
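As a rough illustration of the relative quantification just described, here is a minimal Python sketch. It assumes that the amount of cDNA exactly doubles in every cycle (real reactions are usually a bit less efficient), and the function name and the threshold-cycle values are made up for the example. Under that assumption, a gene that crosses the threshold one cycle later than the housekeeper is at 50% of the housekeeper, and two cycles later means 25%.

```python
def relative_expression(ct_gene: float, ct_housekeeper: float) -> float:
    """Expression of the gene of interest as a fraction of the housekeeper.

    Assumption: perfect doubling per PCR cycle, so crossing the threshold
    one cycle later means there was half as much starting material.
    """
    delta_ct = ct_gene - ct_housekeeper
    return 2.0 ** (-delta_ct)


# Hypothetical threshold cycles: the housekeeper crosses the threshold at cycle 23.
print(relative_expression(24.0, 23.0))  # one cycle later  -> 0.5
print(relative_expression(25.0, 23.0))  # two cycles later -> 0.25
```

The result is a ratio, not a molecule count, which is exactly why qPCR values are reported relative to the housekeeper.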
So it doesn't tell you exactly how many mRNA molecules there were, but you get an impression of how active a gene was relative to a housekeeper. Housekeeper genes are genes that are generally always on, so no matter which cell type you're looking at, the housekeeper is there. Those are generally genes for structural proteins, like proteins of the cytoskeleton: every cell has a cytoskeleton, so there always have to be proteins that build it. So for small-scale experiments qPCR is the way to go, because you can measure a single gene very accurately, you can measure it across a large number of samples, and it's relatively cheap compared to using microarrays or sequencing. If we then look at large-scale experiments, you have to use microarrays, which measure hundreds or thousands of genes at the same time, or RNA sequencing, which does the same thing but more or less untargeted: you get reads, these reads get mapped to the genome, and from that you get the expression of genes. We'll talk about it a little more, but in the end what we want to see is, for example, a picture like this, where we have gene expression for different samples: B6N, the reference mouse; BFMI, the Berlin fat mouse; and then maternal B6N and maternal BFMI mice, so mice which had a B6 as a mother or a BFMI as a mother. What we want to see is genes which are different between, for example, the fat mouse and the standard mouse. We are interested in which genes are, for example, highly expressed in the fat mouse and very lowly expressed in the B6N mouse, because these genes are targets for understanding how the mouse becomes fat, but also for finding substances that we might be able to give the fat mouse so that it doesn't become as fat. So you can use it for intervention, but you can also use it to study the default phenotype.
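To make "highly expressed in the fat mouse and lowly expressed in the B6N mouse" a bit more concrete, here is a small Python sketch that ranks genes by log2 fold change. This is only one simple way to compare two groups, not the analysis behind the picture; the gene names and expression values are invented, and the pseudocount of 1 is an assumption to avoid dividing by zero.

```python
import math


def log2_fold_change(case: float, reference: float, pseudocount: float = 1.0) -> float:
    """log2 ratio of case over reference; the pseudocount (an assumption) avoids division by zero."""
    return math.log2((case + pseudocount) / (reference + pseudocount))


# Invented expression values for three placeholder genes in the two strains.
expression = {
    "geneA": {"B6N": 10.0, "BFMI": 350.0},
    "geneB": {"B6N": 120.0, "BFMI": 115.0},
    "geneC": {"B6N": 400.0, "BFMI": 20.0},
}

# Rank genes from "much higher in BFMI" to "much lower in BFMI".
ranked = sorted(
    expression,
    key=lambda gene: log2_fold_change(expression[gene]["BFMI"], expression[gene]["B6N"]),
    reverse=True,
)
print(ranked)  # ['geneA', 'geneB', 'geneC']
```

A positive value means higher expression in BFMI, a negative value means higher expression in B6N, and values near zero mean little difference.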
So if we go back to qRT-PCR: it is very similar to normal PCR. We will have a lecture about the polymerase chain reaction, that will be lecture number eight, when we talk about primer design and PCR. The qPCR steps are very similar. You design your primers for the gene that you want to measure; you design primers for a housekeeping gene, which generally you just buy from a company because they're standard; then you do your PCR reaction with both simultaneously; and then you determine the relative expression of your gene of interest versus your housekeeper gene. So you get relative expression. If we talk about RNA sequencing (and we won't look too much into microarrays here, because we already talked about microarrays and they will come back), then RNA sequencing is again very similar to DNA sequencing, which we talked about last week. There's only an additional step, or actually three additional steps. The first additional step is that you have to convert your RNA into cDNA. This is done using reverse transcriptase, which is an enzyme that takes an RNA molecule and produces a DNA molecule with the complementary sequence. In the next step you make your cDNA, which is single-stranded, into double-stranded cDNA, and then you add single-stranded tails. These tails are then used in DNA sequencing to identify, for example, your sample, because you can sequence multiple different samples at the same time. But in the end, after these three steps, it is just DNA sequencing like we discussed last time.
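The "complementary sequence" that reverse transcriptase produces can be sketched in a few lines of Python. This is only a toy model on strings, with a made-up function name: it pairs A with T, U with A, G with C, and C with G, and reverses the result so that the cDNA is also written 5' to 3'.

```python
# Base pairing between an RNA template and the DNA strand copied from it.
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(rna: str) -> str:
    """Return the first-strand cDNA (written 5'->3') for an RNA written 5'->3'."""
    # The new strand is antiparallel to the template, hence the reversal.
    return "".join(COMPLEMENT[base] for base in reversed(rna.upper()))


print(first_strand_cdna("AUGGCC"))  # GGCCAT
```

Any character other than A, U, G, or C raises a `KeyError`, which is a deliberate simplification for this sketch.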
So again, here the input and output are just text format files which you can read. The raw data is stored in FASTQ files, and the aligned data is stored in files which are called SAM files. Here is the description of how the input FASTQ file looks. This is the file that you get from the sequencer, and it has four lines for every read. The first line starts with an @ character and is followed by a sequence identifier and an optional description. The second line has the raw sequence letters, so the A, C, T, and G that were determined by the sequencer. The third line is a plus character, which is optionally followed by the same sequence identifier as the @ line, so it's just a repeat, but almost always the rest of this line is empty. If you look at this FASTQ file here, which is from one of our own experiments, you first see an @, then you see the name of the read, so this is HWI-ST225, then you see the sequence, then you see the plus, and then you see the encoding of the quality. The quality is encoded in single characters, so here this # represents the quality of the N base that was read, and of course an N is a very, very bad base. Then we see, for example, a C which has a quality encoding of 0, and then another C which has a quality encoding of 6. So the fourth line is the quality value of the sequence in line two. It must contain the same number of symbols as the number of bases in the read, so every base in the read needs to have a quality value assigned to it. In RNA-seq, reads are aligned to the reference genome, which is based on sequence similarity, allowing for mismatches, just like DNA
sequencing. RNA sequencing allows us to investigate the quality, meaning which genes are transcribed, but also the quantity, meaning how much of a certain transcript is expressed. So quality is which gene is expressed, and quantity is how much it is expressed. But we can also look at things like single nucleotide polymorphisms, and at insertions and deletions in the mRNA, which are for example caused by RNA editing done by snoRNAs, because a U can be converted into a Ψ (pseudouridine). We can also see whether genetic SNPs in the genome are actually transferred into the RNA or whether they are modified, because it sometimes happens that the genomic DNA is not a one-to-one match with the messenger RNA that is produced, due to post-transcriptional modifications guided by small nucleolar RNAs. The RNA-seq analysis is very simple. We acquire our samples and extract the RNA. We then do this additional step where we convert our RNA into DNA, and then we just do DNA sequencing, which gives us a FASTQ file. We trim our reads, which again gives us a new FASTQ file, now with the bad bases cut off. Like last time, we do alignment, duplicate handling, indel realignment, and then base recalibration. The only difference in the bioinformatics analysis comes at the end: we are not looking at single nucleotide polymorphisms, but we are generally looking at the expression level of a gene, which means looking at a gene and counting how many reads there were for this gene, then looking at the next gene and counting how many reads there were for that gene. So how do we visualize RNA-seq data? It's exactly the same as for DNA sequencing data: we look at RNA sequencing data using IGV, which we already talked about last week. I just wanted to show you an RNA-seq example which was done in our lab a couple of years ago, and here we see
the kind of brother mouse of the Berlin fat mouse, and this is called the Berlin muscle mouse. So we don't only have fat mice, we also have very muscular mice. You can see that this is the standard reference mouse, the Black 6, and you also see the 866, which is the very muscular mouse. We also have different types of muscle mice: we have this 866, which is the beefiest mouse, but we also have mice which have more muscle than a standard mouse but not as much as the big Berlin muscle mouse. Here we see, for example, how RNA-seq data looks. On the top you see the genome that we're looking at; on the bottom you see that we're looking at a gene called myostatin, which is a very well-known regulator of muscle growth. Here you see 806, 816, and 866, so the different muscle mouse lines, from the moderately muscular ones to the really muscular 866, and here you see these three again. What you can see in the top three lines is the SNPs. You see here that, for example, the 866 mouse has a SNP in the second exon of the myostatin gene. Actually it's not a SNP, it's a small deletion, because you can see that there's a little hole, so the muscle mouse doesn't have this part of the exon. The big difference between RNA sequencing and DNA sequencing is very obvious from this picture, because you can see that the reads that we get are all exactly on top of the exons, and we get no reads in the introns. That is because we're sequencing messenger RNA which has been fully processed, so the introns are not in there anymore. So when we do sequencing, we get reads which align exactly onto the exons, and we find no reads in the introns. And of course we can see here that the 806 and the 816 have some mutations, three SNPs and a small deletion, which you can see from the little dot here in the
sequencing reads. So you can see that there are four mutations in the third exon of myostatin, and these are in the UTR. The box here at the bottom, just as an explanation: you see the 5' UTR, which is the small part, and then you see the part which is coding, so this is the coding sequence. Exon one has a small 5' UTR and then the coding sequence; the second exon is just coding sequence; and the third exon has a very small part of the coding sequence and then a very long tail which is not translated into protein. Of course, when we look at this we can say: that is interesting. If we zoom in on exon number two, we see that the really muscular mouse has a mutation, a deletion, in this part of the gene, which means that the myostatin protein produced by the 866 mouse is different from that of the 816 mouse and the 806 mouse, because a couple of amino acids are missing here due to the deletion in the genome. You can imagine that if there are nine base pairs deleted, then the protein produced in this mouse would be three amino acids shorter. So when we look very closely, we see this deletion, and from this deletion you get the idea that this myostatin gene, which is kind of a brake on muscle growth, might not be active in the 866 mouse. When we look at the other two mice, we see that they have the same protein, so they have a functioning brake, but they have mutations at the end, in this untranslated region, and this untranslated region is of course very important in regulating how much of the brake is produced. So from this we can learn that the 866, the very muscular mouse, has a non-functional brake, so it will just continuously grow muscles without having
feedback, without having this brake which stops muscle growth, while the other two mouse strains do not have this deletion, so they have a functional braking protein. They have a functional stop, but their stop probably has some expression differences: it's probably expressed at a lower level compared to the reference mouse strain, and we can see that from the mutations here, which are in the 3' UTR of this myostatin gene. So we can clearly see that the phenotype of these mice corresponds to what we see in the sequencing data: one of them has a broken brake, and the other two have a difference in expression caused by these four mutations at the end. All right, so microarrays. I just want to repeat a little bit. You have two different types of microarrays. One of them is the one-color microarray, which allows you to compare across different studies, and you can compensate for batch effects very easily using one-color microarrays, but you need two times the number of microarrays, because it's only one color: you have one microarray and you can put one sample on it. Often we use two-color microarrays when we're doing a kind of case-control study. If we have a study where we're interested in cancer tissue and normal tissue, then we use two-color microarrays: we color the cancer tissue, for example, red and the normal tissue green, we put both of them on the same microarray, and then we get the relative abundance. So we can see if a gene is more highly expressed in the cancer tissue or in the normal tissue. This is relative abundance again, just like in qPCR. But of course the problem with two-color microarrays is that if your healthy tissue is not really healthy, then you sometimes get really weird signals. I just wanted to remind you that microarrays come in two different forms. One of them is the one-color microarray, which kind of give
you an absolute quantification, so it tells you that there was an X amount of mRNA, while the two-color microarrays give you a relative measure, kind of like with the housekeeper in qPCR. So they are slightly different. We also have microarrays which are not only targeting the exons: nowadays we have microarrays called tiling arrays. These are microarrays which literally contain millions and millions of probes, and the probes are more or less tiled across the genomic DNA, so we are targeting every part of the genome with a probe. That means we have a very high resolution, because we can measure about every 50 base pairs of the DNA, and we have probes for introns, so we can see if an intron is, for example, being expressed by accident because splicing is not working correctly. So we can look at splice variants and these kinds of things using tiling microarrays. Generally they cover both strands of the DNA, so you have one tile which targets the forward strand and another tile which targets the reverse strand. The big issue of tiling arrays compared to normal microarrays, which only target known exons, is that tiling arrays have a bunch of non-functional probes. If you have a probe which is in an intron, 99 percent of the time this intron will not be there, because it's spliced out correctly, and only sometimes will you find that one of these probes gives a signal, so you know that in this animal the intron is retained, which means that there's probably a problem with the protein production in this animal. We already saw this slide before. When we do microarrays, we have to create the arrays, which is done using a tbt file format. We acquire our samples, we extract RNA, we do the reverse transcription, we do some PCR, which is an optional step, and we label them with Cy3 and Cy5, or with Cy3 or Cy5 if we're having a
one-color microarray, and then everything is just like we talked about before. We do hybridization, we scan it, and we get a really nice big TIFF image, which I showed you in lecture one. We store our data in CEL format, just because it's much smaller than the original TIFF image, and then all of the other steps are done in standard text formats, which is what we like in bioinformatics, because you can just open the file and look at it in a text editor. When we talk about microarrays, I always have to mention MIAME, which is the Minimum Information About a Microarray Experiment. This is intended to specify all of the information necessary to interpret the results of the experiment unambiguously and to potentially reproduce the experiment. When we start looking into the different databases which contain microarray data, they actually use this MIAME format to describe the sample: what type of animal it is, how old it was, which tissue was taken, what kind of hybridization protocol was used, and what kind of microarray. So there is a format which you need to use when you want to send microarray data from one group to another, so that people can reproduce your microarray experiment. All right, now we're switching to more bioinformatics material, because there is a lot of free microarray data available. There are two main repositories where you can get free microarray data, and this is really useful. Imagine you are a student and you want to do some gene expression analysis, and you're really interested in, for example, lung cancer. You say: I really want to do some analysis on lung cancer, but I don't have any money, I don't have any funding, and I'm still interested in figuring out if there are genes which are, for example, highly expressed in lung cancer tissue and might be useful as targets for medicine. Or I just want to know what the
difference is between a fat mouse and a lean mouse. You can get all kinds of free microarray data from all kinds of different tissues. One of the main databases that you can use is the Gene Expression Omnibus, which is from the NCBI, the National Center for Biotechnology Information. This database contains more than 25,000 experiments, and you can get around 600,000 microarray datasets for free. They only provide storage and retrieval, so anyone can put their data there, and anyone can download the data and use it in their own analysis. This is one of the bigger databases. The other, competing database is ArrayExpress, which is run by the European Bioinformatics Institute, also called the EBI. This is an archive which contains more than 24,000 experiments and around 700,000 arrays, but they also have something called the Gene Expression Atlas. This is a subset of all of the data that they have collected over the years which has been curated and re-annotated by people, which means that it is a very valuable core set. People went through all of the microarray data that was submitted, looked at the data, and checked whether it really is the sample that the submitters said it was: whether it's the exact mouse that they said it was, or the same human, or the same plant. So the core set of the Gene Expression Atlas is a very valuable resource. Just to give you an idea of how valuable: if you want to buy a single microarray, this will cost you in the order of 50 to 200 dollars. So when we are talking about a dataset of 700,000 microarrays, this is 700,000 times, on average, 125 dollars, so that is close to 90 million dollars' worth of data which is just available for free. It is of course amazing that you can get your hands on a project of that size without having to spend a cent. ArrayExpress also allows you to store and
retrieve data, but it also provides tools to do analysis. It has online tools, so you can just go to the website, compare different microarrays, do clustering, and create pictures like the one I showed you here; pictures like this you can create online on their website. All right, I just wanted to show you the website, so let's go to the Gene Expression Omnibus first and just search for some data, so that you know how to do that. Let me get my window open here. So, Gene Expression Omnibus, there it is. The Gene Expression Omnibus is a relatively easy website: you just have a search. You can see here that the number of samples that they have is 4.7 million, there are 22,000 different types of microarrays, there are 164,000 series in there, and in total there are about 4,000 datasets. You can just type in any keyword that you're interested in. So you can say: I'm interested in, for example, lung cancer. Lung cancer is just a search term; you press search, and it takes a little while. Here we see, for example, the Affymetrix Human Genome array. This is human data, so the organism is Homo sapiens, and you can see that there are 602 datasets, 5,000 series, 61 related platforms, and this many samples. If we want to drill down a little bit and only see humans, we can click here to filter the list. These are the whole datasets that are there, but if we go down there is, for example, a paper where the authors put their data online. It says that blood tests using serum microRNAs can discriminate lung cancer from non-cancer. And how did they do that? Well, they have about 4,000 microarrays, so 4,000 humans which were measured on microarrays, and you can just get the
data. You can just click on the link, go to the study that they did, and here you see the MIAME data that they provided. MIAME tells you that there has to be a title, you have to specify the organism and the type of experiment, you have to give a little summary, and you have to give a little overall design, like "circulating microRNAs of 3,924 samples". Then you describe how many lung cancers there were: about 1,600 before the operation, 180 after the operation, and about 1,700 control samples. When you look at it, you can actually see that: you can just click "more", and these are all of the samples, and each of these links will give you a single microarray dataset. We can click on one of them, and you can see here the sample identifier, you can see that this was done by a Japanese group, and this is the platform, so you can get the annotation for the microarray. Here you see the hybridization protocol and all of the data that you have to provide for MIAME. And here you can see all of the different probes. The probe names are of course not gene names, because a microarray probe targets a certain area in the genome. Here you see all of the values, so the expression values. The full table is very small, only 62 kilobytes, and you can just get the original file from the FTP server and download it, and then that's one sample. Of course, if you want to redo the experiment, you download all of the samples in one go, and then you can load them into R and do things like clustering or normalization. During the assignments we will use some microarray data, not from GEO but from our own group, so that you can look a little bit at how you can analyze microarray data and what you can do with it. So this is, for example, a non-cancer control; it was done from a
female who was 56 years old, and it is serum, so it's just blood from which they extracted RNA. So GEO is a very useful site, and a lot of people that I know who do bioinformatics use GEO a lot to get samples when they don't have money to do their own microarrays. If you are just interested in how highly a certain gene is expressed in, for example, brain, then you can do two things. You can write a funding application, get the money, do the microarrays yourself, and then do the analysis to figure out if the gene is highly or lowly expressed. But in many cases the first thing that you would do is go to GEO and see if there are samples which you might be able to use. A lot of people that I know actually got really high-impact publications from just using freely available datasets, for example from GEO. This is the old dataset browser, which is what you see when you look at the curated part of GEO, the Gene Expression Omnibus. You can see the organism, Homo sapiens or mouse, you can see which platform was used, because you have different types of microarrays, and then the series to which it belongs, which is the kind of experiment. Here you see, for example, "acute dengue patients, whole blood". These were patients that were infected with dengue, so they took blood samples, and they have 56 blood samples which they put on a microarray. So if you're interested in how dengue influences blood, you can just look into that, and of course you can also see their cluster analysis already. All right, one of the other things, alongside the Gene Expression Atlas, is BioGPS, and that will allow you to look at where a gene is expressed, so in which tissue is a
gene expressed. This you can do via BioGPS. Oh yeah, sorry: why do I show you the Gene Expression Omnibus? I forgot to put in a slide for ArrayExpress, but the Gene Expression Omnibus is one of the databases. So you have two different databases, but there's some overlap, and some samples are in both databases as well. If you're interested in the expression levels of a certain gene, you can also very quickly look at BioGPS, and BioGPS allows you to look at tissue-specific patterns. What they did is they downloaded all of the data from GEO and from the EBI, so from the two databases, and instead of looking at the microarrays, they looked at single genes across microarrays done in different tissues. So you can, for example, see if your favorite gene of interest is expressed in brain, or in fat, or in heart, or in lungs, and you can see this also in different individuals and different animals. We can take a quick look at the website, so let's open up Firefox. This is how it looks, a very basic site. For example, our favorite gene of interest here in our group is Bbs7, which is a gene that we found to be different in the Berlin fat mouse. If we look for it, we can see that indeed it has been measured in mouse, and it's also been measured in humans and in rats. If we just go to Bbs7, we get this overview of all the different tissues. We can see that this gene is actually highly expressed in, for example, the retina, so in the back of your eyes; it is highly expressed in testis; and it is highly expressed in others, like NIH 3T3 cells. It is also relatively highly expressed in osteoblasts. So you can see very quickly where a gene is expressed. If you want to search, for example, for myostatin, the gene which we just looked at and which controls muscle growth, then myostatin has also been measured in mouse, so we can
just click on it, and then... oh no, that's sad. So let's look at it in humans. In humans we can see that myostatin is indeed very highly expressed in cardiac myocytes, so in the heart; we can see that it's highly expressed in the superior cervical ganglion; but it is also highly expressed in skeletal muscle, which makes it a good candidate for muscle growth and muscle inhibition. That's just some of the information that you can get out; of course the website provides much more. You can also drill down and look at which experiments were done. One of the things is that myostatin apparently is also expressed in the adrenal cortex, so in the adrenal gland. No idea why it would be expressed there. But it's a really good website if you want to get a very quick overview of which tissue your gene is expressed in. So if you ever have to write, for example, a PhD thesis, and your supervisor is a big fan of FAT2, then the first thing that you do is go to BioGPS, fill in FAT2, and then you can see where this gene is expressed: in brain tissue, or in lungs, or in the heart, or anywhere. Not only do we have free microarray data, and free microarray data measured across different tissues, we also have the Sequence Read Archive of the NCBI, and this is the main storage database used in bioinformatics for all sequencing data. If in the future you ever have to do a sequencing run and you generate some sequencing data, then when you want to publish your paper, the journal will say that you have to put this data somewhere so that people can access it and redo your analysis. So sequencing data is stored in the Sequence Read Archive. It's a massive website, and they have literally petabytes of sequencing data, and again, almost all of it is freely available; some of it is still under embargo. But if you're interested in, for example, COVID,
and you want to know everything about coronavirus and the different variants, like which mutations there are in the delta variant versus the alpha variant, then you can get this information from the Sequence Read Archive. What they do is provide datasets of sequencing runs done in humans, in mice, in plants, and not only DNA sequencing: they also provide RNA sequencing. So depending on whether you want to look at the DNA level or at the RNA expression level, you just go to the SRA, you search for whatever you're interested in, and they will show you what is available. Of course this is all to prevent people from spending money again and again, measuring the same thing every time. If you've measured something, then you measure it once and you put it in the database, so that other people can use it. So how does it look? If you look, for example, for free sequencing data on the SRA, you can say: I'm interested in COVID-19. You press search, and then it tells you that in total there have been 142,000 COVID-19 samples sequenced and deposited in the Sequence Read Archive. This of course is an insane amount of money, especially if you realize that every sequencing run for a human genome is around 1,000 to 1,500 euros. Of course, for something like a virus it's much cheaper, because the virus genome is much shorter, so you don't have to produce as many reads. But again, there is literally billions of dollars' worth of data available for free, for you to download, look at, and do your own analysis on. All right, last part of the lecture. I told you about ribosomes, and that RNA has catalytic activity as well. When we talk about the structure of RNA, we always talk about the primary structure, and the primary structure is just the A, U, C, and Gs, so the order in which they occur. But of course, to understand their function, you generally
want to look at the secondary structure or the tertiary structure, because RNA molecules fold back on themselves and they function like a lock-and-key system. So the structure of the RNA molecule determines how active it is and what it does. Generally people talk about secondary structure, and secondary structure is more or less a picture like this, like I showed you with the anticodon arm. When we talk about tertiary structure we are actually talking about the 3D structure. So it's not a flat representation of which bases pair with each other; it is more or less the 3D structure that I showed you in my 3D viewer as well. So this is the thing: the tRNA looks like a clover leaf when you look at its secondary structure, but if you look at the tertiary structure you see that it doesn't really look like a clover leaf. It looks more like this twisted-up thing which has the amino acid here. So the amino acid is here, and then you have the T arm, you have the D arm, and then you have the anticodon loop, which is here. So it looks very different in 3D than it does in 2D, and 2D looks different again from 1D, which is just all the bases one after the other. So of course one of the goals of bioinformatics is to make tools to predict the secondary and tertiary structure of RNA, because the structure determines what it does. If you know the structure then you can kind of infer the function, and you cannot really infer this function when you just look at the As, Cs, Ts and Gs. So the main goal is to take the sequence, either DNA or RNA, and create a highly probable, annotated group of secondary structures. This is generally done by looking for low energy, so we look at the free energy of the structure, because the molecule is always in water, so it folds upon itself and reaches a certain equilibrium, which is the lowest energy state. And then what we want to
do is predict a secondary structure, and that can be done using different web servers. There are a lot of web servers available which allow you to just input your primary structure, or your primary sequence, and then do a prediction of a secondary structure. The server will then annotate the secondary structure, saying that, well, this part of your RNA molecule might be binding DNA, this part might be protein binding. So you can kind of get an idea of what your RNA molecule is doing by just inputting it into one of these web tools and seeing which different domains the web tool says there are. So this modular system can be exploited in a bioinformatics analysis to kind of predict ab initio what your RNA molecule might be doing: it might be binding proteins, it might be doing conformational switches, and so on. So I just wanted to show you the RNAfold web server, because it's actually a pretty easy website to use. Of course this is how it looks. Let me check if it still looks like this... yeah, it still looks exactly like that. So it's a very basic two-step process. You just paste your sequence, or you upload a FASTA file if your sequence is too big. You can choose different algorithms, so you can say, well, compute the minimum free energy and the partition function, or compute only the minimum free energy. That's not that important; if you want to learn more about how this works, they have a help file which actually explains all of the different algorithms and how they work. What is important is that generally, if you do long non-coding RNAs, this takes a while, so fill in your email address and you will get a mail when the prediction is done, instead of having to wait on the website for a couple of hours until it finishes the prediction. So RNAfold is one of these web servers which takes the primary RNA sequence and then predicts the secondary or tertiary structure, and in this case it's the secondary structure
that it predicts. So I just wanted you guys to try it out. When we think about SARS, the SARS-CoV-2 virus has an envelope protein, and the protein of course works because it has a certain structure. But what about the RNA coding for this envelope protein, so the RNA of the virus that our own ribosomes use to produce this protein: does this RNA have a certain structure? That might be interesting, because if this RNA is catalytically active, it might be that the RNA is actually doing the damage and not the protein. Inside the cell the RNA might be binding all kinds of important messenger RNAs and degrading them, or it might be causing damage to other proteins. So what we do is we just go to NCBI, and we go to the Gene database because we want to get the RNA sequence of the SARS-CoV-2 envelope protein. I chose the envelope protein because it's the smallest one. You could do the same thing for the spike protein, which everyone always talks about, but I think that the envelope protein just works a lot better. So let's just click the link and fill it in. I actually think that I already did that... no, I didn't do that. So let me show you guys my Firefox window, and then we go. So we say SARS-CoV-2 envelope protein. It takes a little while; it's a big database, so it searches through the whole database, and then... yeah, all right, there it is. So when we look at the output, it looks like this. We see here that indeed we have the envelope protein, and we see the spike, we see the nucleocapsid, we have the matrix, so the membrane glycoprotein, and we have the small envelope. So let's just take the first one. When we take the first one we click here, and it will take a little while, and then it will give us some information. It will show where it is encoded in
the virus, so you see that here is the spike protein, and then we have an open reading frame, so that codes for a protein that we don't know yet. We see the envelope, then we see the other protein which also goes into the membrane, and then we see that there's another open reading frame, so another region which codes for some protein. Here we can see the envelope itself. It also does a homopentamer prediction here, which we don't care about; we just want to get the sequence so we can do the structure prediction. Getting the sequence just means clicking FASTA, because FASTA is the format in which we transfer DNA or RNA sequence from one site to the other. So we just click the FASTA link, it takes a little while, and then here we see the sequence. We can just take it, and we are going to take the description line as well. So we're just going to copy it, we go to the RNAfold website, we plug it in here, and then we just say proceed. And then it starts and does the prediction of the secondary structure. This will take a while and I didn't fill in my email address, but fortunately it's actually doing it quite fast. Oh okay, it's already done, or is it not? So what it will give you is a whole bunch of graphical output, but it will tell you that the RNA structure looks more or less like this. You can see that it starts and it goes into kind of a circle with all of these hairpins, and then the color tells you how likely it is that this is actually the structure. Here the scale runs from zero to one, with zero being very unlikely and one being highly likely that the bases bind to each other. So here we can look at the base pair probabilities; we can also look at the positional entropy. Base pair probability means how likely two bases are to bind each other. The positional entropy looks a little bit different, because this is the
measure of how uncertain the pairing of each position is, which is different from the probability of one specific pair. And you see that when we look at the positional entropy we already get more certainty, but if we look at the base pair probability, the only thing that the prediction is really certain of is that this part of the RNA kind of looks like this, and it has no real idea about the rest of the structure. So here is the one, and then we have a slightly different prediction as well, because this is the MFE secondary structure and then we have the centroid, so those are different prediction methodologies. And then we can also see the mountain plot, so this is the thermodynamic structure. So there's a lot of information, and if you want to know exactly how everything works you can read the paper here, and they will tell you exactly how they do their prediction and what the output exactly means. But I just wanted to show you that it's relatively easy to do: we search, we get the sequence, we do the prediction, and then we get our results. And so you can see that indeed the base pair probability is relatively low, but that it's relatively certain about the positional entropy, so about where the different parts are located when you would put it in water. Now if we look at a tRNA prediction: a tRNA has a very specific structure, because it needs this structure to function, and you can see that here the prediction is very certain about the structure. The only thing that it is not really certain about is this part of the RNA molecule where the amino acid is attached, but it is very certain about the clover leaf structure that comes out. Whereas for the envelope RNA you would say there's not a lot of structure; the algorithm is not really able to determine what's going on and what this RNA molecule would look like when you look at the secondary structure, while here it is really certain about the
secondary structure. So the conclusion, if you look at it, would be that there might be some structure to the RNA encoding the envelope protein. But be aware, with all of these structure prediction tools, and we will get back to this when we talk about proteins, that because you are asking the tool "give me a structure", it will find a structure. That is also why it uses this color coding, to indicate how certain it is about the structure. But of course, since you are asking it for a structure, it will give you a structure, because that's how these tools work. And so remember that RNA, whether messenger RNA or ribosomal RNA, is never a linear stretched-out fragment like it is drawn in the textbooks. RNA functions because of its structure, and because of its structure it can have a catalytic activity. These structures actually come back in multiple RNA molecules, and they have the same function when they have the same structure, even if they are completely different RNA molecules. So it is a 3D molecule, and RNA functions like a lock and key: because it has a structure it fits into the lock and it can turn, because it's a key and it fits. So the structure is the thing that matters and not so much the sequence. Of course the sequence determines the structure in a way, but the thing that it does, it can only do because it has a certain structure. So a word of advice and then we're done: don't blindly trust any prediction without experimental confirmation. There is actually a really nice website from the University of Leipzig which has all of these tRNAs, so all of the 26 tRNAs which are available in humans and in mice and in other animals, and it actually has validated structures for them. So it's not just prediction; there are also people who measure RNA structure using X-ray crystallography, and so there's also a lot of validated
data out there. So don't blindly trust the prediction; always see if there's any experimental confirmation. And the University of Leipzig doesn't only do tRNAs, they also have other RNA molecules. All right, so that's it for today. I told you a little bit about the history of RNA. I told you about DNA versus RNA, so the five main differences, such as double-stranded versus single-stranded and the bases. We talked about the different types of RNA, which took up a large part of the lecture. So be aware that hnRNA is immature messenger RNA, that mRNA is mature messenger RNA, and that snRNA is different from snoRNA: snRNA does splicing, and snoRNA, small nucleolar RNA, is involved in the post-transcriptional modification of other RNAs. So there are all types of different RNA, and you need to know what each type does. We talked a little bit again about mRNA expression, that you can use microarrays, next-generation sequencing and qPCR to measure the expression of genes. We didn't really go into detail about the bioinformatics analysis, but that will be part of the assignment. We talked about RNA sequencing a little bit, that it's actually the same as DNA sequencing with two additional steps, making your RNA into DNA and then just doing DNA sequencing. I also showed you one of these examples of how you can use IGV to see what is different between three different mouse strains: one mouse strain has a deletion in the coding region, making the protein non-functional, while the other two mouse strains have mutations in the 5' untranslated region, which then regulate the expression of the gene. So they still make a functional protein, but they don't make enough of it. I told you as well about free microarray data and about sequencing data, which you can get from GEO or the Sequence Read Archive. And I told you guys about RNA structure prediction, so how you can predict the structure of RNA using RNAfold.
And of course these are only single tools. There are literally probably 50 to 100 different tools which allow you to do RNA structure prediction, both secondary and tertiary structure, and all of these tools have their advantages and drawbacks. But I want to stress: never blindly trust a prediction. A prediction is only an indication that this might be the structure, so always see if there's validation available, meaning that people did real experiments like X-ray crystallography to make sure that the structure that you get is real. All right, so not a very long lecture, but I think a very difficult lecture in a way, because there are all of these different types of RNA and these kinds of things. So a tipper, thank you for following, thank you, thank you. All right, so that's it for today. I will stop the recording and then we're off. So people on YouTube, see you next time, people on Moodle, see you next time, and bye bye.
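For anyone who wants to play with the ideas from the structure prediction demo offline, here is a minimal Python sketch. It is only an illustration under stated assumptions, not how the web server itself works: the FASTA record is an invented toy hairpin (not the real envelope sequence), the function names are hypothetical, the dot-bracket string and the pairing probabilities are made up, and the positional entropy uses the usual Shannon-entropy definition over pairing states, which is an assumption about what the server plots.

```python
import math


def read_fasta(text):
    """Parse FASTA text into a dict of {description: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]  # description line without the '>'
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()
    return records


def to_rna(seq):
    """Rewrite a DNA-alphabet sequence in the RNA alphabet (T becomes U)."""
    return seq.replace("T", "U")


def pairs_from_dot_bracket(structure):
    """Turn dot-bracket notation into a sorted list of 0-based (i, j) pairs."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            pairs.append((stack.pop(), i))
    return sorted(pairs)


def positional_entropy(pair_probs, length):
    """Shannon entropy (in bits) of the pairing state of every position.

    pair_probs maps (i, j) to the probability that base i pairs with base j.
    Whatever probability is left over at a position counts as 'unpaired'.
    """
    per_position = [[] for _ in range(length)]
    for (i, j), p in pair_probs.items():
        per_position[i].append(p)
        per_position[j].append(p)
    entropies = []
    for probs in per_position:
        unpaired = max(0.0, 1.0 - sum(probs))
        states = probs + [unpaired]
        entropies.append(sum(-p * math.log2(p) for p in states if p > 0))
    return entropies


# Toy FASTA record, invented for illustration only.
fasta_text = ">toy_hairpin made-up example\nGGGAAA\nTCCC\n"
records = read_fasta(fasta_text)
rna = to_rna(records["toy_hairpin made-up example"])

# Hypothetical predicted secondary structure in dot-bracket notation.
structure = "(((....)))"
pairs = pairs_from_dot_bracket(structure)

# Hypothetical pairing probabilities: one certain pair and one 50/50 pair.
entropy = positional_entropy({(0, 9): 1.0, (1, 8): 0.5}, len(rna))

print(rna)
print(pairs)
print([round(e, 2) for e in entropy])
```

The point of the toy numbers: a position whose pairing is certain has an entropy of zero, while a position that is paired half of the time has an entropy of one bit. That is the same reading as the color coding in the demo, where low entropy means the prediction is well defined at that position.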