Welcome everyone to lecture number 12, I would say, from memory. Today: bioinformatics, gene expression analysis. I'm very excited about it; it's always fun to talk about gene expression analysis. We do a lot of it in our group, so it's something that I'm pretty familiar with. I'm just going to give you a very cursory overview. If you want to know more, put your questions in chat, or if you're watching this later on YouTube, put them in the comments. There are a lot of little edges to gene expression analysis, and no analysis is the same, so if you're working with gene expression data and you think, perhaps this guy could help me, let me know.
So for today we will be talking about gene expression, the questions that come up when you do gene expression analysis, and why you want to measure gene expression in the first place. I will be talking a lot about microarrays. I'm not going to talk a lot about gene expression by qPCR or by RNA sequencing. There are three major ways to measure gene expression, but I want to focus on microarrays since that's the technique that I have a lot of experience with. It's not the latest and greatest or the hippest thing to do, but I think it's still one of the best ways to get an overview of gene expression. We'll be talking about normalization techniques, why they are important, and which ones are normally used in gene expression analysis. Then of course statistical analysis: how do we test if a gene is differentially expressed, and how do we deal with things like multiple testing? And then I want to go a step further and show you what you generally do after gene expression analysis, things like gene ontology or pathway analysis.
We already talked a little bit about pathway analysis in the metabolite lecture, I think, when we did KEGG, but gene ontology is one of these topics that you should always touch on when you do gene expression analysis. Of course it's not limited to just gene expression. Besides that, I wanted to talk about common visualizations of gene expression data, so first some of the more novel visualizations like heat maps and dendrograms. I will explain how to more or less make a dendrogram by hand, so we'll be talking about hierarchical clustering and the different ways of doing it. And then I also wanted to show you some of the historical plots, like the MA plot and the volcano plot, although the volcano plot is not really a historical plot; it's more a kind of plot that people still use nowadays. And then at the end of the lecture I will show you how you can get a bunch of free microarray data, so that you don't have to ask for hundreds of thousands of euros in funding to do your own experiment. There are databases out there which allow you to download tons and tons of microarray data for free.
Good, but before all of that: the exam. I'm very sorry, guys, but the exam is coming. It should be in Agnes, so if anyone already registered, please let me know if it worked, and if someone has access, see if it's in Agnes. I think I have 10 spots reserved, and I think that should be enough for all of the people. Xanakin says yes, it's there, already registered. All right, very good. So then you took one of the 10 available spots. If we get more than 10 people signing up, so if you try to register but it says this date is already full, send me an email as soon as possible. Good, so it's there. So yeah, study hard.
I have to do the exam orally, so I'm still a little bit unsure if I will do it in person or if I will invite you via Zoom, but for me it doesn't really matter. I can sit in my office and you can sit three meters away, because of Omicron, which is much more contagious, so we just up the distance a little bit, and then I can ask you some questions. Answer all questions correctly and you get a one. Don't answer any questions and you get a five or a six; I don't know exactly how the German system works, but something like that. So register now. The exam will be on the 17th at 2 p.m. If anyone has a scheduling conflict, so if you're not available on the 17th at 2 p.m., also send me an email so that I can figure out how to do this.
All right, with that out of the way, let's go to the solutions of the previous assignments. I looked into the assignments and I saw that I gave you a wrong link. I'm very sorry for that. It turned out that the link that I gave you for ClustalW was old. I hope everyone was able to just Google ClustalW and click on the first link, but it was a little bit silly of me that I didn't check the link in the document that I gave you. I hope everyone was still able to do the assignments.
So the assignments for lecture 11 begin with using ClustalW. The idea was to go to a website that runs ClustalW. I gave you a bunch of sequences at the end of the document, and question number one is: to which sequence is our query sequence most related? Let me show you a Firefox window. I think I already have ClustalW open. Yes, I do. I'm using the one from genome.jp, which is in Japan. The old link was in Europe, but that's not available anymore, like I told you. All right, so: first, don't fiddle with the options and use the default settings, upload the sequences from the additional section below, and run the analysis.
This will only take a few moments to complete, so I hope that will again be the case; you never know with free databases. So these were the sequences: there was a query sequence, then a beta-lactamase precursor, a beta-lactamase from Bacillus, a beta-lactamase from another bacterium, and then a dehydrase. So these are of course genes. Here there's actually a space missing, so let me fill that in, and let me put a space for query as well and write it with a capital letter. And then I'm just going to say Execute Multiple Alignment. Don't fiddle with the settings; just see what happens when we run it using the standard settings. All right, it seems to be running. Oh, it already finished. That's really quick, actually.
So here we see the output from ClustalW. It says first that it detected that you are trying to do a multiple alignment of proteins, and then that the format is Pearson. That is just ClustalW's name for the FASTA format, so that's okay. It lists the amino acid sequences, just to make sure that it understood what you gave it, and then it starts aligning the sequences pairwise. So the first step in any multiple sequence alignment is pairwise alignment of the individual sequences, and then it creates something called a guide tree, which you can actually click on. If we click on that, it tries to download it, and then I'm just going to open this with Notepad++. For some reason that's not the default. Why is that not the default? All right, so do that, and then, okay, that just gives me a single sequence alignment; that's not what I wanted. Why is this window so weird, actually? Good. So it does pairwise sequence alignment, you can download the guide tree, and then the guide tree will more or less show you an overview of how the sequences are related, which it does like this.
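The two steps just described, comparing all sequences pairwise and then building a guide tree from those comparisons, can be sketched in a few lines of base R. This is only a toy illustration under stated assumptions: the four sequences are made up and already the same length, the distance is a plain fraction of mismatching positions rather than a real pairwise alignment score, and average-linkage clustering stands in for whatever tree-building method the ClustalW server actually uses. The names `seqs`, `p_distance` and `guide_tree` are hypothetical.

```r
# Toy, made-up sequences of equal length (not the sequences from the assignment).
seqs <- c(
  query = "MKLSTAVLLA",
  blaA  = "MKLSTAVLIA",
  blaB  = "MKISTGVLIA",
  dehyd = "MRPQWGEHNC"
)

# Fraction of positions at which two equal-length strings differ.
p_distance <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  mean(x != y)
}

# Step 1: all pairwise distances (a real aligner derives these from pairwise alignments).
n <- length(seqs)
d <- matrix(0, n, n, dimnames = list(names(seqs), names(seqs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- p_distance(seqs[i], seqs[j])
  }
}

# Step 2: build a guide tree from the distance matrix (average linkage).
guide_tree <- hclust(as.dist(d), method = "average")

# In $merge, negative numbers are single sequences; row 1 is the first (closest) pair joined.
closest_pair <- guide_tree$labels[-guide_tree$merge[1, ]]
print(round(d, 2))
print(closest_pair)
```

The multiple alignment itself would then be built up in the order given by the tree, starting with the closest pair and adding the more distant sequences later.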
Of course this is a guide tree in a form that a computer can understand. Here we see the query, and we see that the query is most closely related to the beta-lactamase from Bacillus, and then less to the others. These brackets here denote the branching in the tree, but it's not a perfect format for reading. All right, let's go back to Firefox. Here we can actually see the alignments, and the alignments can also be downloaded as a file. Let me do that as well; then I can open it in Notepad++, and of course we just get the multiple sequence alignment.
Good, so question number one was: to which sequence is it most related? I think it can actually show you the tree, but that will take a little bit of additional time; then it will give you a tree which looks like a tree. But when we look at the sequences here, we can see that it kind of orders them based on the distance between the sequences, and we can see that our query sequence is most related to the beta-lactamase 2 of Bacillus, and then there's another sequence which is relatively closely related. Here we have it. Good, so here we have our query sequence, the beta-lactamase from Bacillus, and then the beta-lactamase from the other bacterium. So if this query sequence were unknown to us, then especially based on this tree we would say: oh, this is probably a beta-lactamase 2 gene.
All right, first question answered. The second question is: how many amino acids are identical between all sequences? Identical between all sequences means that we're looking for the stars in the multiple sequence alignment, because a star means that the column is exactly identical. So, fortunately I downloaded the alignment in... yeah, so I hope I'm live again, because for some reason my entire OBS just crashed hard. It froze up and didn't let me do anything. Is everything okay?
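Counting the stars does not have to be done by hand in an editor. A minimal R sketch, assuming the alignment was saved in Clustal format, where the conservation line under each block uses `*` for identical columns, `:` for strongly conserved and `.` for weakly conserved ones. The little alignment below is made up for illustration and is not the alignment from the assignment; in practice `aln` would come from `readLines()` on the downloaded file.

```r
# A tiny, made-up alignment block in Clustal format (stand-in for the downloaded file).
aln <- c(
  "CLUSTAL W (1.83) multiple sequence alignment",
  "",
  "query      MKLSTAVLLA",
  "blaA       MKLSTAVLIA",
  "blaB       MKISTGVLIA",
  "           **:**.**:*"
)

# Conservation lines start with whitespace and contain only " ", "*", ":" and ".".
is_cons <- grepl("^\\s+[ *:.]*$", aln) & nchar(trimws(aln)) > 0

# Split those lines into single characters and count each symbol.
symbols <- unlist(strsplit(aln[is_cons], ""))
counts <- c(
  identical = sum(symbols == "*"),
  strong    = sum(symbols == ":"),
  weak      = sum(symbols == ".")
)
print(counts)
```

Dividing these counts by the alignment length gives the kind of percentages used to judge whether the sequences were similar enough to align in the first place.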
I'm still getting some lag on the webcam, and I think on my voice as well, and that's not good, not good at all. What the hell is this thing doing? I hate when this happens. Nova Vego, thank you for subbing, or thank you for following, actually, while the whole stream is crashing. Let me see if I can figure this out. All right, that looks better. So people can see me again; can people hear me as well? I think so. I hope it's not too weird. "Did you try turning it off and on again?" Yeah, I did, I had to force quit it. Okay, so my moderator sees me and hears me, so that's good. I am getting massive warnings from Twitch about frames being dropped, and I have no idea where that's coming from, but we'll try to muddle through. So hopefully people get audio and video just fine. Good, okay, so then it's just me that's not seeing it.
All right, ClustalW. So we did our alignment. I think something went wrong when I tried to switch from Firefox to Notepad++. The second question is: how many amino acids in the sequences are identical? For ClustalW we know that a star means an identical column, so when we go to Notepad++ we can just say find the stars, and then count them. If I count the stars, I see that there are three. So across all of the sequences that we have, there are only three amino acids which are shared exactly between all of them. And here you can already see that something kind of went wrong: these sequences are too distantly related for us to start aligning them, because you can see that there are large stretches where all of the sequences are different. So this alignment is far from a perfect alignment, so to speak.
All right, the next question is: how many amino acids are highly conserved between all sequences? Those are the colons, so let me select one and then just count them. So those
are 20, right? So there are three amino acids between these five sequences which are identical, and then there are 20 which are highly conserved. And of course that's very low, especially considering the length of the sequences: if we go to the alignment in Firefox, we see that's something like 250 amino acids, and the longest one is 326. So here we have one of the pitfalls of multiple sequence alignment. We just take sequences, we throw them into ClustalW, and ClustalW will do its best to align them, but it won't give us a message saying: you know what, these sequences are too distantly related, or these sequences are not related to each other. Only about 10% of the whole amino acid sequence is conserved between these, and only about one percent of the whole sequence is identical. So these sequences are way, way too far apart for us to align them.
All right, so now: change the parameters one by one, put in extremely low and extremely high values, and see how it affects the alignment. Can you find parameters at which the alignment is nonsense? In theory you would say: well, the alignment that we did using the default parameters already seemed to be a very nonsensical alignment. So let me go back, and we can change all of the parameters. For example the gap open penalty: normally you get penalized by a score of 10, but we can say penalize by a hundred. And then when we execute the multiple alignment, we see that we get an alignment again, but now the alignment is more or less gapless, because it is now so expensive to open up a gap that ClustalW will more or less want to put the gaps at the beginning. Opening a gap inside of the sequence will give you a penalty of a hundred, so the algorithm will not open one; it would rather, in this case, try to do more or less a global
alignment instead of a local alignment, so it just kind of lays the sequences next to each other. And of course you can change all of the parameters one by one, and you can also say: use a different matrix. So you can use for example the PAM matrix. And the gap open penalty actually went back to the default. If we use the PAM matrix, we see that the results again are slightly different. I think it still uses the updated one, but what you can see is that these sequences are very, very distantly related to each other, and the conclusion here is that we probably were not allowed to align them to begin with.
All right, so then the next question: download the myostatin gene DNA and protein sequence from Ensembl for human, gorilla, mouse, chicken, and a species of your own interest. I hope that people were able to do that. Let me see if I actually saved the sequences or if we have to download them again. No, I haven't saved them to my hard drive, but that's okay; we can just go into Ensembl. So let's go to Ensembl. We wanted human first, so we say human, and then we say myostatin, and just get the protein and DNA sequences. MSTN is the gene. Ensembl is again a little bit slow today, but it's at least doing something. We're going to export the data from the main page. We're going to say: give me a FASTA sequence; give me the feature strand, that's okay because we don't care; I want to have the unmasked sequence, though you could go for the masked sequence as well; I want to have the cDNA, well, let's go for the coding sequence; I want to have the peptide sequence; I don't care about the introns or the exons, so don't give me those. Press Next, and then give me a text file. So now here it gives me the DNA sequence, and then it gives me the peptide, which is the peptide that the myostatin gene encodes. So let's just take the first two. Here we get the genomic sequence, and of course the
genomic sequence is way, way longer, because it also contains the introns. That's a little bit of a shame, but we don't need that one, so we're just going to take the coding sequence: the part of the gene after it's been transcribed into RNA and after all of the introns have been removed. So we take this and put it in Notepad++. Let me open up a new file; that's the first one. I'm going to show you, and I'm actually going to rename this, so I'm just going to say Human MSTN DNA, and I'm going to do the same thing here and say Protein, just to make sure. So that's the human one.
Let's do the next one, which was mouse. So let's go to mouse; let me show you what I'm doing. Again, I selected mouse, I search for MSTN, myostatin, and I press Go. Just take it from the reference strain, and then it needs to load a little bit. I'm saying Export data, and again I want to have FASTA sequences. I just want to have the coding sequence and the peptide sequence, which are selected. I press Next, give it to me in text format, and then I take the first one. And here you can see that there's something interesting going on in mice: mice actually have multiple versions of myostatin, because the mouse myostatin gene produces multiple different proteins from the same sequence. You can see that this one is relatively long, but it also produces a shorter myostatin protein. So there are two different proteins being produced from this one gene, but we take the longer one, because we want to compare it to humans, so we want the one which has more or less the same length. Then we go to Notepad++ and we add this. We say this is Mouse MSTN, and this is DNA; let's copy this and paste this, and then we say this is the protein sequence. All right, then the next one would have been gorilla, I think, so let's go back.
Let's just go to the home page. Then we need to select the species, which is gorilla, and then we search gorilla. I always love the Latin name for gorilla, which is Gorilla gorilla gorilla. It's funny: mouse is Mus musculus, but gorillas have the best one. We are Homo sapiens, but I don't know why we're not sapiens sapiens sapiens or something, like Gorilla gorilla gorilla. Anyway, MSTN, next: gene MSTN from the western lowland gorilla. We just click on it, and we have to wait a little bit for Ensembl to load. If we have this much waiting time, we could have just used biomaRt in R, I think, which would have been quicker. Just say Export data; it should have selected what I wanted, and I'm going to press Next and then say: for the gorilla, give me the text file as well, which again takes time. So here we can see that gorillas, just like humans, only encode one protein from their myostatin gene. Let's copy this and also put this one in Notepad++. Let's go here, Enter, and then this one is Gorilla MSTN DNA, and then we have Gorilla MSTN Protein.
All right, so which one was the next one? It was human, gorilla, mouse, pig. All right, so pig. Go to Firefox, Ensembl, pig breeds. Which pig do we want? Anyone got a favorite pig? They don't have the Göttingen minipig; that would actually be interesting. But let's just take the Hampshire pig; that's kind of a standard pig. Oh, it wants me to view an example location. Can I not just search for pig? Yes, so I'm going to change it here to pig and say MSTN. Let's get the reference genome instead of the first one, which is a certain breed of pig. I actually don't know what the reference is for the pig, but we shall see. All right, so we export the data, and we say: give me everything that we wanted, again the same thing, so just the gene sequence. Then go to text and take the first one again. Pigs, like humans and
gorillas, only encode one. I'm again going to put this in Notepad++, so I'm going to say Pig MSTN, and this is DNA, and then here I'm going to say Protein. All right, very good. So then there's one left. Let me see which one is left: chicken. All right, go back to Ensembl, say MSTN, and I want to search this in chicken. There we go. So here you can see that MSTN is actually still called GDF8. We talked about this last week, that genes have multiple names, and in chicken they did not rename the myostatin gene; the gene symbol is still the old gene symbol, GDF8. Again we say Export data, then Next, we want the text FASTA, and we take the first one. Interestingly enough, chickens also encode a single myostatin, like humans, gorillas and pigs. So we go to Notepad++, we say Chicken MSTN, and this is protein... DNA, sorry. We copy this, rename this one, and then we say Protein.
Good, so now we've got all of our sequences. I'm going to save this file for later use, because I'm going to use it again. So these are the sequences. I'm not allowed to save there; that's okay, I'll save it to my OneDrive. All right, so now I have my file with protein and DNA sequences. Let's see what the assignment is: using ClustalW, analyze the overlap in DNA sequence similarities between these six species. So, question to you in chat: do we want to use R, or do we just want to use the online tool? There are only four people watching, so the first one to shout gets his way. Do we just do it using Firefox, or do we want to write a small R script which does this? I'm just going to use R; I like R a lot more. See, that's what I thought. Xanakin says we do R. Good.
All right, first things first: I need to figure out where I saved my file. I just started a new script file, so I'm going to set my working directory to where I saved the protein file, which turns out to be C
:/Users/Arends, I think. Yep, that's correct. All right, then let me show you the R window as well. Let's set our working directory there, and then we're going to load in our file. We're just going to use readLines, because why not; we just want to load in the whole file. The file was called protein sequence dot fasta, and this goes into fcontent, or some other variable name. Let me save this as a script dot R, so that we get nice code highlighting; let's call it script2.R. And let me see what happens. So now if we look at fcontent, it gives me a warning that there's no newline at the end of the file, which is true, but this is what we have. This is just the content of the file, 140 lines in total, and we see that we have the descriptions as well.
So the first thing that we need to do is split this file. The first question was to do the DNA alignment, so we want to have the DNA: when it says DNA in the header, we want to get all of the DNA letters up until the next header. So the first thing that we have to figure out is where the sequences start and end. The start of a sequence is denoted by this greater-than symbol, and then the next line will be the sequence. So how do we do this in R? I'm going to say grep, and then grep this greater-than symbol in fcontent. grep is something that allows you to search in text, and when I do this, it tells me that the first line has a greater-than symbol, then line number 21, 29, and so forth. So I'm just using grep to find this greater-than symbol, because I know that when I see it, the next sequence starts. So I know now that from line 1 up until line 20 there is a
sequence that I want to use. So that's good; let me store that somewhere. So, seqstart, or seqs... let's give good variable names. These are where the sequences start. And then of course where the sequences end, that's just where the sequences start minus one, because a sequence ends on the line just before the next one starts. Let me see if this works. Let's go back to R and copy-paste in the little piece of code that we made. Sequences start at these positions, sequences end at these positions, and now we see we have a little issue. The first sequence of course goes from 1 to 20 and the second one goes from 21 to 28, and the problem is that the last one is not ending at 133; the last one actually ends at 140. So I need to adjust my seqend: I need to drop the first one, because that doesn't make sense, and then I need to add the last one, which is just the number of lines that I read from the file. So let's do that. We say seqend is seqend[-1], which drops the zero at the front of the vector, and then seqend is c(seqend, length(fcontent)), because that's our file content.
All right, that's the wrong window; let's go back to R. So now when I look at this, seqstart has the start positions and seqend should have the end positions, and now I can for example do a cbind to combine these two together. So I say cbind(seqstart, seqend), and now I get a little matrix. It tells me that the first sequence starts at 1 and ends at 20, and the second sequence starts at 21 and ends at 28. So that's nice, but now I still need to figure out which sequences they were. Fortunately the first line, where the sequence starts, is the annotation, so I can just use fcontent. So let me go back to Notepad plus
plus, since we're writing a little script. So I can say fcontent at the positions seqstart, and this gives me the names of the different sequences. Wrong button again. So these are the names. I'm going to use cbind to make a little matrix, mseqpos, my sequence positions, and then I'm going to use these as the names: I say rownames of mseqpos is this. So now in theory I should end up with a little matrix which has the names on the rows. So, mseqpos: here we see the names, and then where the sequence starts and where the sequence ends. And now of course I need to add one to where my sequence starts, because the DNA sequence for human myostatin does not start at line one; it actually starts at line two. R makes this really easy for us, because I can just take mseqpos, the first column, say plus one, and put that back into the first column: mseqpos[, 1] is mseqpos[, 1] + 1. All right, so now everything should be okay. Now I have my little information table which tells me where each sequence starts and where each sequence ends. So, mseqpos: the human DNA sequence for myostatin starts at line number 2 and ends at line number 20, and the protein sequence runs from 22 to 28.
All right, so the next thing is to get the sequences. I'm first going to add an empty column to my matrix, because I want to have a column where I can store the sequence. So I say mseqpos is cbind(mseqpos, ...), so bind a column, and the column is called sequence, and initially I don't know what the sequence is, so I'm just going to fill it with a missing value. When I do this in R, it looks like this, and it now just has an empty
column for the sequence that we're going to add. All right, so now let's extract the sequences. We're going to say: for x in 1 to the number of rows of this mseqpos thing, which is my annotation. What do I want to do? I want to get the start and the end position, so that I can get all of the lines out. So I say: for l in mseqpos[x, 1], which is the start position (I could have used seqstart, so let's say the column called seqstart), up to the sequence end. That's how we do it. So now I have a second for loop: x is the current annotation row that we're looking at, and l, let's just call the variable l, means which line of the file I'm at. I'm going to build up my sequence: initially I have a sequence which is empty, so there's nothing in there; let's call it seq. And then the only thing that I have to do is take the line out of the file: fcontent, give me line l, and add that on, so paste0 the sequence that we already had with fcontent at line l. And then let's cat the seq and put a newline behind it, just to check.
Let me first run this for you, so that you can see what happens. It will just print the sequences that we have, and of course these sequences will be relatively long, especially the DNA sequences; the protein sequences look very similar in length. That's how you build it up, and we're writing a script which is generic, so we can use it later on. Good, so let me go back to Notepad++. Of course we now have to store this seq variable in mseqpos, so I say: mseqpos, row number x, at the column sequence, put this thing in, so that we remember it for later. And I'm going to disable
printing to screen, because I don't want that. All right, so let's run it. When we look at mseqpos, it now looks like this: we have the start position, the end position, and then the sequence. Oh sorry, you're looking at the wrong window. So now it looks like this: the start, the end, and the sequence.
Good, but now we have DNA and protein sequences mixed. Again we can use the grep trick, to say: I only want to have the rows which have DNA in their name, or I want to have protein. So I say grep("DNA", rownames(mseqpos)), and these are my DNA sequences. This should tell me that one, three, five, seven and nine are the DNA sequences, so that's okay, and the same thing holds for protein. So now we can make our vector which contains the DNA strings. I say: from mseqpos, take these entries and give me the sequence column. Oh, that's the wrong button. So this gives me a vector with all of the DNA sequences in there, and of course I now have to put the names on there, because I want to remember that it was a gorilla or a human, these kinds of things. So if we go back to Notepad++, I say these are my dnaseq, and then I give them names using the same trick: names of dnaseq are the rownames of mseqpos at these positions, the positions that I just found. If I do this in R and look at dnaseq, it looks like this: the DNA sequences, and somewhere in the back it tells you that Human MSTN is the name for the first sequence, Mouse MSTN for the second sequence, and so on. All right, so now we can start
multiple alignment, right? When we do multiple alignment we need library(msa), and then we need to make a DNAStringSet out of my sequences, so out of dnaseq, and then we have a DNAStringSet; call it dnaSS or something like that. Let me run that and see if everything goes okay. All right, it's loading all of the required packages, and now I can make a DNAStringSet. When I look at dnaSS, it tells me that we have a DNAStringSet with a length of five; here we see the lengths of the sequences themselves, then the sequences, and then the names. And now we can start aligning. The next step is to do a multiple sequence alignment, so we say msa of our dnaSS, and this is my alignment. Go back and run the alignment, and it says it is using the default substitution matrix, and now we can see that the alignment is done. You see that it inserts a little gap, because here you see ATG: it aligns all of the ATGs, and in mouse for example we have two ATGs at the beginning, and here we actually have an ATC ATG for gorilla. It has also already kind of sorted the sequences, and we can see that it comes up with a consensus sequence.
Good, so now, I forgot how the seqinr trick works that lets us cluster them, so I'm just going to look that up in the previous slides. That will take me a little bit of time. I go to Documents, my PPTX files, bioinformatics 2021, and I want to see my old lecture. The trick for doing this is all the way at the end. Yes, and I have to do library(seqinr). Let me show you what I'm doing: I'm just taking the code from the lecture. I need to load the library seqinr, and then I need to convert the alignment from an msa alignment to what is called a seqinr alignment. So I'm just going to overwrite the
alignment object. Then I want to cluster based on similarity, which is also code from the previous lecture, so I'm just going to copy-paste it so I don't have to write it: make a distance matrix, then do a clustering and a plot. Again, I'm copy-pasting the code from last time's slide because I don't know this by heart; once you've written the code, you just reuse it.

So let's see what the clustering looks like, to see which sequences are most related to each other. To recap: we make our alignment using the msa library, then we use seqinr to get the similarities, and when we plot the clustering it looks like this. It tells us that the human myostatin DNA sequence is most similar to gorilla; pig is closest to the humans and gorillas, then mouse, and chickens are furthest away, so they are the most dissimilar. If we want to make this look a little bit better, we can say plot clustering with hang is minus one, and it will pull all of the lines down. By default each label hangs just below the point where it joins the tree, which can look as if it says something about when a species split off, and we don't want that; we just want to have everything on the same line.

If we want it to look better still, we need the ape library, so let's do that as well, because normally you want to show these phylogenetic trees horizontally instead of the way we're looking at it now. So back to Notepad++: we load the library ape, transform our clustering into a dendrogram, convert that to a phylogenetic tree, and then plot the phylogenetic tree.
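As a rough sketch of the clustering and plotting step just described, here is the same idea in base R only. The distance values are invented for illustration (they are not the myostatin results from the lecture, where the distances come from the alignment via seqinr), and `toy` and `clustering` are just names chosen for this example.

```r
# Toy pairwise distances between five species (invented numbers, illustration only)
species <- c("human", "gorilla", "pig", "mouse", "chicken")
toy <- matrix(c(0.00, 0.02, 0.10, 0.15, 0.30,
                0.02, 0.00, 0.11, 0.16, 0.31,
                0.10, 0.11, 0.00, 0.18, 0.32,
                0.15, 0.16, 0.18, 0.00, 0.33,
                0.30, 0.31, 0.32, 0.33, 0.00),
              nrow = 5, byrow = TRUE, dimnames = list(species, species))

# hclust() expects a 'dist' object, so convert the symmetric matrix first
clustering <- hclust(as.dist(toy))

# hang = -1 draws all labels on the same baseline
plot(clustering, hang = -1, main = "Toy clustering")
```

If the ape package is installed, `plot(as.phylo(clustering), type = "cladogram")` should give the horizontal layout; `as.phylo()` accepts the `hclust` result directly.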
Actually, we don't need this dendrogram call at all, so that's something to clean up. Let me show you how this looks. Instead of a standard dendrogram I'm now making a nice radial plot, which looks like this, and I can choose different types. Let me look at plot.phylo: we can use type is cladogram, and this is normally how you would show it in a paper. We see that gorillas and humans are the most closely related. It doesn't change the tree at all; it's just a different visualization. We can also show it as a fan (I'm just looking through the help file to see which options you have), which makes it a nice circle, and then there is unrooted as well, which shows it as an unrooted tree, more or less similar to the radial plot that we saw before. In this case the cladogram is probably the best way of showing it, and this is the way we do it for DNA sequences.

So let's go back and clean up the code a little bit. The first thing I always do: we use three different libraries, so I move the library calls all the way up and load them first thing. I need msa, I need seqinr and I need ape. This is for anyone using your code: they can look at the top of your file and see, oh, I need to install three libraries, and these are the three. If the library calls are at line number 20, they have to scan through the whole code to figure out what to install. I can also add a little bit of text at the top: use R to do multiple sequence alignment on DNA and protein sequences, because we also wanted the protein sequences. Good, so now we have to add some comments. So the first
thing is: set the working directory and load the file. Then here: figure out where sequences start and end. Then: make a matrix with two columns, start and end. Then: move the start column to one line below the annotation, because that's just the way a FASTA file works. Here we create column three for holding the sequence; we go through the matrix and add the sequences from the file. And then step one: take the DNA sequences and align them, and use ape to make (what did we want to make?) a cladogram in this case. So here we have the names, this is the alignment, here we have clustering and distances, and then the cladogram, because that just looks nicer.

All right, that's the way we do it: a little script, 50 lines, well, not just code, there are comments too. (I should type better today: "alignment", that's how we write it.) That's how we build up this file, and the main thing is to add the comments, because if you don't, no one knows what's going on. Now people can kind of follow. I'm recording this, so in theory I could show people the recording, but still. Of course I want to add my name to it as well, and I'm going to say this is lecture 11, or at least the answers to the lecture.

Now, as a little trick, instead of grepping for DNA we can also grep for protein and then do the protein alignment. This is how you build up a little script, and this is how you do the analysis. I can actually take the grep out: I put the selected rows in a variable, call it rows with selection, and then reuse that variable in both places where I had the grep. Now I can say protein, call the result AAseq, and do the whole thing again. So remember the cladogram: this one is based on the DNA, and now we can just run
the whole code again, and then it comes up with an error, because I'm making a DNAStringSet from amino acids. In this case I want an AAStringSet, so I update this line of code to say AAStringSet. That fixes the error, it does the alignment for the amino acids, and we get more or less the same cladogram. So that's how you do it: you go one by one, saying okay, I want to do this, and then you do it.

Of course this is relatively advanced. You need to know things like the grep function and the rownames function, and you need an idea of where you want to go, which is really hard when you're starting out programming. It took me a couple of years to build this up. And you can still see that I don't know everything by heart: this whole clustering part I just copy-pasted from the slide. I could have figured it out, but then I'd have to go through the help files, and I already did that four or five years ago when I was designing the course.

All right, so what were the questions? Analyze the overlap in DNA sequence similarity, and you can do the same for protein sequence similarity. Back in R (the name DNAseq is now wrong, because we're looking at the amino acids) we can see the different sequences, and if I look at the alignment, it also tells me the consensus sequence. By looking at the consensus sequence I can figure out how many amino acids are identical between the sequences, how many are more or less highly conserved, and these kinds of things.
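The idea of reading conservation off an alignment can be sketched in base R. This is only an illustration with three invented, already aligned sequences; it is not output from msa, and the names `aligned`, `chars` and `identical_col` are made up for the example.

```r
# Three short, already aligned toy sequences ('-' marks a gap); invented data
aligned <- c(seqA = "MQKLQ-LCVY", seqB = "MQKLQILCVY", seqC = "MQRLQ-LSVY")

# Split each sequence into single characters:
# one row per sequence, one column per alignment position
chars <- do.call(rbind, strsplit(aligned, ""))

# A column counts as identical when it holds exactly one distinct
# character and that character is not a gap
identical_col <- apply(chars, 2, function(col) {
  length(unique(col)) == 1 && col[1] != "-"
})

sum(identical_col)   # number of fully conserved positions
mean(identical_col)  # fraction of the alignment that is fully conserved
```

Running the same count on the DNA alignment and on the protein alignment would be one way to compare how much the two overlap.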
It's just building up this little phylogenetic tree, and then we can learn something about evolution. We can see that humans are indeed related to apes, and pigs are related to humans relatively closely as well. You can see that humans and gorillas had a common ancestor, and that the common ancestor of humans and pigs, at least as far as the myostatin protein is concerned, is further back. Mice are even further back in time, and the common ancestor of chickens and all of these other species is further back still. That's what we learn from it, and this is how you do phylogenetic analysis, or sequence analysis.

Good, I've been talking for almost an hour, so let me go back to the presentation. I think those were all the solutions. If there are any more questions, feel free to ask; you can put them in the comments or just throw them in chat. I will switch to the overview slide and then we'll take a quick break. I'll be back in around 10 minutes and then we start with the lecture, so I will see you guys at around 2:05.