 All right, let's start with the audio check if you're watching the stream and you can hear me or you Can't hear me, which is more difficult of course Let me know in chat Because then we can start like I'm excited. I'm excited. So I've been having a holiday. So Sitting at home doing nothing really. So I thought it would be good to to do a live stream and I'm wondering if people can hear me Misha, are you there? Can you hear me? My microphone is doing like the audio thing, but I'm still getting this YouTube errors. So I might not No audio, let me just put it in chat so that people can answer me if there's audio if not, then There's no audio Which boot suck badly That's so annoying All right, let me see what's going on here. Okay, so my microphone should be on Huh gonna have to check on my phone just to make sure Gonna have to check on my phone since no one chat is entering which is okay Like you don't have to put anything in chat. Let me see. Let me see Let me open up the sound Yes, but the sound is a bit low and hollow. Yeah, that might be the case. I can put it up a little bit higher But I'm using a different mic So I can put it a little bit closer to my But yeah, that's always the case with live streaming and since I'm doing it from home I don't have the equipment that I have on my work. So that's just a shame, but I hope it's audible and in the end it's just about the programming, right? So I think people will enjoy it And uh, it's not so much about my face or how I look, but um, Let me show you guys the So everything here what we're going to do is just like we're going to use open source databases and tools and things that are available Things that are available To analyze some monkey pox data and why monkey pox? Well, it's hip and happening at the moment So that's the reason that I chose monkey pox. You could do it with any virus Or other genomes But I just thought it would be fun to give you guys an example On what you what you can do with it Because like we've I've been doing the r course and the bioinformatic course And people always wonder or people always ask me like you you give us these little examples to practice on But we never see you making something. Um, and of course making something bigger is something that's difficult when When you're just alone So let's just start right? So I'm gonna open up my notepad window. I am going to open up myself I hope I hope I hope I hope no, this is all wrong That is so horrible. All right. So let me switch that to the correct screen Because I want to use monkey. All right. Now you can see me as well so, um Hoping that the audio is fine because that's that's one of the drawbacks Um Don't have the equipment but good. So for today. This is kind of what we're going to do So if you want to follow along or if you want to do the coding yourself I made a github repository. So, um, let me put the link in chat And I will be pushing my code there while we are Doing the analysis so that you can kind of see how I would normally work on a on a normal work day Um, so here's the live stream. So this is the um github So it's github.com my username and then monkey 2022 or monkey 22 good and with That like in the end we need to start somewhere, right? So I made a little idea of what I wanted to do today So the first question is is where do we get our data from and that is of course a tricky question There's a lot of sources of data out there But since monkeypox has been recently kind of rediscovered or discovered in the uk But also in portugal belgium in the usa there's a lot of new sequencing data coming out and most of the new sequences I found on virology.org which is a website where people share virus sequences and You can kind of publish little articles there on on genome analysis So that's where we're going to get our data from and of course, we're going to use ncbi for known previous monkeypox outbreaks And also other pox viruses because in the end when you're doing phylogeny You're not just doing phylogeny on a single genome. You you want to kind of put that into context And then we're going to load the genome. We're going to get created blast database We're going to load the associated proteins from ncbi Which is one of the issues with new genomes to coming out and that this is kind of the reason why I wanted to make this I wanted to show you guys How do you analyze a genome when there's no information about that genome really available except for the ac tgs that you kind of find on on a on a website, right? and then Because we need to create our own protein models We can use other pox viruses like smallpox or the earlier monkeypox outbreaks To kind of combine this information from these different Sources or from from the from the different protein sources together to create protein models So we're going to use leavenstein distance or multiple sequence alignment to do something like that Then we can create a consensus sequence and then with this consensus sequence We can finally start annotating the new genomes that have been published So i'm really excited about that I haven't seen any work yet The only thing that I saw is that people just take the whole dna sequence And just do some multiple sequence alignment on it, which I don't think is the best way to look at viruses especially one of these Monkeypox virus where you have two of these like hyper variable regions Where genes are duplicated and There's like different versions of the same gene across the genome And then in the end I wanted to show you guys like a little picture of what you can do Um, but with that out of the way, I think the first thing that I wanted to show you guys is How to get your data right because if you're like me or not a virologist You're just a basic bioinformatician sitting at home doing nothing Worried about an outbreak then the first thing that you do is is try to get some sequence data So for that of course, we need to start up our trusty firefox window And that goes wrong again because for some reason firefox also switched to the wrong window Because I don't want to show you guys my live streaming window I just want to show you guys my empty firefox window And then we're going to go and put it away All right. So what are we going to do? Well, we're going to go to NCBI, right? So NCBI is the national center for bioinformatics And we are going to go to the nucleotide core database because we want to download DNA sequences Right. So then you get onto a website. It's a little bit big Which looks like this, right? So you have the word nucleotide because that's the database that we want to search in and then we want to search for something So we're just going to be like filling in monkeypox, right because that's what we want, right? So And then it comes back and it says well, there's a monkeypox virus, which is a double-stranded DNA virus in the family of poxviridea And then here we have, for example, the recently sequenced USA More or less genome, right? So if we click on it, then we see that it's a big sequence So it will download what we see is a whole bunch of information So it's the monkeypox virus isolate. It has a name and this is the complete genome Here is the genbank ID, which we're going to need And this is dot two. So it means that it's already version two. So the Americans published their first draft Very soon from their first patient, I think only like three or four days after the outbreak And then they decided well, we are going to sequence at higher depth and with a higher sequencing depth It means that they have more coverage and that their genome is just better So you can see that the virus is relatively long. It's 197,000 base pairs long It's a direct submission and here we see some more information So we can see that it's a monkeypox virus. It's genomic DNA of the virus This is the isolate. So that means that this is the more or less patient ID from which they got their virus sample and the host was homo sapiens. So that's that's humans And it's from USA MA stands for Massachusetts. I think correct me if I'm wrong, but That doesn't matter too much in the end. We're just going to more or less Collect a whole bunch of these viruses. So how are we going to proceed? Well, we're going to proceed by just putting up an excel file So this is an empty excel file that I created Well, there's nothing in there. So I'm just going to say I want to have a column which is called ID Right and that's going to I'm going to use that So I'm just going to copy in the genbank ID from the USA monkeypox virus And then the next one is going to be well, what kind of species is it? Right and then the host And then we want to know the region So in this case, it's North America And then the country which in this case is USA and the host was homo sapiens And the species let me copy that from here because this is an orthopox virus. All right, so This is how it looks and this is how we're going to treat our data And now comes the boring part, right? Because now we have to go here and then okay So now we got the first from the US, right? We we saved our genbank ID and then we're going to go to the next one So we're going to just click on for example A different one and we're going to see so this is an unverified monkeypox strain from the Democratic Republic of Congo This was not found in humans. It was found in another animal. No idea what animal this is So let me look for that. So it's kind of a mouse kind of thing switz mouse. So, yeah, it's a mouse And we can we can download a whole bunch of these again You see that it's just the DNA sequence Right and even worse here. There's a gap because they They couldn't sequence the whole thing. So we see here that there's a little gap But what we can do of course is find one which is annotated because we have previous virus outbreaks where we do find nice annotation where there are proteins annotated So let me see. Yeah. So here we have one and here we can see that They have annotated genes and coding sequences and genes and coding sequences. So we we also have the protein sequences here But of course this needs to be translated to the new Outbreak because it could be that there's much that there's a lot of difference right between the two Genome might be longer. There might be massive gaps in these kinds of things So in the end, hey, you have to prepare your excel file. You have to add a whole bunch in And of course because I'm a good streamer. I already Prepared some so I downloaded or not so much downloaded But I took my time and I just selected a whole bunch of these box viruses So in total I have like 75 and then I thought that would be enough So we have IDs. We have the organisms, which is all auto pox virus. I put in some other species So these are goats. We have here some birds We have a salmon, which is a fish and for the rest I also put in some cow pox I also put in the akmeta virus. The akmeta virus is very closely related to the pox viruses It's in the same group But it's it's it's not a standard pox virus I added a couple more like vault boxes and raccoons And of course every time you have to go to NCBI look up where the sample was collected What species was collected on and sometimes you can't find this information. I also put in the The vaccine That's the vaccine that's being used for birds And besides that I also think I put in some of the Really bad ones. So these are the variolas range. So that's small pox, right? So they are different But in the end we have this big excel file. We spent some time on making it And fortunately I didn't have to make this one myself completely Because fortunately on virology.org someone actually shared a very small subset of what they were looking at So they think it was the belgium group which published their first genome sequences And then they gave a little table like this Little bit different. They they didn't split it up between continent country and state They just had only countries And they only looked at humans Which I think is Not what you should do, right? If you compare viruses viral gmenoms, they can be very different between different species So you always want to get a couple of outgroups and these outgroups are very important because they kind of give you a way to anchor Your virus into this virus tree And that's why I always put in some goats some cows some some other animals so that we can compare So how do we proceed? Well, we just select everything control a we copy it control c And then we go back to to notepad and in notepad We just create a new file and we are going to save this file. So when I copy paste, it's a it's a Separated file using tabs. So I'm just going to save this into my monkey And I'm going to say monkey pox annotation dot txt. So that's why I'm going to call it Oh, you you guys can see that that I'm saving it But I called it monkey pox underscore annotation dot txt and it's in the data folder So the first thing that I'm now going to do is I'm going to commit this file to github So I'm going to say github add data slash monkey monkey pox underscore annotation dot txt And then hit commit minus m initial data file And then I'm going to push it. So if you go to the github repository, which is um in the in the chat So github.com then the alice monkey monkey 22 Then now you will see this file Being there so that you can use it for yourself Good. So of course to do where do we get our data from so I just got a whole bunch of sequences from ncbi first I'm not going to deal with the newly published sequences because fortunately a couple of the newly published ones are already in there So we have two from the us Version one and version two and we also have the one from great britain, which was sequenced a little bit earlier But has still version one. So it's not it's not a very up-to-date sequence, but we have two of the new monkey pox And we can now also start looking at which what what is different between the first sequencing run from the americans in the second one But first things first we go to our code and we're just going to say We're going to read in this table, right? So I'm just going to do read table I'm going to load in data slash monkey pox comma and notation And notation dot txt My separator is a tab I am going to say that I have a header because I added one and my row dot names is one Because I want to use the ensemble IDs as the row names so This should load in our file without any issues and how do we want to call this? I'm just going to call this source IDs right because I need to get my IDs from somewhere So I'm just going to get them from a table. All right. So let's quickly go to r See if that works. So then I have to yeah, that works So let's go to r and load in the data file. I also have to load the libraries So you can ignore the api key dot r If you query ncbi And you do that a lot then it's better to have an api key so that they can keep track of of how many requests you do You can do everything without an api key what we're doing here today You just have to make sure that you're not overloading their server So if we look at source IDs, then this is what we loaded in so a whole bunch of different different viruses Whole bunch of auto boxes some monkey 22s and a whole bunch of older ones as well And here the table continues that so that's all fine. So in total we should have 75 So let's check that we really have 75. So we do have 75 which is what we wanted Good. All right First things first. So where do we get our data? We just go to ncbi We make a little excel table and now we want to download this data, of course Because we now want to get access to the genomes. So how do we get access to the genomes? Well getting access to the genomes is really useful because we are is really simple because we have this r interest function Which allows us to automatically query Entress. So how does this work? So It works more or less like this and I'm not going to use the api key first So let me show you guys notepad plus plus again. Um, so what I'm going to do is just say entress Entress underscore fetch So I want to fetch from the nucleotide database nucleotide database. I want to fetch an id So fortunately we have an id because we have the row names of source id Ids and we're just going to use the first one, right? So just download one sequence just to make sure that it works And I'm going to say I want to have the fuster sequence back because I love working with fuster files All right, and then I'm going to say M data, right? So my data just just first to check And then we're going to go to r and we are just going to do the query And what I'm expecting is now that I get a long sequence back So if we look then indeed we see that we get like a whole bunch of sequence data And it also gives us the fuster header saying that this is indeed the usa genome that we just looked at This is actually version one. We could download version two as well. The nice thing about The entress fetch function is we want to get the first two We can do the same thing so we can just say one two two and then we get two sequences instead of one If I look at m data, it still shows all on one line So it kind of flips the r window around because it's too much data to deal with But what we can see is that there are slash new lines in the file, right? So the first thing that I'm going to do is say, well, I don't want everything on a single line, right? I want to Make sure that I have my return sequences and that I'm going to split them So let me just call this rsec right for return sequence And then what I'm going to do is I'm just going to sdr split The rsec and I'm going to split by new lines because that's kind of how a normal person would read it And then I get something back, which is readable. So now I have like 60 base pairs per line And there should be header lines in there somewhere But the virus is so long that I can't even scroll up to see what's happening So I'm just going to take this little piece of code and then we go back to notepad And we're just going to add that, right? So I'm going to call this return sequence and I'm just going to query one No, let's query two, right? So then we have something to work with And then we can do some data analysis on that So of course because I did a string split string split returns a list So I'm just going to unlist it because I don't want to deal with the fact that it might have multiple Multiple elements in there. So just unlist it then everything's on a single line So if we now would look at the first 10 lines of this, um, then it looks somewhere like this. Oh, um, notepad Let me show you the R GUI again So let's look at the first five lines, right? So we see the first header line and then we see the code that belongs to it and somewhere in the middle There will be a secondary header line because we asked for two So let's figure out where those header lines are, right? Because in the end I want to have a I want to have all of the sequences in a vector and the names of the sequences are going to be the Names of the vector, right? So I'm just going to kind of combine the two data sources because now it's in one vector Sometimes a line means it's a header. Sometimes a line means it's code So how are we going to do that? So back to notepad plus plus So we do an unlist on this and then I'm just going to override my rsec Because that's how I roll because I don't need it, right? So first I want to figure out where are the header lines So I'm just going to say And what am I going to grab? I'm going to grab the larger dense symbol and I'm going to do that in the return sequences Right and this should give me back Which lines contain the header? So let's go to r and now when we look at header dot lines We see that we have one header line, which is oh, I didn't override it So I I looked at it, but I didn't put it in rsec So let me make sure that everything's okay. So I'm going to do the query again And we have our header lines and now we should have two positions, right? So the first line is the header line and then the next header is at line 2820 Good. So now I can figure out where my sequences are right because my sequences go from line number two to line 2819 so The last one is going to be from 2821 until the number of lines that I have So how do I build this up? Well, I know that my start Is going to be So let's go back here. So my start position is going to be the header dot lines Plus one right because it starts at the line after the header And then my stop is going to be and here I need to kind of think a little bit, right? So I need to throw away the first header line right because the first header line was one So and If I jump would just say header lines because normally it's it stops at header lines minus one But now it would get zero as the first entry which I don't want So I'm just going to say from the header throw away the first header line because I don't need that one to figure out to stop So what do I need to do more? Well, I also need to add the total number of lines that I have at the end, right? So I'm going to say length rsec And now if we would go to r Then r would be able to tell us that the start of the header is going to be at header lines plus one So that's a two and a 200 2821 And the stop is going to be At 2819 so the first sequence goes from here to here and the second sequence goes until 256 38 Good So let's see bind these two things together so to get a little bit of a table So I'm just going to say instead of calling start and stop. I'm just going to say C bind start is This comma stop is and then I'm just going to take the stop positions And I am going to call this sec dot pos for sequence positions, right? So I do a grep to figure out where they are so Let me add some comment faster header and the Lines at which the sequence starts and stops All right, so now I have my sequence positions. So now I only have to do one thing, right? I just go through this little table that I made we now have two, but if we would download three sequences We would of course get three positions So let me show you guys in r how that works so We download three sequences now. I just said one two three And then here we have our header line and now our sector pos is a little table Which says that the first sequence starts at two until here To there and the third sequence starts from here to there Good. So how can we now use this, right? So we can just say l apply to the sector pos to the first To the to the rows. I want to execute a function and function Of x right and just so what do I now want to do? Well, I want to get the sequences, right? So I just want to say well, um From my return sequences Right, what do I want to get? Well, I want to get a sequence from the start which is x start and I want to have this sequence run until the x end So and this will create a vector containing two three four five six seven and I need to make sure that they are actually numeric values because otherwise r will just randomly Take the character values from it, but I'm just going to say as numeric just to be sure Right, and now I have this little sequence Right, and this is just from the beginning until the end I just name all of the rows that I want to get. Um, so what I'm going to do I'm going to from rsec select these rows or select these Entries and then I'm going to paste them all together Um, and I'm going to say separator is nothing And to make sure I'm going to say collapse is also nothing And these are going to be my fusta sequences All right, let's go to r see if everything works if I typed it in correctly Um, and I did not so it says unexpected closing So I'm going to go to notepad and I'm going to use code highlighting So here I have a closing which I don't want because I want a sequence from the start until the end And then I want to use this as the selection to my return sequences And that should all be fine. All right, so let's go back to r And we should have fixed the All right match dot fun all is not a One is not a function. Ah, right. Yeah, because I need to not use l apply I need to use apply of course because I'm dealing with a Matrix because sec dot boss is a matrix Good. So a little bit of debugging Um, two must be a finite number. Why is two not a finite number? Should be All right Sec boss or finite Sec dot default. All right, so let's build in a little bit of debug and then I'm just going to say cut x start comma space comma x end And put a new line behind it All right, let's go to r run the code and It goes from two to na and that is because it's not called end, but it's called stop little little errors We'll encounter more, right? So we go here and we call it stop And we call it stop here and then we delete our little line of debug And then we're going to go to r and make sure that it works, right? So now when we look at our sec dot fusta, we should have two fusta sequence sequences So I'm just going to l apply to the fusta sequence and char All right, because just to know how many characters there are So you can see that the first american build from the monkey pox genome had 197 128 The new one has four base pairs less and then this is another monkey pox virus, which is the one from The uk so that's the great britain one, which is a little bit different in length again But at least we can have we learned that the monkey pox is a very very big virus And this very big virus is having a lot of sequencing data All right, so back to notepad plus plus so we still need to do something, right? Because we still need to put the names on the fusta sequences So I'm just going to say that the names of sec dot fusta is going to be The rsec and then I'm going to take the header lines And those are going to be my names. So if I go to our GUI just to check So now if I look at sec dot fusta and I look at the names of them I should get three names and indeed I get one two and three names Perfect All right, so that is our first challenge done So to do one is load genome sequences from ncbi and fusta We can do that using this, but I'm just going to make a little function So I'm going to say load genomes is a function. It takes ids, right? So Ids And I'm going to return Return our sec dot fusta Formatted at the end from this function and instead of using row name sequence ids one two three I'm just going to swap this out and say ids Perfect. All right, so we get a new file and then we put our function in And we're just going to say Save this and we are going to put this in r and this is going to be load genomes Let me call it download data Dot r right just a little bit of a different function name and I'm going to here also call this download data dot Well not dot r. All right, so we have our first function So the first function is just going to download data and I'm just going to Do the first three just to make sure that I have it That that I tested right because if I would download all like 75 sequences Then I would Stress the server a little bit and if it would go wrong, I need to do it like two or three times And I don't want to do that So I'm just going to call this ncbi sec, right? So and I need to also do a source because I need to load in my function So I'm going to say r slash download data dot r I'm going to source this file It will contain my function and I'm going to query the first three sequences All right, so let's go to r make sure that everything works and that we get our three sequences So now when I look at ncbi dot sec and I ask for the length I expect it to be a length of three because I asked for three sequences and the names should not have changed Good, so this seems to work So I'm now going to remove my artificial limitation and just download all 75 genomes in one go So I'm just going to press enter. It will take a little bit of time But that should download all of them and then we can check if everything went well All right, so let's not wait for r. Let's just go here, right? So I'm just going to read remove it so Very clear code like we have a single file that we source the code is there and then we have the function call All right, perfect. So r actually finished. So if I ask for the length of ncbi dot sec I have 75 sequences and I could ask for the names as well And then we would see that there are names So we we get a little bit of information right where they're from and but they're all coded That's why we make the excel file. The excel file is our annotation which we're going to use to Make three make trees or do other things Good, so the first thing that we need to do is make a blast a bay Right because we now have all of these sequences, but we need to do something with these sequences and we need to Annotate them. So annotating these sequences means taking a protein sequence and then blasting it to our blast database So let's make it a blast database. So fortunately a blast database is very easy because a blast data Database actually takes a fusta file as input, right? So we can just write out our fusta file So we can just say for header in names Of our ncbi dot sec. So our ncbi sequences that we just downloaded. What am I going to do? Well, I'm going to cut And I need to figure out where I want to store my database. So I'm going to say that database is located in data slash dna FSA, right? I'm just going to give it any extension. It doesn't matter where you save it So we're going to cut and we're going to cut to file is our database And what do we want to cut? Well, we want to cut our header and then we want to add a new line to the end and then we are going to Cut our sequence That we got so we're just going to reconstruct the stuff that we got from ncbi And I am going to say append is true So that actually we don't overwrite the file every time and that means that I also have to make the file So I'm going to say cut nothing Empty string into our database. So file is db All right So I'm just rewriting out the data that we got from ncbi to a file on my hard drive This also means that I don't have to continuously reload The the files from ncbi because once I have the database then I can just use my database Good. So run the code runs really really fast Let me open up the database for you guys so that you can see it. So let's go back to notepad We go to see hit hub monkey 22 data and then we do our dna database dot fsa, right? So here we just see the database that we just created. So again the Names with the sequences and of course the sequences are on one line Fortunately blast doesn't care about that the fusta convention is to have 60 characters on a line, but we don't have to All right, so again creating the blast database. Let's make a little function out of it as well, right? So to keep in the structure. So I'm going to say function So make last day bay is a function which takes as an input sequences and Going to add this and in the end I want to return the database location Where I stored it and I want to also make this Variable because we might want to make a database with Protein sequences as well. So I'm just going to put the database as an option Give it a default value and now I have to make sure that I replace all of the NCBI sex Um, yes, so the NCBI sequences with the sequences that have been put in Just in case we might want to download two or three different NCBI sequence databases Good. So now we have our make those last database. So I'm going to of course call this with our NCBI sec and I'm going to Store this as database so that I know where the database was created Good and I'm going to make a new file put the data in I'm going to close the monkey annotation And I'm going to save this in r and I'm going to call this blast day bay dot r All right, so now here we have to source it And we are going to source blast day bay dot r and then we have the function available and we can do it Good Any questions so far we downloaded sequence data from 75 different genomes We made our blast database so that we can blast and we now go and do the next step And the next step is going to be load our associated proteins from NCBI, right because We want to get protein models and we need to get those that belong to the virus So there's two ways of doing this. We could just go to NCBI and start doing the same thing as what we did Had just make a new excel file put all of the different proteins in there But we're not going to do that because we're going to use this r interest function again, right? So this r interest function is a very very useful function Or very useful package Because it allows us to do cross database queries as well, right in the download data. We just do an interest fetch But interest also has an interest link function So the interest link function looks a little bit like this So we can do interest link and then we can say db from Is so we want to search from the new core database because we had the the nucleotide database I want to say that the id that i'm going to search from is the one that we found right because when we looked into the Firefox window, I think that that's this one Firefox where are you we found that this one here Right has proteins annotated. It doesn't just have the dna sequence It does that's all the way at the bottom, but it also has proteins associated with it. So just let this zaire 69 96 so it's it was found in zaire in 1996 and i'm just going to take this This number right from from here and i'm going to go to notepad and i'm just going to say well I want to get everything Which is in the nucleotide database attached to this nucleotide database id And i'm going to say i want to get everything which you have in your protein database Right, so i'm just querying from the nucleotide database asking which Proteins in the protein database are annotated to this sequence And i'm going to call this links Right because it's called interest link All right, so how does this look in r? So if we go to r if we do an interest link Then now we get an object And this object is pretty complex, right? It's an e link object with contents It has links or the ids for the linked records from ncbi So How do we get now the protein sequence as well? We can ask the str to look at the structure of this object and then we see that hey interesting It's a it has something called dollar links in that's a list of two And there are the new core proteins in there and we also have the coding sequences We don't really care which one of the two we take So let me see how that looks so we say links dollar links Then we get this and then the links that i'm interested in are the new core proteins And these are the ids for the proteins which are associated with this Monkey box virus which was detected in zaira So and let's just copy this and we go back to notepad So in notepad we're going to say that these are the protein dot ids And now of course we need to make sure that That there are really proteins right because if the length of this Is larger than zero then of course we have proteins Because and let me let me make sure that that is correct Let me open up this monkey box annotation because when we when we looked at the original data from the american group We saw that there were no links no protein links Right, so if we would just query this american sequence by just updating the id Then we would not get back anything right because now if we look at our link subject It will still say that it's an elink object But when we look at the links It's no there are no protein links which are associated to this virus So in this case we can't use the american sequence because they have not annotated their proteins yet But we can use the zaira virus to get all of the proteins in theory, of course We could go through all of the 75 entries that we have See if there's a protein associated with the virus and then Remember that but that's going to be 75 queries to the database and we don't want to stress their database So i'm just going to say well, let's just do one first right now Let's let's not do 75 queries and download hundreds and hundreds of Of proteins right because you can see that from with one genome sequence right so from this one genome sequence There are 191 proteins annotated on the genome Display is not clear. Oh is that so? Does it need to be bigger? Let me do an edit GUI I can make the GUI a little bit bigger. I hope that it that's more clear than Is this more readable? just Hoping that it's good. I don't see any issues. It might be my streaming quality I'm streaming from a laptop since I'm at home. So I'm not at work. So I don't have the Like fantastic internet that the university normally gives me but um We'll just have to do it and of course you can get the code from github So I'm just going to write down this function and then we're going to commit the functions that we had so far So you can follow along and see it high definition there But have what we see here is that when we when we query a nucleotide sequence, which does not have proteins associated with it We just get a null back and that's not what we want because we want to get the protein sequences but Like this should be okay. And I need to see my streaming All right So if the length of the protein IDs that we get is larger than zero then we get proteins and now we want to download the sequences Fortunately, we already wrote a function for that, right? We wrote a function called download data But the function download data if we look at it It actually has the database that we want to query hard coded. So I'm just going to say here Dabay is And standard I want to query nucleotides Nucleotide right and here I'm just going to change the word db or the nucleotide by the db that we specify So now here in my analysis, so if the length of the proteins is larger than zero I can now just use my download data. I can give it the protein IDs and I can of course now query the protein database So I'm just going to update that right? So let me make sure that I actually Reload my function. So let's go to r reload the function. So now when I look at download data The function actually now has two parameters So the IDs that I want to query and then the database I can specify by the second parameter so that should be okay and Of course, we get our faster data back. So we can we also want to have the proteins in in faster So when we go to notepad plus plus we can just say well, these are our protein dot sequences And we download them and then we have them And then we want to return them because I want to make a function again, right? So I'm just going to say protein dot sec return it And if if we didn't find anything I just want to return a0 Return A0 Right and I'm going to make this a function. So this is going to be query proteins And that has uh, that's a function and it has an id Right the id that I want to query standard I want to not query the american sequence, but I want to query the other one. So let me get the The one that we were looking at which was on firefox So let me go to firefox So the zaire version of the protein, which is called this one. I'm going to change here the id And then that's it. All right So now if I would do query proteins query, oh query proteins proteins And I would just call the function then I would have m dot prot My proteins or something like that and again, this one goes into a new file So I'm just going to create a new file copy the function in and I'm going to call this query proteins r And I need to make sure that in my analysis. I now source slash query proteins Dot r right. So this makes the function available and then I can call the function So let's try that, right? So let's make sure that we Run the whole code from top to bottom, right? So just to make sure that we didn't do any any mistakes or that the code doesn't depend So I'm going to go to r and in r. I'm going to throw away all of my objects, right? So if I remove my objects, I can just do this call and now my r is completely empty. Nothing is loaded So I'm just going to copy paste in the code, right? So we're going to download the data from all of the sequences. So all of the dna sequences We're going to create our blast database and now we have m dot prot Which should contain all of the protein sequences. Of course, um, let me just look at the first one And so here we see the first protein, which is called chemokine binding protein from the monkey fox Zaire virus Of course, it's very mean to now just use this protein and query it, right? But we can if we want we can just query this to the genome database and see how many hits we get um, but of course This would perfectly fit to the zaire genome, but it might not fit to the current version of the monkey box Genome, but that's why we want to build In the next step a protein model Right, so I am going to query a couple of these, right? So I'm just going to say While m dot proteins, right? Because we know that from from r, right? We know that if we ask for the length of this That we have 191 proteins on the genome So let's just download like 20 ish annotated genomes so get the the proteins from like Well, why not do all 75? We're just going to do all 75 So I'm just going to say so for ID in The names of the ncbi sequence, right? I am just going to query the proteins of this ID And then I'm going to remember them So I'm going to say proteins Oh, I'm going to say proteins is initially empty Then we're going to query the protein database and we're just going to say proteins I don't know why that doesn't work. It gives me the suggestion to But I'm just going to say proteins So I'm just going to go through every genome that I have loaded query the databases and then um, it's um Full screen is okay. It's not your internet. It's a bit blurry when it's small. Yeah. Yeah, let me let me make r a little bit bigger Just to make sure, right? So say when it's good, um I'm gonna go to edit GUI preferences And then I'm going to do size 12 Oh, then it becomes really big Well, in the end, we just want to see like fancy fancy. Let me just say names of m protein Right. So these are all of them. So this This should not be blurry Now because it should be like big enough to view All right So the thing that I'm going to do is I wrote my little query proteins function I'm going to say I have a list of proteins initially. It's empty I'm going to go through all 75 genomes and then I'm going to query the proteins associated with That genome and I'm just going to make a big list, right? So in the end, if every genome would be annotated, I would expect 75 times around 190 Protein so that's like 7,000 ish proteins But not all of them are annotated fortunately But let's just start this in r, right? Because um I am going to do one more thing because I'm going to say sys.sleep And make sure that you sleep like half a second after querying the proteins Just do not stress the server too much, right? Be a friendly neighbor and don't Stress the ensemble or NCBI server Good so and this will run for some time and it will download all of the proteins and we could quit it halfway through Because we don't need all of them, but in this case we could get an idea All right, so this is just a little bit of waiting. So I'm going to do a Quick quick cigarette break and I will be back in a couple of minutes And then we can more or less continue Um, so let me Mute my microphone. I didn't prepare any animated gifs. I'm very very sorry. Normally when I do lectures I create animated gifs and I have this like waiting screen Um, but unfortunately I forgot to do that So you guys just have to look at the background screen and um I will be back in like five to ten minutes And we have to wait anyway, right because it's a large amount of data And we're only going to do this once and I'm going to save it to a file to make sure that I never have to download it again Yeah, just to be like friendly to the NCBI server Good, but we're making good progress like if we look at the code so far We did to do zero one two and three So we're kind of halfway through what I wanted to show you guys to do a real kind of analysis And of course, all of these things Are in the r-course all of these like tips and tricks like using rep or using the l apply or using apply Um, you you learn those when you when you do the r-course and I think after like lecture number five. All right Very good. Very good. It actually finished. So I'm going to forgo my cigarette Um, so let me ask the length of proteins now first look look at the warnings, right? So the warnings are a lot of them Right because it it gives me all kinds of xml errors And that is because a lot of these don't have anything annotated Right. So if there's no annotation in the protein database, it just gives me an error when I try to query non-existent sequences But now when I look at proteins, I should have a big list So what is the length of it? So we downloaded in total 12 proteins. That's not a lot. That seems too few That seems way way way too few renamed That is a very interesting That is very interesting Because I I expected it to be much much much longer But that's that's interesting why it doesn't do that Interesting. All right, so let me see names of NCBI sec. Those are these I already understand. I already understand We we use the names of the sequences, which is the fusta header But the fusta header is not what we should use the the short name So we should not have used the names of NCBI sec We should have used the row names of our little table, which is called source IDs So let me fix that and then Row names Off This one Source IDs. All right, very good. So let me check Right, so let's go to r make sure that we check that So, yeah, we only want to use this part of the name So I'm actually surprised that it found proteins while giving it the full fusta header In some reason it figured it out But um That means that I can still have my cigarette because we need to download our proteins again So I'm just going to copy this go to r and then paste it and be back in around Well took around Four or five minutes, right? So I'm just going to do a very short break and then we can work with the protein sequences and start doing some protein models Good, so I will be right back All right, so I'm back. Let me guys show you what r came up with so r came up with an error and that Probably means that I was querying too fast but Let's see right length proteins. How many did we get before we got cut off? So we got five thousand three hundred and ten proteins, which I'm perfectly fine with so this is more than enough so We now have all of the protein sequences We have well, not all of them, but enough of them and we have all of the 75 Genomes that we are interested in so we can now continue so Very good. So next step Go back to notepad Create a protein model, right? So we need to now find one protein that we want to focus on so this there's two ways of doing this We could create a protein database take a protein and blast it We could also just look at the names, right and see if there's any names that strike out or are interesting to us So let's go to r and just look at the names of the proteins, right? So this this is a long list and I'm just going to table them Because the same protein should be in there multiple times, of course And then when we go all the way back then we see here. So there's a bunch of these GP 100 and threes that only occur once And we also don't really see anything interesting here. So I'm just going to look at the top 25 Names after tabling them. So for some reason all of the All of them are only in there once which is a little bit annoying Let's ask for the maximum value to see if anything occurs twice No, which is logical that they don't occur multiple times And that is because they all have unique protein names So that's something that we have to remember But I want to I want to see a little bit more want to see if there's an interesting Interesting protein name that we can use And otherwise I'm just going to take one So there's this thing called the viral kind protein So let's see if that is actually downloaded So I'm just going to use grep. So I'm going to say grep through the names of the proteins And what do I want to grab? Well, I want to grab for viral kind Why not? Right? It's just the name of a protein. So let's go to r And see if it's there. So yes, we have Seven versions of the viral kind protein. So if we look at the proteins, then I think I selected this because it's not too big of a protein So you can see that it's a relatively short protein, right? And you can see that it's in several monkey box viruses that they they found it in And we can see here that it's not not a crazy long protein Right? And when we look at it, we kind of see that it almost seems like there's no variation in them So that's a little bit of a shame because we would want to have a virus or a protein which has a little bit of a variation, right? If there's no variation Then We can't build a consensus model. We just know that the protein is the same But in theory, we could just take this protein Blast it to the genomes and see where it is located in each of these genomes And if it's located more than once in some of the genomes But there doesn't seem to be any variation in this case So it's not too interesting to build a model of So let me actually do something else Right because I want to know Which names occur multiple times? And I've tried to use the table, right? Which didn't work because of the initial name, but you can see that if I would just take the names of the proteins And then just do a string split on these names by space And then I would kind of Dump the first one or or take this and so now we already start seeing that there are many more Interesting protein names, right? So for example the monoglyceride libausa, perhaps that's a little bit bigger So i'm just going to grab for that And look if there's any variation. So yeah, here we see some variation, right? So it's mst in one version msa msa in another version So we see that there's a little bit of variance in this protein not a lot So but there is a known more or less polymorphism When we downloaded it Good. So now we have our proteins, right? So we can do two things so we can We can just blast these proteins or we can do a multiple sequence alignment But the first thing that I like to do is make sure that we use Something which is called leavenstein distance to make sure that we know that our proteins are relatively closely related So we can do this by using the distance function Which is from the string disk package. So when we go here What we can do is we can just say give me the string distance of all of the proteins versus all of the proteins and First the string distance of the names of the proteins No, not of the names of the proteins versus all of the other proteins and I want to use method is LV Let me just do it for the first protein against all of the other ones so that you guys can see how it looks like So when I would do this Then it will tell me the distance of this protein to all of the other ones, right? So we see for example that protein number one has a relatively or has no variation to this one But it has only six distance to this one So this this leavenstein distance gives us a number and this number the higher the number the more divergent these are Right, so if I would ask anything under 25, right? Then I could ask which ones are lower than 25 in distance Then I see that this protein actually has a lot of other proteins which are very very similar to it So if I would just look at proteins And I would just look at these entries, right then you can see that yes These proteins all look very similar Sometimes it's a little bit longer. There's more variance in there. So let's just use this one, right? Why not? So I'm just going to say that my selected protein is the first one from the list We can look at the names of them as well So we are looking at the chemokine binding protein and this chemokine binding protein is sometimes called slightly different But using leavenstein distance, we can say well 25 we could be a little bit more stringent And see like what will happen if we have a maximum distance of 15 Then we see that we still get like 55 proteins in our 5000 protein database that we just built which are very similar Good, so let's make a consensus sequence out of this one. So we have 55 entries So I'm just going to take this whole code right And I'm going to go to r And in r. I'm going to say well Do this And then say this these are similar Right to or similar to the first one, right? Of course, we could do this for all of the proteins But then we end up with like 180 Something unique proteins But now we have one protein which is found in a lot of these different strains And it is very similar in all of these strains. Sometimes it's a little bit longer. Sometimes it's a little bit smaller So we need to do multiple sequence alignment and fortunately we have the msa library already loaded because that's here So we could just do the multiple sequence alignment of these proteins and be done with it So I created I think a little bit of code because we need to Make an amino acid string set. That's what it's called So we can say so that's similar. So when we say proteins Similar we get all of the ones which are similar to the first one So similar to this kind of chemokine binding protein, right? And then we just say a a string set And then we create an amino acid string set out of it and then we just do multiple sequence alignment on it All right, and then we call this My msa Which comes out All right, so let's go to r. Let's run the multiple sequence alignment on these proteins Uh, amino acid string set the s is also capitalized Uh, selecting a method objects similar Not found. Oh, right because I didn't save it. So let me go back to notepad make sure that I So let me delete this one so let's Let's first find all of the similar ones and then we use those to build a multiple sequence alignment tree So we go to our GUI And I forgot to fix the s as well. So let me just copy paste it and All right, so now I did a multiple sequence alignment so we can just plot this if we want it Right and then oh we can't plot it because it We need to first convert it to a second air alignment. Um, but We can get the consensus sequence. I think So we can get the consensus sequence by using the msa consensus sequence function on the multiple sequence alignment that we did And then it will tell us that this is the consensus sequence All right, so we can see that every protein starts with more or less the same Then there's a part which is variable and then we have the end of the protein which again is very similar So it's just the msa If we would look at my sequence alignment, it would look like this So you see that there are some which have like an insertion and others which don't So we build a consensus model and now we can use this consensus model to to scan right? So we can use this consensus model to To find where this thing is located in the monkey box genome So let me get this code. So my msa We now have our consensus sequence and then I want to tblast and the consensus sequence to the viral genome database So fortunately, we already built this database. So we don't have to do anything like that So we can just use blast. So how do we use blast? Well, again, the input to blast is going to be a fusta file So we're going to do the same thing as that we did before but this time because I'm doing blast I'm going to use a temporary file. So I'm going to say temp file Is or I'm going to call it tmp file is the function temp file and this will just give me a temporary file I'm going to empty out the temporary file tmp file And then I'm going to say for header in No, I don't have to do that at all because I have a consensus sequence. So I only have to blast one So I don't have to do a for loop. I can just say cut This is going to be For what was it called again chemokine binding protein And we're going to use the consensus for the chemokine binding protein We're going to do a new line a new line And then we are just going to cut The consensus sequence Consensus sequence and we are going to add a new line. So this is going to be my query file So for blast I need to specify what I want to get back And I prepared this beforehand because it's just a lot of explaining to do But when you go to blast and you go to the blast manual then in the blast manual You can say that you can specify your own output format So how do I want the results from last back? So I'm going to say I want to have it six, which means I want to have it back in a table I want to have the query sequence id the sequence which we find to start and the length and these kinds of things And then I'm going just going to call blast which I installed on my computer. So I'm just going to say Wait, we forgot one thing in the make blast a bay Because when we make the blast a bay we only output the faster sequences I forgot that I also have to do an indexing step here So the indexing step is relatively easy I'm just going to say This so I'm going to make a system call where I say make blast a bay So this is just going to be a command line call. So blast needs to be installed on the command line So it is Making the call saying that the in is my database, right? So the file that I have And then I need to give it a name So I'm just going to say this is viral genomes or something like that and my database type is going to be nucleotide nucleotide Ah, let's keep it the way that I had it, right? So here we have a database Then we have to specify a name. So name is monkey debay or monkey dna And we are going to say db type is Nukle Right. So now I'm just going to use these variables. So let me actually rerun the making of the blast a bay In r so that it does the call to the command line as well And now when I would have db No, I first need to Source the file to update the function And now when I make my blast a bay it will actually tell me that it is building a new database It took a couple of seconds or less than one second and it added 75 genome sequences to there Good. So now we have our database All right, so we have our database so we can now start querying it so we can say do a system call The tool that I want to use in this case is called t blast n, right? Because I want to blast nucleotides or I want to blast protein sequences to a nucleotide database The database that I'm going to use is just the database that I got back from the blast call The out format is the format that I want to query So I'm just going to give this format and then I'm going to say temp file is The temp file. So this is where the query object is All right, so let's go to r, right? So I'm just going to first Oh Here cut I want to do this to file is tmp file Append oh append is true All right, and I'm going to have to add the consensus sequence also to the Temp file and I also want to append this So let's just go to r Copy it in Consensus cannot be handled. Oh, I'm forgetting to copy code again. So My consensus And then I'm going to use here my consensus. All right, so let me do this similar the multiple sequence alignment Do the temp file And make it All right, let's go back to r All right. So now when we open up this temp file, right, which is called This so I can just go here and get the file so I can just say Probably here I can go there and then I can get this temporary file No, I can't because for some reason this file does not exist anymore Yeah, well, you just have to believe me. It's just a simple fusta file again Which we just had we do the same thing for the blast database as well But I can't show you the file But the file will have two lines in there The first line will be consensus chemo-crime binding protein and then it will have my consensus The string that we just made in there And now we can start and try the blast Because we now have the file So we just specify that we want to do blast with The database that we have we want to do tblast n and we want to specify our own output format So when we go to r and we do this call Then it should be okay. So now we get the rest back And here are our results from last so we just took our consensus sequence And then we blasted it to the 75 genomes in the database and with that we now get our results back It says that here see fusta reader hyphens are invalid and will be ignored around line 2 So why is it saying that? Well, if we look at my consensus sequence You see that there are hyphens in there and this is not allowed because in a consensus sequence the hyphens should be replaced by ends Because anything can be there or nothing right? So a hyphen means a gap But when blasting you can't have hyphens for gaps You have to have ends for any nucleotide So let's fix that right. So when we print my consensus, I'm just going to say g sub So g sub my hyphens by ends Right. So any amino acid or a gap is perfectly fine. So let's go back to r Let's copy paste the code again and see if we still get the error at the beginning So when we look at res now, then we go all the way to the top and now it has no problem with this Good, so you see that we just get a bunch of strings back This every string is one line of the blast output You see that the output is very good in some cases some cases It's not that good because the e value is in the last column like I asked But now we have to parse this right? So we have to parse this into A format that we can use right? So what are we going to do? Well, first things first We are going to remove our temp file Because we don't need it anymore. We already have res which contains our blast results So now what we want to do is we want to make this into a matrix And this is relatively easy But what I'm going to say is I have values right the values are initially na So that's every value in my database So what am I going to do? Well if the length of what I got back Let me actually move this So if the length of the results that I got back is more than zero, right? That means that blast found something now I can get the values that I have So I'm just going to say well take my results and string split It by the tab character right because in r we saw that we have this tab character separating the individual values So that's what we can do and then we are just going to We are going to l apply to res The string split function and we are going to split by tabs And this is going to be perfect because when we do this in r then we see we get something like this Right, so it says that the consensus was found in this ID from and then all of these things right, so I'm just going to l apply the string split and In the end I don't care about any of this. I'm just going to say unlist everything and those are my values Right, so everything will now be on a single line. So if I now would look In r and I will do this call and I would look at my values then the length of my values would be 2016 Why 2016 well I have an x number of rows every row contains an x number of values separated by tabs so I have in total queried four 1 2 3 4 5 6 7 8 9 10 11 12 13 14 So I asked for 14 things 14 14 14 Anyway, I asked for 14 things right so I have 14 columns And I know how many rows I have because my rows are the length of res So I can now just say well make a matrix Make a matrix of my values How many rows do I have well the rows are the length of res I have 14 columns and I can directly say by row is true right because the The values are row based because I did row one then I did row two then I did row three and I am going to call this mm for my matrix. All right, just simple try. Let's go to r An unexpected symbol in mm Right, I forgot a comma after by row, right? So now if I look at mm, then this is a matrix of the output that we just got back So we can check that everything's okay, but it's filled in a row wise manner I asked for 14 columns. It gives me 14 columns. So this should all be perfectly fine So let's just Call this res now. So override the results that we got And now we can actually say that well we can add the column names, right? So we can say that the column names of res are actually the names that we just queried for So those are these Right because it gives us in this Order and I need to make sure that I change this every time By this right because those are the names of the columns that we queried for So I'm just going to use the blast names on the matrix that I have And that should be perfectly fine. We have mismatch. All right It's bad because we want to do here. All right, lay out a little bit Come on Change it All right, so these are then the names of the columns that we have Right, just the same as that we had before but now we just put them on res All right, let's go here. Let's go to our gooey Right, that's true because actually there was a comma missing and I didn't fix that So Let's fix that and now when we look at the rest, let's look at the first Five rows then it looks like this right so we can see we queried the consensus to this sequence. It's It found from one to two Five two so it found the whole sequence But in the genome of the virus it starts at a hundred and ninety seven thousand and it ends at a hundred and ninety five thousand And we see that in the In the next genome, right? No in the same genome it finds the same sequence again But now at a different position So it finds it at seventeen hundred until one thousand So this gene is actually twice in the genome. So that's something that we learned We learned that every monkey pox virus comes with two of these chemokine binding proteins So We already learned something that's good. That's good. Okay, but Now we need to do something else still because have we need to make rest a data frame because when we look at it in r We see that everything is a character Which we don't want because some of these columns are numeric. So we're just going to say data dot frame Um res and these are my blast results. So I'm just going to call them res again good one of the things that we know from blast and Let's go back here is that every blast hit comes with an e value, right? And the e value tells you how good the hit was So we see here that some of these hits are very very certain like it's kind of a p value But then for blast, um, but we see that this one actually is not really significant So it seems to be in the genome, but there are five gaps and 30 base pairs in the gap The number of identity matches is only 40. So it's it's a it's 40 amino acids of the whole 142 No of the whole 252 proteins Is is in there, right? So it's only a very little piece which matches But the rest is all mismatches. So that's why it gets a very low e value score Furthermore, we see that something annoying has happened with the sequence IDs, right? Because our sequence IDs which are in the row names of How did I call that again? I called this source IDs, right? So because when we look them up Our names are like this But for some reason while blasting it put the Put something in front and in the back, right? So we have to fix the asset the source sequence ID column because it it it has It doesn't one-to-one match the IDs that we gave it and this is this is common in blast blast always does this It recognizes that this is a genbank ID And sometimes it recognizes IDs as being reference or reference sequence IDs or other types of IDs So it puts these things in front. Fortunately, we can easily parse this out because when we see two of these horizontal bars We know that our ID is going to be in the middle of them So we could use a very fancy regular expression. I generally don't really like regular expression. So I'm just going to use the l apply and then do a couple of reps to kind of get rid of that So let's go back here. So the first thing that I want to do is I want to say res comma e value Right and as numeric just to make sure that it's a numeric value This needs to be lower than 0.01 before I'm going to consider it Right and then I'm going to ask which ones have this low on e value and then I'm going to To take those so I'm going to subset my matrix with e values Which are good enough for blast so that I know that I'm only looking at hits Um, and then I have to fix this uh s sec id column. So I'm just going to say well res comma S sec id Right, so if we go back to r just so that you guys know how this looks. So here you see that sometimes it's gb Sometimes it's embl. Sometimes it's the reference sequence. So there's different different Different identifiers there. So we can't just Grap or or g sub or even build a regular expression for this. Well, a regular expression might still work Um, but I'm going to do it slightly different because I'm just going to say well What do I want to do with these things? Well, I want to For each of them, right? So for the first one, I want to grab If there is a horizontal line in there Right and if that is the case Um, then I want to know how many there are because it might be that the id that I use also uses this horizontal line So I'm going to string split My res and I'm going to split it by nothing so that it goes into Individual letters, right? So now you see individual letters and now if I would grab For the horizontal bar in the string split then it will tell me one Why will it tell me one again because string split gives you back a list? So I'm going to make sure that I unlist this. Um, and then it should now give me two positions of which there are the Horizontal bars you see that it doesn't give me two positions It gives me every position and this is because this this thing here is also a wild card So I have to add fix this true, right? Because I really want to search for this character I don't want to use it as a wild card. So the third is a horizontal bar and then the 14th is as well Good, so now I can build up an if statement, right? So if The length of this whole crappy thing is larger than Is is two then I know that I am dealing with one of these id's that was changed by a blast Of course, I don't want to do the if statement here, but I'm just going to copy paste this into notepad So that we work there, right? So we're just trying to fix this s s s sec id problem where it puts them in So I'm going to do this for every sequence id. So I'm just going to l apply to res comma s sec id What am I going to do? I'm going to apply a function which works by x And now I can change this whole thing here this res one sec id I can just say x right because I look at every entry individually So if this is the case, so if it is two then I want to get rid of those, right? So how do I get rid of those? Well, I can again use string split. So I can just say string split x Right, so string split the current protein identifier that we're looking at string split it by this horizontal line Fixed is true Right, and then I want to l apply to everything what we get back I want to select and I want to select the second element Right, because if we go here, let me show you guys that in r, right? So if we do a string split of x and x is called res one comma s s e q id We call this x, right? So if I string split x By this horizontal line, then it says that the first element is called gb Which I don't want and I want this name back because that's the name that I'm using for all of my sequences is the the standard more or less Good, so this seems to work, right? So I'm going to l apply I'm going to select the second one and then I'm going to unlist it because l apply and these kinds of things Always need to be unlisted and I'm just going to return this if I want to Right, so I'm just going to return this If there are not two of these horizontal bars, we might have a different type of name So then I'm just going to not change the name. So I'm just going to return x itself good So Will this work? Well, let's see if it will work. So we go back to r and now we see That indeed it ate or hey, it selected the based on the two horizontal lines If it finds two horizontal lines, it will take the middle element And when there is none Then it doesn't do it and this of course becomes More pressing when we have sequences, which we did not get from ensemble Or ncbi, but when we get sequences, which we download from virology.org, right? Because if we download our own fuster files Then of course we We might not have an official id So because of that we need to fix that All right, so I'm going to enlist everything and then these are going to be my id So I'm just going to put this back into the column where I took them from So I'm just going to oh you guys can see that right. So let's go back back back So have we have the l apply which works? So I am going to then say unlist all of this And then I'm going to put this back into the column where I took them from So in this case that's called res s sec id Good. So now if we would run all of the code, right? So we would query Oh, I already so let's make the query again. So let's do the whole Section five. So we we make a temp file. We put our sequence in there We specify the out format we do blast We load in all of the values we split them using tabs We put them into the matrix and then we make sure that We only take things which have a low e value from our matrix and then we just fix the ss sec id names All right So do the blast query and now when we look at the rest then now our rest table should have very significant entries and We now see that indeed we have Different. Well, not all of them are perfect. Like this is only part of the whole thing But that's good So and our sequence id seem to match as well Perfect. So now we have Done part number five. So of course part number five We are going to do the same thing as what we did before. So we're going to just cut the code And we're going to make a new file. We're going to put it in We are going to put this in r and we're going to do blast dot r All right. So now when we are in our analysis, we're going to source um r slash blast dot r and then I am going to make a function out of this Right, because this is going to be a function. So I'm going to call it blast Is a function And we're going to do this and in the end we want to return our rest So that is good. So do we want to actually Yes, because we want to give it sequences, right Or consensus, but in this case we are We're we're dealing with With the consensus sequence, right and here I actually gave it a name, but that's that's a little bit wrong But this is okay. So I'm just going to call sequence single sequence single sequence not sequences Do we need anything else that we want to specify? Well, we want to might want to specify the blast tool So this is the tool And we are going to say tool is T blast n because we might want to reuse this Function when we are going to do standard blasts or nucleotide nucleotide or proteins versus proteins So when we now have our blast function, which we can use On a single sequence to blast it towards all of the 75 reference genomes and then we can do something So just blast my consensus and then call this blast dot res Good So now we've did more or less five things on our list. So we downloaded our sequences We made a blast database. We queried the proteins From ncbi, which we gave us 5000 proteins. We look through all of the proteins So we take the first protein in the list and we look to see which ones of them are similar And when they are similar, we're going to do a multiple sequence alignment And then based on the multiple sequence alignment, we created consensus and then we blast this consensus to the reference genome Good. So now we are at a point where we can finally finally finally start making graphics Right, so I'm going to start making graphics. So I'm going to say plot. All right So what do I want to plot? So I first need to figure out what my How many elements I have on the x-axis, right? So on the x-axis I'm going to put all of the different viral genomes, right? So I'm just going to say plot c 0 comma n row of My source IDs, right because I have that many genomes I need to figure out what the y-range is, right? So the y-range I could just say well the virus is like 200 Thousand base pair. So I'm just going to say from zero to 200,000 But I don't know if that's exactly true. Fortunately, we have our blast results, right? So if I go to r Then I can show you the blast results Have to source of course the blast function, right? So now when I look at blast address it Let's look at the first five rows. It has the s s start and the s end So what I'm just going to do is I'm just going to take from blast Oh, something goes wrong. I know that when those were so I'm going to take from blast address I'm going to take the start as numeric And I'm going to take and I'm going to combine this with the sequence end and Then we're going to do it like this and now I'm going to ask so these are all the positions And I'm going to say min positions min positions comma max positions And this should give me the lowest value and the highest value So just to see if it works. Let's go to r So in this case the minimum Location of the first or of this this consensus sequence is a 247 and the other one is a 227 Thousand actually so much larger than I would have expected But we can go here and we can now say that our why Is running from the minimum of positions to the maximum of positions? And I'm just using a comma I'm going to say type is non x-act Is non. I don't want an x-axis and my x label is going to be empty All right, so when we go to r how does this look so it looks like this So very basic empty plot. We have here the positions, which is fine We have here all of the different reference genomes or at least space for all of the different reference genomes And we're now going to add the names to it, right? So we're going to just say well add these names to the axis. So axis One, which is the x-axis at is one to the number of rows Of the source ideas And what do I want to put there? Well, I want to put the names there But the names themselves don't really mean anything to me right because I'd rather have something like USA 2020 Or USA 2020 and then have my subtype on there, right? Because you can see that some of them are called Akmata some of them are cowpox So I kind of want to know and see this in of in the in the in the in the plot, right? So I'm just going to say well, I'm going to make something called display names And the display names are going to be A paste zero Of the source IDs And source IDs and I'm going to take the country And then I'm going to take source And I'm going to take the year And I'm going to take from source ideas. I'm going to take the subtype Um, I think that's written with a capital Yeah subtype so sub and type everything's written with a capital so sub and type All right, so If I now look at my display dot names display names So I just do the call in r and then when I look at the display names I see that oh that looks bad, right? I want to have spaces in there But the first one is USA 2020 monkey 22. Um, so let's put some spaces in there as well So I want to have a space between here and I want to have a space here for the subtype as well All right, let's go to r. Let's Paste in the display names. Look at the display names And this these are names which are suitable to put on our plot on the x-axis. So how are we going to do that? Well, we're just going to say at these positions Show the display names. Don't show the original names, right? And of course I need to make my plot a little bit bigger So I'm going to say op par Mar is c Give me like 10 at the bottom at the left side. I only want two At the top I don't care about and the other axis I also don't care about So I'm just setting some margins so that when I make my plot The names will kind of Be able to be shown All right, so here we see our plot then Um, and I have to make sure that they are Rotated so I'm going to say l es equals 2 cx.axis is 0.7, right because they are too big to show otherwise So just a little bit of layouting and now we can see that indeed we start seeing More or less all of the 75 genomes that we have so we see the different monkeypox then we see the nigerian outbreak from earlier, so 2017 here we see for example the different Akmeta viruses and here we see for example the vaccine we here have the variolas and the Yamada strain which are all not monkeypox but smallpox So now we can actually start plotting the data from last on this little consensus sequence across the genome So let's do that So let's let's do that because in the end we want to kind of visualize the genome and see if the layout of the monkeypox genome The current monkeypox genome is different from for example smallpox, which would be good because we don't want it to be smallpox But here we have our display names. We have our axis. We have our plot So the only thing that we now have to do is make sure that we plot all of the different All of the different values, right? So what what are we going to do? Well We have a single protein that we're looking at so I can just say four in Not genomes I'm just I'm just going to say four name in the row names of source IDs Right, so now I need to find if this Name is actually in the blast result. So I'm going to say blast results at this At this column that we just fixed because had the column was wrong. So let me see plus plus plus the pay plus themselves Which was called? Uh s sec id so source sec id Is his name? Right, which ones are those and I'm going to call this ii so for i in ii Right because it might be that the same gene is multiple times in the genome So I have to make sure that I have a for loop for all of the genomes But I also need to have a for loop for all of the times that The the gene might be in the genome And then I'm just going to say well, okay, so now I know on my blast address Right, and I know that I can take i And I know that I need to have the s start And so this is ys The end of y is of course the same so s end And now I can do a plot And our points right so I can say points Which gene am I currently at or which which row of the genome am I currently at? Well, I just have the name so I don't know that but I can figure that out because I can ask Which row name of source ID? So my x position is going to be Which row name of source ID? This is the name that I'm currently looking at All right, so now I can say draw a point at x plus comma x plus And then we can say the y position is going to be y start comma y end And we are going to say type is Um line Something like this. All right. Let's see what we have so far, right? So we're just going to very basically iterate over the genomes for each genome We're going to try and figure out if the gene that we are looking at which we made a consensus sequence for us in there And then we see that it looks like this, right? So it's very very tiny. I hope you guys can see it. Actually, let me Do this a little bit different. I'm going to say b for both Um pch is 19 and color is orange All right, so let's add some orange colors, right? So good So now we start seeing some interesting things, right? So what we can actually see is that this gene Is normally two times in the genome of monkeypox, right in the new monkeypox It's in at the beginning and it's at the end of the genome However, when we for example look at the central african region So the other clade because in africa you have two different strains of monkeypox You have the west african strain and you have the kind of central african strain But the central african strains Don't really seem to have two copies of this gene They only seem to have one copy of the gene at the end of the chromosomes If we then look at all of these here These don't seem to have any copies of the gene. So these here are the indian. So the the goat Variant of monkeypox, right? So this is goat pox goat pox doesn't have this gene at all And we see that the same thing holds for some of these other Entries here, which are like the turkey pox and the the the ones from arsia. So the the The bird version of the genome does not have this gene The more or less the very virulent strain from the middle of africa has one copy of this gene And the monkeypox the standard monkeypox which are not that Heavy more or less they have two copies of the gene And we see that this is similar to for example the smallpox and the akmetah, right? So akmetah and smallpox are here Smallpox seems to have one copy of the gene at the end very similar to the african strains So the middle african strains not the west african strains But the akmetah virus is very similar to the monkeypox virus and it has two copies of the gene So of course we don't just want to do this for a single protein, right? We wrote all of these functions to do blast and to all of these things So we we kind of want to scale this up and not look at a single gene, but probably look at four or five different genes Right, so so let's just let's just do that right because we can use these blast results but we have to make some changes because Doing it for one or doing it for many is of course slightly different So let's just say well We create here a protein model, right? So we take our first protein We do a model and then we give it my consensus So let's just do this So for the first protein So let's just do this a couple of times, right? So say for x in one two Five, right? Let's take five different proteins Make a consensus sequence and then we need to remember so my consensus Right, so this is going to keep five of these consensus sequences and i'm just going to say see the my consensus Right and since we are now looking not at the first one, but at the next one We're just going to do it like this. So let's let's just run the code Um, see if we can get five consensus sequences instead of one um, so of course it needs to do the msa every time it needs to figure out the consensus and of course it needs to find all of the protein distances So it's going to be slightly heavier. Um, and did three Waiting two more. Okay. So now when we look at my consensus My consensus Oh crap, I didn't put the you see here like very very small coding issues But I am going to add my consensus to my consensus. All right Let's go to r. Let's do the same thing a little bit of a shame that has to do a little bit of waiting again But this is kind of how you build it up So you do it for one and then you start saying well What would happen if I would now look at five genes or if I would look at six or seven, right? How would the genome structure change and of course we can also just zoom in right? We could we don't have to look at this this this because hey, of course There's distance so there might be that some genes are longer than other genes Which we already saw from the consensus sequence All right, but if we now look at my consensus, right We see that we have indeed five different consensus sequences of varying length Some of them are perfect and other ones are not I I just I'm just going to have to give them names, right? So the names here are the names that I kind of want to give them. So I'm just going to say Give them names Of my consensus My consensus Is going to be a paste and I'm just going to call it gene Or no, I'm going to call it consensus comma one two five And paste zero and make sure that you do an underscore on there, right? So just give it some names So that when we talk about my consensus, they actually have names. So consensus one consensus five Perfect. So now we want to blast them But the blast function that we wrote only deals with a single Or does it that's the big question does it or does it not because when we look at the blast function that we made We have here a single sequence because we fill in the name Right, but we don't have to fill in the name. We can just say for sequence in Sequences, right? So we're going to instead of having a function which works on a single one having a function which works on multiple ones Of course, I can't cut consensus binding protein anymore, but I can use the name, right? So I can say names of sequences So I'm going to say Name So I'm going to cut A greater than symbol that I'm going to cut the name of the sequence and now instead of Using sequences here. I'm going to say sequences at This name, right? So I'm going to create a fusta file instead of having Two lines one for the header one for the sequence. I'm now going to create a fusta file which has Multiple query lines in there And each of these query lines will have or so the first line will be the header second line will be sequence Third line will be header fourth line will be sequence again Good. So blast is done. So we updated the blast function. So it should now worked on on multiple. So instead of blasting My consensus, we're going to blast my consensus Um, and then we should get back what we want. Um, so Let me see. I only updated the blast function. So I'm going to source the blast and I'm going to do the blast res And I'm going to say blast dot res just to check And um, well, let me only show you guys the first four five columns So here we see that it indeed blasts not only the consensus sequence one, but also two three four and five Right. So now we have multiple positions in the genome. So we can now do multiple images or add multiple genes to the plot So let's go back to notepad. So this all works. So now we we just have to add another for loop here So instead of going So have going through the names of the row names of the source ID I Want to know which gene we selected So have which which uh Genome it fell on then I want to go through all of the hits and then I want to color it And now I want to use different colors, of course, because now I want to have the color of the node be based on It being consensus one two three four five, right? So How do we figure out which one we have? Well, we can do that here. So we can say that q For query. So our query is actually here in q sec id Right query sequence id. So the first column contains the query sequence id and the second one So now if we do we can just cut we can cut the query and then put a new line behind it So if we now run the code to plot it should print all of our query sequences To the screen and it should actually fill them up all in orange But we don't want everything to be orange because then we can't compare which gene is located where So what can we do about that? So we can just use If you look here, we see that consensus one two three four five. So we can just say well, I want to string split q I want to do that by this underscore which I put in and then I want to unlist or I want to L apply to this The selection function and I want to have the second element because that contains the number Right because it's consensus underscore number. So this will give us the number And we are going to unlist this of course just for good measurement So if we would do this in r, then it would show us that the last consensus that it looked at was number five Good so now we can define multiple colors Oh, I didn't show you guys the r window. So when you have when you just do the unlist Right. So the q that we're looking at is this consensus five and this just takes the last element All right So now we have n and now we need to define some colors Right. So we are looking at five genes So let's just use color brewer To get five nice colors. So I say brewer dot pal Give me five colors from for example set two And I call this calls Right and now I have my number so I can now say instead of using orange use calls Right, but use n So which one do I want to select? Well, I want to do as numeric n And now it should use five different colors one for each of the different Proteins that we built a consensus sequence for All right, so let's start looking at the genome And now we start seeing more and more structure of the genome We only look at a couple of genes here I could actually zoom in right by saying that If we look at the code, I'm saying well the We get the starts and the end positions right for the the gene But when we set the plot we set it to the minimum to the maximum But we can also say well, I only want to see for example not The maximum but I want to see the first 25 000 base pairs, right? Just zoom in on the lower half of the of the genome So when we zoom in to the lower half of the genome Then we see something like this So we start seeing genes right it also shows the where the gene starts and ends I don't like the both actually let me actually do the plot twice I want to have points once and I want to have lines And when I do lines, I want to not have a line type And it could be like this right just to update it a little bit So instead of using the b for plotting both the points and the lines I want the lines to kind of connect the two dots Right, so now we start seeing that there is genomic structure For example this Nigerian strain from 2017 looks very funky, right? It has the same gene twice on the genome This gene is three times on the genome when you look at the current monkeypox virus But this one is an outlier here This Nigeria 2017 sample is not like the other monkeypox viri that we looked at We see here for example that we have the Nigeria 2018 sample here This blue protein is truncated So it's not the whole protein It's just part of the protein the the other part of the protein is either gone But it's not because it so there's something going on there But we can see now also that the more deadly monkeypoxes they have this blue thing But are they if this purple thing but much closer together And we see here that for example the USA 2003 outbreak They had this green gene which does not occur into the current monkeypox virus Or in the West African strain where it originates from We can see that this green gene actually does occur in the Akmeta viruses So in the other box classifiers So we can start reasoning about these things And of course we we don't have to just look at the genome We can zoom in much more right If we want to look at the first like seven and a half thousand It's just changing the code saying seven and a half thousand And we just do the plot again And when we do the plot then we zoom in a little bit more And now we start seeing really structure of the genome So we see that our gene here is actually found the first one in the USA And this is probably why they build version two right Because this is version one of the genome So you can see that there are issues Because this protein is like it has multiple stops and ends So it's not consistent But this is what I wanted to show you guys So hey you now have the ability and the code Let me commit the code to Gitter by the way I didn't do that So I want to add Gitt add r slash download data dot r Gitt commit minus m download data Gitt push Gitt add r slash blast db dot r Gitt commit minus m blast db dot r Gitt r slash I think I already pushed the annotation So we have query proteins dot r Gitt commit minus m query proteins row proteins Very good And then the last one is blast So Gitt add r slash blast dot r Gitt commit minus m blast dot r Very good Blast Okay and then Gitt add analysis dot r Oh Gitt commit minus m and off stream code Gitt push All right so I just updated the whole repository So let me show you guys the repository just to make sure So when you go to monkey right monkey 22 on my github Then now we have the analysis code that I just coded with you guys Or for you guys to show you And of course we have all of the little functions that we created in the r folder So blast blast db download data query proteins And when the data folder we have monkey box annotation So that's kind of what I wanted to show you guys that in two hours we started with nothing With no knowledge about monkey box and currently we have an ability to download the data Download proteins use these proteins to annotate the current unannotated versions of the genome We can compare our virus variant that we are currently looking at To many many different other virus variants and we can start drawing conclusions We can start drawing trees as well So with that that's what I wanted to show you guys If there are any questions then please let me know But for me this is kind of what I wanted to show you guys To show you how easy it is to use r using several packages To kind of get an idea of what's going on and do real bioinformatics Even with just following a basic programming course in r We're like we've now done from the r course which generally is like 12 lectures We've done like six lectures And I think everything that I use today is more or less lecture one to lecture five So if you are following the current streaming course online Then you should more or less be able to follow what we did and You should be able to replicate the code if you wanted to Right, so it's just starting off with an idea knowing where to get the data So for that follow the bioinformatics course to to know which databases there are And where you can get your information and for the rest We can make these really beautiful plots comparing 75 different monkeypox viruses And of course you can add more If you download the FASTA files from virology.org You can just load them in and add them to your database before you make the BLAST database If you are thinking like I'm a conspiracy theorist Right or I really would love to know if Any part of COVID-19 is located in monkeypox you can do that Just add COVID-19 genome to the genome database Download the COVID-19 proteins and start blasting them To the monkeypox virus to see if they're there Right so with this I'd like to thank everyone who's watching And thank you guys for being here And if there's any questions just hit me up in the comments down below Or go to the github make an issue there Fork the github take the code do with it what you want So just play around do your own analysis Do your own research like people say But do it in a way that makes sense Right so use raw data Make your own analysis and draw your own conclusions Good so with that we're more or less at the end Two hours thank you guys for watching And I will see you next week Thursday for lecture number six Probably it's going to be linear regression I think Something like that So I will see you guys in a week on Same platform slightly different time slightly different day But thank you so much for watching And just looking at chat there's no questions So perfect then Catch you guys later and thanks for being here