Alright, welcome everyone. If you're watching this on Twitch or later on YouTube, thank you for joining this lecture. Today we will be talking about sequence analysis. I hope everything goes well; I had a few technical issues, but those should all be solved by now. So let's just start with the first slide.
Unfortunately for you guys, or fortunately for you guys, I actually managed to secure a date for the exam. The exam will be on the 17th of February at 2 p.m., so write it down in your agenda. I don't know exactly how we will do the exam yet: whether it will be in person, so that you have to show up here, whether it will be online, or what kind of format. But I am going to be very, very generous, so if people show up to the exam, I won't make it too hard. So, 17th of February, 2 p.m., note it down. I will of course also send around an email to remind everyone, including the people who are not watching the stream or who are watching it later on YouTube.
"Oh my God, you got a date." Yes, I actually managed to call the Prüfungsbüro, the examination office. And you know what they told me? "Oh yeah, you sent that email four weeks ago, but I haven't been in the office for seven weeks." So, shocker. But at least we got a date. Now we still have to figure out how we're going to do it, but I think we'll manage. I'm generally really relaxed about exams. And I hope that I will be able to do a drawing exercise for you guys, because I do drawings for you when you save up enough viewer points, right? So on the exam, you are probably going to make a drawing for me for extra credit. That's what we always do. So be prepared: have your pen, paper and coloring pencils ready. "A puffer fish." "Is there also a second appointment?"
Yes, I submitted a second date, but since I was on the phone with the Prüfungsbüro and I didn't want to make it even harder for them, the second date is going to be somewhere in April. I would like everyone to do the first exam, because that's the easiest for all of us, right? Then we can just do one exam, I can grade it, everyone gets their passing grade, and then we're done. But yes, there is also a second appointment, and I will let you know in the email. The second appointment is not finalized yet, so it might still change, but I will make that clear in the email. Good. So good luck on the exam. I will probably mention the same thing again next week, just to make sure that everyone has it fresh in their mind that they should start studying.
Good. So the overview for today is sequence analysis. I added the last part, which is completely new; I have never done it before. I put it at the back so that everyone stays and watches the last part of the lecture as well, and I'm hoping that it will be fun. But we'll start off with genome annotation: how do you annotate a genome, or a new genome that you make?
Genie 8-8: "Yes, the second exam date would be great. Some exams are overlapping." Yeah, there's always going to be a second date, and if we do an oral exam, then we can more or less freely plan it. Why don't you have a little diamond, Genie 8-8, by the way? Let me fix that for you. So, let's see, community. Where do I do that? It's a long time ago that I set a new VIP for my channel. So, add a new one: Genie 8-8. Next time you say something, Genie, you will have a little diamond in front of your name as well. I try to use the VIP status for students, so I know who is a student and who isn't. Although Misha has one as well, but he's here every week, so that's why he deserves one. Good.
So, genome annotation. How do you annotate a novel genome? Imagine that you are sequencing a puffer fish, there's no puffer fish genome available, and you get the DNA sequence. How do we now figure out where genes start, where they end, and these kinds of things?
Then I wanted to talk a lot about sequence alignment, because alignment is kind of the core of bioinformatics. Bioinformatics started as a field in the 1960s and 1970s, when we started doing DNA sequencing and protein sequencing. So how to align protein sequences against each other, and how to align DNA sequences against each other, is a fundamental part of bioinformatics, because that's where the field started. We will talk about pairwise alignment and the difference between global and local alignment. We'll talk about multiple sequence alignment: what happens when you have more than two sequences? Imagine that you're interested in the evolution of a certain gene. Then multiple sequence alignment will help you figure out how these sequences are related. I wanted to say a few words about structural alignment, so alignment based not on the sequence but on the 3D structure of molecules like DNA and proteins. I wanted to talk a little bit about DNA motifs and genome assembly using whole-genome sequencing, since those things fit in quite nicely with multiple sequence alignment.
And then, of course, I'm going to teach you how to do this in R, so that you don't have to go to a website and point and click. You can just write a script in R which does the multiple sequence alignment for you. If you combine this with biomaRt, so the automatic downloading of sequences from Ensembl or other databases, you can build quite nifty tools which do quite interesting things. And I will show you an example from my own work.
But of course, as always, we will first do the answers to the previous assignments. I actually thought about it, because for the YouTube recording it might be smarter to do the solutions at the end. But I think that for you guys it's good that you can just watch the first part if you only want the answers to the assignments. Let me open up my Notepad window so that you can see. Here it still says lecture nine, but because we had the R introduction lecture, it moved up by one. So these are my answers. Let's go through them one by one. I hope that everyone was able to do them; at least the PubMed part should be relatively easy.
Question one was: using PubMed, find all my publications, and remember that I've only been publishing since 2010. The standard way to do this is to go to PubMed and say, okay, I'm interested in this guy, this guy is called Danny Arends, so let's look at his publications. You just search for my name on PubMed. Let me make this a little bit smaller and get rid of the cookie pop-up. You see that it gives you some results, and it actually looks pretty good: my publications are there. It seems that nowadays it has figured out that I am a new person publishing, not the old one. I used to get a lot of publications from some guy named D Arends as well, who was publishing in the 1960s and 70s. Those are of course not my publications, but since the name is very similar, you find his publications too. So if you search for D Arends, you will probably find the older ones as well. Here you see the publications written by this other person, and then in 2010 it starts with me publishing. So only the ones from 2010 onwards are mine. Good. So the question was: find all my publications.
And then the answer was, oh, there's not really an answer. Question B is: why are there so many false positives? Well, there are so many false positives because D Arends as a name is not unique. And if you search for Danny Arends, you probably won't find all of my publications, so there are also false negatives, because not every publication has the full first name of the author. You can see that in total, in PubMed, there are 160 results, so 160 publications. When I then search for my full name, I get 40 results. But if I look at my Google Scholar, to figure out how many I really have, because on Google Scholar I'm pretty good at keeping my track record up to date, then it tells me that there are more than 40 publications in total. So you don't find all of them easily.
There are some filters that you can add. If we go here, we could say D Arends, that's what we want to search for, and then we want to add something like a publication date, so from 2010. I have to use the user guide to get the exact way of writing it. Key concepts: searching by journal, searching by date. "Use the results timeline": that's one of the ways, you can just drag the slider. You can also use the search builder. So you can say publication date like this, and I can just add 2022. So this would then be 2010/01/01, and I have to add that this is a date-of-publication search. And then there are no results. Is that because it doesn't match the name properly? Interesting. So doing this, I find a lot. And then adding the publication date, oh, you have to specify the Boolean operator. All right, let's see if that works. "Your search was processed without automatic term mapping because it retrieved zero results."
Interesting. All right, PubMed is a bitch sometimes. It's sometimes really hard to find the stuff that you want. But you can use the automatic term builder, or you can just search for my name and then use the slider, like they recommend. So I can say: go from 2010. Then it updates, and now you see that it finds 104. Of course, I want to search for D Arends, so it's interesting that it finds "Vitamin D in oncology", while the term Arends is probably not in there at all. This is also not one of my publications. Interesting. So it's relatively difficult to get a good overview in PubMed. But you can use the query builder and you can use the help. So for you guys: just go through it and figure out how to do it. I also put a link there: use a more complex query using search field tags. That's what I tried to do by adding the publication date, but then it doesn't show anything. Good. So let's go to the second question. It's not that important that you know how many publications I have. That's of no interest to anyone except myself.
All right, the next question is about UniProt. So let's first go to UniProt. There we go. The question was: find the names of the supported query fields in the UniProt help. Fortunately, I actually had the help link open. So here are all of the query fields that you can use in UniProt. Query fields allow you to tell the search engine what you mean. These are things that Google doesn't do. In Google you can search, and Google is pretty good at searching, but it's very bad at filtering. If you want to say "I want web pages where the subject of the page contains a certain word", you can't do that. But with UniProt you can, for example, do things like counting. So you can say: list all entries with exactly five transmembrane regions.
So you can build queries, and the search engine will understand that you are searching for things like "transmembrane count is five". It will understand that and give you back search results which are much more logical.
All right, so how many reviewed protein entries presently exist in UniProtKB for chicken? Let's go back to UniProt. We are interested in chicken, and because I want reviewed protein entries, I have to say reviewed is yes. And then I say: and organism is Gallus gallus, which is chicken, which is 9031. So in the query that I typed in, I specify the organism and I specify that I only want reviewed protein entries. If we do this search, it tells us that in total there are 2,297 reviewed entries for proteins in chicken.
How many of them have been created since 1 September 2011? Again, we update our query, so now we add a created date. We say reviewed is yes, organism is chicken, so 9031, and created, and then 20110901 to star, because we want everything up to today. If we do this query, we see that since 2011 only 87 reviewed protein entries have been added for chicken. The results only show us 87, which is not that much. But of course, there are not that many people working on chicken.
All right: retrieve the reviewed entry of cattle myostatin. What is the alternative name, and what is the accession number? The organism number for cattle is 9913. So we want to say: give me myostatin, make sure that it is reviewed, and that the organism is cattle. And you can get the organism identifiers from the help, of course.
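The three searches above can also be assembled as plain query strings. This is a minimal sketch in R, assuming the field names as they are spoken in the lecture (`reviewed`, `organism`, `created`); the helper `uniprot_query()` is hypothetical, and newer UniProt releases may use different field names, so the help page remains the authority.

```r
# Hypothetical helper: joins named field = value pairs into one query string
uniprot_query <- function(...) {
  terms <- c(...)
  paste(sprintf("%s:%s", names(terms), terms), collapse = " AND ")
}

# Reviewed chicken entries (taxonomy ID 9031 = Gallus gallus)
q_chicken <- uniprot_query(reviewed = "yes", organism = "9031")

# The same, restricted to entries created on or after 1 September 2011
q_recent <- uniprot_query(reviewed = "yes", organism = "9031",
                          created = "[20110901 TO *]")

# Reviewed cattle myostatin (9913 = Bos taurus); the free-text term goes first
q_mstn <- paste("myostatin", "AND",
                uniprot_query(reviewed = "yes", organism = "9913"))

print(c(q_chicken, q_recent, q_mstn))
```

The strings can be pasted into the search box as they are; building them in code mainly helps when the same filter has to be repeated for many organisms.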
So you can see here that when you search for myostatin, the alternative name is growth/differentiation factor 8, which is also the old gene name: MSTN is the current gene name, and the previous gene name is GDF8. So if you're looking for myostatin in the literature, and you're looking at literature from the 1960s, 70s and 80s, then people would use that gene name. They would talk about GDF8 in cattle, but what they are actually talking about is myostatin in cattle. So you have to keep in the back of your mind that a lot of genes have two or three names. By just searching for myostatin, you will only find a subset of the available literature, or a subset of the available proteins that are out there. That is a big issue in bioinformatics, because a computer can't understand that MSTN and GDF8 are the same thing, but you as a person can.
So what are the gene names of this protein? MSTN and GDF8, but it's also called MH. Why it's called MH as well, I don't know, but some people probably used that in the past. The current name is MSTN, and GDF8 and MH are called synonyms. They are the same thing.
All right: to which Ensembl identifier can bovine myostatin be mapped? Now we have to go inside the entry, because we want to know the Ensembl ID. Here, in the organism-specific databases, we see ENS, so Ensembl, BTA, G, and then a whole bunch of zeros, and then 11,808. So myostatin is the 11,808th gene in cattle. At least, that's how it's annotated. This is the gene identifier in HostDB, but it's the same for Ensembl. And of course, this thing also has a protein identifier. So here again you have the gene ID, but there should also be, yeah, here you see ENSBTA, so Ensembl and the species, and then you see P for protein, right?
So that's how the Ensembl identifiers are built up. First you have three letters saying that it's Ensembl, then three letters specifying which species it is, and then you have a P for protein, a G for gene, or a T for transcript. So in this case, this one here and this one here. In the answers I wrote down 11,808, and the protein number is 16,6567. That's just how it works.
Display the entry in raw format. All right, let's see where the raw format is. Format: I want the raw format, which is text. And then we see this. Imagine that you are searching through this database with a computer, so I'm using screen scraping: I use R to connect to the database and want to get this entry. Then what you see here is the entry as the computer sees it. The way they structure these entries is that, for example, you have an ID identifier, then a tab or a couple of spaces, and then the information, so this is the identifier. Here we have AC, which are the accession numbers, so the accession numbers under which you can find it in the protein database. All of these little two-letter codes in front tell you what is going to come next. For example, ID means the identifier is going to come next. If you see RP, it tells you something else. And if you see, for example, RA, which are probably the authors, then you see the names of the authors. So by using this, you can load this text file into the computer and say: I only want to see certain parts of this text file, because I'm only interested in certain parts for my analysis. And of course, they also have things like sequences. SQ means sequence, and you see here that you have the sequence in an almost FASTA-like format, but not exactly, because it doesn't have the greater-than symbol.
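Both ideas, the structure of an Ensembl identifier and the two-letter line codes of the raw text format, can be sketched in a few lines of R. The helpers `parse_ensembl_id()` and `parse_flatfile()` are hypothetical names, the toy record below contains made-up values, and the identifier pattern assumes a three-letter species code as described in the lecture (identifiers without a species code would not match).

```r
# Split an Ensembl-style ID into species code, feature type and number.
# Assumes the layout ENS + 3-letter species + G/T/P + digits.
parse_ensembl_id <- function(id) {
  m <- regmatches(id, regexec("^ENS([A-Z]{3})([GTP])([0-9]+)$", id))[[1]]
  if (length(m) == 0) stop("not an Ensembl-style identifier: ", id)
  types <- c(G = "gene", T = "transcript", P = "protein")
  list(species = m[2], type = types[[m[3]]], number = as.numeric(m[4]))
}

# Toy record in the flat-file layout; all values are made up for illustration,
# only the two-letter line codes mirror the format shown in the lecture.
record <- c(
  "ID   TOY_EXAMPLE             Reviewed;          12 AA.",
  "AC   X00001;",
  "DT   01-JAN-1998, integrated into the database.",
  "OS   Bos taurus (Bovine).",
  "DR   Ensembl; ENSBTAT00000000001; ENSBTAP00000000001; ENSBTAG00000000001.",
  "SQ   SEQUENCE   12 AA;",
  "     ACDEFGHIKL MN",
  "//"
)

# Columns 1-2 hold the line code, the content starts at column 6
parse_flatfile <- function(lines) {
  data.frame(code = trimws(substr(lines, 1, 2)),
             text = trimws(substring(lines, 6)),
             stringsAsFactors = FALSE)
}

entry <- parse_flatfile(record)

# Keep only the parts of interest, e.g. organism and cross-references
print(entry[entry$code %in% c("OS", "DR"), ])

# Sequence lines carry no code; remove the spaces to get one string
sequence <- gsub(" ", "", paste(entry$text[entry$code == ""], collapse = ""))
print(sequence)

print(parse_ensembl_id("ENSBTAG00000011808"))
```

Filtering on the `code` column is the "I only want to see certain parts of this text file" step; a real entry would be read with `readLines()` instead of being typed in.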
All right. So the question was: what do the line IDs DT, OS, DR and FT mean? Did I actually write that down? No, I didn't write it down. So let's see if we can figure out what they mean.
Let's go to DT. DT is the date at which the entry was submitted to the database, and if there are any updates, so DT is the date, the date-time for the entry. There are three different ones, because this is the original submission in 1998, and then there are some updated submissions afterwards.
OS. Let's search for OS and see where the OS identifier is. We don't have an OS identifier? Oh, here. OS is of course the species: it tells you the species that the protein was identified in.
Then we have DR. DR is interesting, because you see that after DR there's a whole bunch of different things. You see EMBL, then a code, then another code, and then it says mRNA. DR lines are links from this gene to other databases. You can see that after DR it says, for example, PaxDb, so this entry is called O18836 in PaxDb. If you go to GeneTree, GeneTree is actually using the Ensembl identifier. So these are all identifiers for the same protein or the same gene, but in different databases. You can also see that there are DR GO lines. These are Gene Ontology terms which have been assigned to this protein, like muscle cell cellular homeostasis. So it's something that works in the muscles.
And then the last one was FT. FT is here, and FT tells you the structure of the protein. You can see that if you look at myostatin, the first 18 amino acids are the signal peptide of the protein. And then we see that there's a propeptide, which is 19 to 266.
Then we see a chain which is called growth/differentiation factor 8. So the old name of the whole protein has now been assigned to a part of the protein. Then you see that there's a site: a cleavage site at which the protein can be cleaved into two parts, at amino acids 98 to 99. So that's where it cuts, between these two amino acids. And you see all kinds of other structural components of this protein. There are disulfide bonds. There's a carbohydrate thingy, an N-linked asparagine, so an amino acid which has been chemically modified. And there are some other things, like variants. If we look at Piedmontese, so not the standard cow, you see that other breeds have amino acid changes, which also get recorded in the database. So FT, the FT line ID, just refers to this description of how the protein is structured.
All right, the next question was: BLAST the myostatin protein sequence against the UniProt vertebrate database with default parameters; which species show a significant identity? So we go back, we say BLAST, and we just use the defaults, because that's what the question asked. Then we go, and this might take a little while because we're in the queue. It could take up to half an hour before we get our results. Good. So let's wait for this one and move on to the third question.
The third question was about doing stuff with Bioconductor and installing biomaRt. One of the drawbacks is, how do you call this: the R programming language has its own package manager, which is called CRAN. On CRAN are all the official packages. But for bioinformatics we have another repository, holding about 2,000 packages aimed very specifically at bioinformatics, and that is called Bioconductor. Bioconductor contains packages which allow you to annotate microarrays, or analyze CEL files, and these kinds of things.
Because all of these things are updated relatively frequently, for the authors of these packages it is not worth submitting to CRAN. When you submit to CRAN, you're kind of inside this harness, and your package is not allowed to change that much. You can't submit updates every week, because CRAN just says: no, we're here for stable packages that don't really change that much between January and February, or January and December. I think they only allow you to update your package every three months or something. But in bioinformatics a lot of packages are relatively new, and new algorithms get added on a weekly basis, so people sometimes don't submit their package to CRAN, the standard repository, but to Bioconductor instead.
But first we have to install the biomaRt package, because if we want to connect R to Ensembl and other biological databases, we need to have this package installed. So let me go to R. First things first: I can see whether I already have it installed using the `installed.packages()` command. If I run `installed.packages()`, it gives me a long list of everything which is already installed in R. Of course, I don't want to see all of it, I just want to see the names of the packages. Sorry, it's a big matrix, so you want to look at the row names. You can see that in total I have 346 packages installed. I want to know whether biomaRt is one of them, so I say `grep`, with "biomaRt", in the row names. And then it tells me: yes, it is installed, and it is entry number 22. So if I take `installed.packages()`, which gives me back a matrix, and look at row number 22, it gives back something like this. It tells me that biomaRt is already installed. This is where it is installed.
And then it tells you which other packages it requires, what the license is, and all kinds of other stuff. So I don't have to install it anymore. But let's install it anyway, to make sure that you know how to do that. The first thing that you have to do is source the Bioconductor library. Let me show you by going to Bioconductor, typing. So this is Bioconductor, a repository for the R language focused on bioinformatics. If we search for biomaRt here, it will probably find the biomaRt package. Here we see Bioconductor biomaRt, and we just click on the first search result. It's currently not building properly. But you can see that it's something like the 18th package out of 2,000 on Bioconductor, and it's been there for 16 and a half years already. If you scroll down, you see the installation guide. So if you want to install it, you just copy-paste this from here into your R window. Let's do that. We go to the R window and paste it in. Then it says "this is Bioconductor", and then it wants to update a whole bunch of my packages. I don't want to update now, I just want to install it. Okay, so I'm not allowed to do that: "package is not installed when version same as current". So I already have biomaRt installed, and it tells me that I can't reinstall it, because I already have the latest version.
I hope everyone was able to install biomaRt into their R session. Once we have it installed, we can make it active by typing `library(biomaRt)`, with a capital R. Then it loads the package, and in the meantime it tells you that there are other packages that it needs, so it loads those too. After it's loaded, everything is okay. All right, so the first step is to connect to Ensembl. We saw this in the assignment.
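The check-then-install routine described above could look roughly like this. It is a sketch, not the exact snippet copied from the Bioconductor page in the lecture: it assumes the `BiocManager` installer, and the helper names `is_installed()` and `ensure_biomart()` are made up for illustration.

```r
# TRUE when a package is already present in the local R library
is_installed <- function(pkg) {
  pkg %in% rownames(installed.packages())
}

# Install biomaRt from Bioconductor only when it is missing, then load it.
# This needs an internet connection, so it is wrapped in a function
# and has to be called explicitly.
ensure_biomart <- function() {
  if (!is_installed("BiocManager")) install.packages("BiocManager")
  if (!is_installed("biomaRt")) BiocManager::install("biomaRt")
  library(biomaRt)
}

# 'stats' ships with every R installation, so this check works offline
print(is_installed("stats"))
```

Calling `ensure_biomart()` once at the top of a script keeps it from failing on a machine where the package has not been installed yet.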
So we can just say `mart` is, let me look that up, `mart` is `useMart`, because we want to use a certain BioMart. In this case we want to use Ensembl, so let's connect to Ensembl. Let's hope that it works. It might be busy or overloaded, so it might take a little while. That's always one of these issues: if you do stuff which requires an internet connection and it's academic software, because BioMart is not a company that gets paid for doing this, you sometimes have to wait. That's just one of the drawbacks when you do bioinformatics: you are dependent on other people, who are either underfunded, or who do have enough funding but also have 10,000 users. So it's not as snappy and responsive as Facebook is. But we connected, so that's really nice.
The question is: how many datasets are available in the Ensembl mart? We can just do `listDatasets(mart)`, and I will store the result in a variable so that I don't have to query it again. I'm going to call this `ds`, for datasets. What this does is make a connection to Ensembl and ask Ensembl how many different datasets it has. When I look at the dimensions of `ds`, it tells me that there are 214 different datasets that I can use on Ensembl. If I look at the first five, you see that it's just a table that it returns. The first column is the dataset name, and I need to use this name to connect to it. Then it gives you a description, and then it tells you the current version of the genome that you're connecting to, because of course versions are important. You always have to write down the version that you're currently using.
And if you publish a paper using either biomaRt or other things, you also want to mention: well, I used the genome build Midas v5 for my Midas cichlid genes. I have no idea what a, is that like a little bug? Is it like a cicada? I don't know what a cichlid is. But if we use the golden eagle, so if we do some genomic analysis on golden eagles... "Cichlids are fish." Okay, so here again my perfect knowledge of biology is showing. See, even Misha knew that they were fish. Now I really look like a jackass for not knowing this. I'm sorry. It's good that Daniel's not here; Daniel would have laughed at me for not knowing that a cichlid is a fish. But anyway, if you use this, then you have to mention that you used version five of the genome.
Yeah, well, you can't know everything, right? I know bioinformatics, but biology is not my strongest subject, in a way. I know a lot about biology, but I don't know the names of all of the fish in the world, because I've never studied fish that much. I'm originally a plant biologist. But even then, if you pointed at a tree and asked me what kind of tree it is, I probably wouldn't know. I should know, because our professor was always very adamant about us learning biology: if you're a bioinformatician, you should know the biology. So in my old group we would walk around the arboretum, and Richard, my old professor, would point at a tree and ask a random person: what type of tree is this? So I still know a lot of types of trees, but if you don't use it, you forget. That's how it works.
Anyway, question number three: how many datasets are available? At the moment there are 214. There used to be fewer: in the answers it actually said 203. So 11 new species got added in the last year, or since we did it last time.
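The connection and dataset count discussed above might be scripted as follows. This is a sketch that assumes the biomaRt package is installed and the Ensembl server is reachable; the wrapper name `list_ensembl_datasets()` is hypothetical, and the number of datasets returned changes between Ensembl releases.

```r
# Connect to the Ensembl BioMart and return the table of available datasets.
# Wrapped in a function because it needs a network connection.
list_ensembl_datasets <- function() {
  mart <- biomaRt::useMart("ensembl")
  ds <- biomaRt::listDatasets(mart)  # one row per dataset, with name, description, version
  message("Datasets available: ", nrow(ds))
  ds
}

# Example use (commented out so the script also loads offline):
# ds <- list_ensembl_datasets()
# head(ds, 5)
```

Storing the result in `ds`, as in the lecture, avoids a second round trip to the server; the version column is the value to write down for a methods section.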
All right, so we can connect to Ensembl and use a specific dataset by reconnecting using the `useMart` function. We now specify two parameters: the database that we want and the dataset that we want to use. In this case we wanted to connect to mouse, so we just issue the `useMart` command, connect to `ensembl`, and select the `mmusculus_gene_ensembl` dataset.
"Cichlids are a commonly kept aquarium fish." I didn't know that. I only know this much about fish. And I know that they taste good. I had salmon yesterday; that was pretty good. Good. So again, it's free software, so you just have to wait your turn to be able to connect to the dataset. It's one of these drawbacks: biomaRt is really good when you use it, it's just not the quickest.
But fortunately, the BLAST results for the assignment have finished, so we can look at those while R is connecting. Here we see the results. Let me zoom out a little bit more. We used the myostatin gene and we blasted it against their database. We can see that it is related to the Indian bison, and to Bos taurus, where there's 100% identity to its own sequence. We see that it's also very similar to Bos indicus, or zebu cattle. You can see that because the E-value is zero and the identity is 100%. If we want the first entry which is not 100%, then we see that the closest protein to Bos taurus myostatin is found in the domesticated water buffalo, with 98% similarity. And of course, the E-value is still zero. If we then go down the list, you can see that the myostatin gene of cows is very similar to the myostatin gene in the giant eland, which is interesting. But then one of these funny things that I always find interesting is that when you look at cow proteins, right?
Then if you do a protein search for evolutionary distance, cow proteins show a very high similarity to whales. And that is of course because of the way that whales came to be. A whale is a mammal that lives in the ocean, and there is a common ancestor between cows and whales. Whales went back to the ocean not that long ago. It's not a million-million-year history, it's not a billion years ago that this happened. But I always find it really funny that when you search for cow genes, whales and dolphins come up as being relatively closely related, even more closely related than things like sheep and goat. So the common ancestor between cows and sheep is longer ago than the common ancestor between cows and whales. Whales went back to the ocean not that long ago, while cows and sheep split off much earlier in the evolutionary tree than cows and whales, which I always find very, very funny, because you don't expect whales to be there.
Good, so we're connected. The next question was: connect to the mouse database, which we did. How many attributes are provided in the mmusculus gene dataset in Ensembl? We can just list all of the attributes. So we take our mart, say `listAttributes` of this mart, and I'm going to save this in `attr`. If I now ask for the dimensions of `attr`, you see that there are 2,983 different things in this database that we can query for. If we look at a couple, let's look at the first 10, we can see that we can retrieve gene IDs, we can retrieve gene descriptions, we can retrieve chromosomes and start positions of genes. And if we look way, way down, you can see that there are also some very esoteric things: you can, for example, find the bison homologue confidence score of the orthologous gene, right?
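Connecting to the mouse dataset and counting its attributes might be written as below. This is a sketch that assumes biomaRt is installed and Ensembl is reachable; `connect_mouse()` and `count_attributes()` are hypothetical wrapper names, and the result is stored as `attrs` rather than `attr` to avoid shadowing the base R function of that name.

```r
# Connect to the Ensembl mart and select the mouse gene dataset
connect_mouse <- function() {
  biomaRt::useMart("ensembl", dataset = "mmusculus_gene_ensembl")
}

# Number of attributes, i.e. things that can be retrieved from the dataset
count_attributes <- function(mart) {
  attrs <- biomaRt::listAttributes(mart)
  nrow(attrs)
}

# Example use (needs a network connection):
# mart <- connect_mouse()
# count_attributes(mart)
# head(biomaRt::listAttributes(mart), 10)
```

The count will differ from the number mentioned in the lecture whenever Ensembl adds or removes attributes in a new release.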
So there are a lot of things that you can actually query for: in total, 2,983 attributes. Then we list all of the filters, I think that's the next question. So we say fil is listFilters of our mart, and then we ask for the dimensions of fil, and you can see that there are 396 different query options. So we have 396 different ways of telling Ensembl what we want to retrieve. If we look at the first ten, then of course we can search by chromosome name, by start and end position, by strand. We can search for a whole chromosomal region. But we can also search with all kinds of different IDs. For example, if we have an Entrez Gene ID or a European Nucleotide Archive ID, we can also use that to retrieve genes from Ensembl. That's the nice thing, because in many cases when you have a list of gene identifiers and you want to query other databases, you first have to translate your identifiers, and BioMart is able to do that for you. You can just query using Ensembl IDs and then retrieve, for example, the ChEMBL IDs, so the IDs for the EMBL chemistry database. Good. So how many were there in total? 396 is the answer. All other interactions with BioMart are done using the getBM function. The structure is as follows. Read the provided help and try to understand what attributes, filters and values are. So I think that was made very clear in the previous lecture. And then for a small example: imagine I want to retrieve the chromosome on which myostatin is located. I know that the MGI symbol for myostatin is Mstn, the same as in cows. I can then retrieve this using the following R command. So let me just show you guys that the command that I put in actually works.
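The command itself was on the slide and is not in the transcript, so here is a hedged reconstruction of the filter listing and the small myostatin example. The attribute and filter names (`chromosome_name`, `mgi_symbol`) are the standard Ensembl ones and are assumed to match what the lecture used.

```r
# Assumes biomaRt is installed and Ensembl is reachable.
library(biomaRt)
mart <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "mmusculus_gene_ensembl")

# Filters are the ways in which a query can be restricted.
fil <- listFilters(mart)
dim(fil)        # the first number is the filter count for this release
head(fil, 10)

# attributes = what to retrieve, filters = what kind of value is supplied,
# values = the value itself, mart = the connection to query.
res <- getBM(attributes = "chromosome_name",
             filters    = "mgi_symbol",
             values     = "Mstn",
             mart       = mart)
res
```

The `filters` argument is what tells Ensembl how to interpret `"Mstn"`. Leaving it out is what produces the result spread over nearly all chromosomes that the lecture goes on to describe.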
I'm looking at the PDF file that I uploaded, the one with the answers. So I'm just going to copy this out, go here, and paste. This was the little example that was in there. What do I want to retrieve? Well, I want to retrieve the chromosome. What am I going to provide? I'm going to provide an MGI symbol. The value is Mstn, and I want to search in the currently connected BioMart. So let's do the query, and it tells me that myostatin in mouse is located on chromosome 1. Good. I can add as many attributes as I want to retrieve; the important part here is that I set my filter to the MGI symbol. Do not setting the, ooh, that's a very poorly written sentence. When I do not set the filter parameter, what happens and why? Okay. No idea why I asked that question, but all right. So when I just call getBM, ask for the chromosome name and give it Mstn, then all of a sudden you see that it comes back with almost all chromosomes in mouse, because it has no idea what Mstn means. I have to tell the database what the value I'm giving it means, because otherwise it will just do a database-wide search, and then of course you will find the letters MSTN in many different positions in the genome. All right. So what is the genome location of the myostatin gene? So chromosome, start and end positions. Again we use the same logic, but we are going to add some parameters, because I don't just want to retrieve the chromosome name. I also want the start position and the end position. So it's start_position comma end_position. Again, the filter is the MGI symbol and the value is Mstn. If I do this and query BioMart, it tells me that myostatin is of course located on chromosome 1. We already knew that. But next, myostatin starts at 53.1 megabases and it ends at 53.1 megabases, right?
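A minimal sketch of the location query just described, assuming the standard Ensembl attribute names `start_position` and `end_position`. The length column is an addition for illustration; it counts both end points, and the transcript does not say whether the figure computed in the lecture did the same.

```r
# Assumes biomaRt is installed and Ensembl is reachable.
library(biomaRt)
mart <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "mmusculus_gene_ensembl")

# Same filter and value as before, but with three attributes requested.
pos <- getBM(attributes = c("chromosome_name", "start_position", "end_position"),
             filters    = "mgi_symbol",
             values     = "Mstn",
             mart       = mart)

# Gene length in base pairs, counting both the first and the last base.
pos$length_bp <- pos$end_position - pos$start_position + 1
pos
```

The exact coordinates depend on the genome assembly behind the Ensembl release you connect to, so they may not match the numbers read out in the lecture.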
So you can see that this gene is, let me actually compute that, 6,439 base pairs long. Good. Next question. So now for something more fun, depending on your idea of fun, let's retrieve all genes near the myostatin gene. We can do this by using the chromosomal_region filter. The locations are specified as chromosome, colon, start, colon, end. So here again we want to use a slightly different filter. Let me switch you guys to the answers. Right. So here we say getBM. We want to retrieve the MGI symbol, because I'm now not just retrieving myostatin, I might retrieve all kinds of other genes. And I say, give me the chromosome name, start position and end position. The question was: locate all genes that are within 200,000 base pairs of the myostatin gene. How many genes are within this region, not counting myostatin itself? Here I already did the mathematics, I think. Is it correct? Yeah, seems to be kind of okay. These numbers might have been based on an older version. I'm wondering where this start number comes from; I think I took the halfway point of the gene. The start of the gene was at 53.1, so 200,000 base pairs earlier is roughly 52.8, and 200,000 base pairs later is slightly above this. But the answer doesn't matter too much. What matters is that you guys learn how to use BioMart and how you can build up these queries. So here we ask for the MGI symbol, the chromosome name, the start and the end position, and we filter in this case by chromosomal region. I give it the chromosomal region that I want to retrieve.
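To avoid doing the window arithmetic by hand, the region string can be built with a small helper. This is a sketch in base R: `region_around` is a hypothetical helper name, and the coordinates in the example are round placeholder values near the 53.1 Mb quoted in the lecture, not the real Ensembl coordinates.

```r
# Build a "chromosome:start:end" string covering a gene plus a flank on each side.
region_around <- function(chr, start, end, flank = 200000) {
  lo <- max(1, start - flank)  # do not run past the start of the chromosome
  hi <- end + flank
  # format(..., scientific = FALSE) stops R from writing e.g. 5.3e+07
  paste(chr,
        format(lo, scientific = FALSE, trim = TRUE),
        format(hi, scientific = FALSE, trim = TRUE),
        sep = ":")
}

# Placeholder coordinates for illustration only.
region <- region_around(1, 53100000, 53106438)
region  # "1:52900000:53306438"
```

The resulting string is what goes into `values` when `filters = "chromosomal_region"`. Counting the neighbours without myostatin itself is then a matter of something like `sum(genes$mgi_symbol != "Mstn")` on the returned data frame, where `genes` is whatever name you gave the query result.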
Then it tells me that there are one, two, three, this one doesn't count, four, five, six, seven genes within 200,000 base pairs of myostatin on mouse chromosome 1. All right, we can also combine filters using the c function. Imagine that I only want to have protein-coding genes: I can add the filter biotype and filter for genes that have the biotype protein_coding. So again, same kind of structure. Let me show you the answer that I had. What I do here is I say I still want to retrieve the same things. I'm going to provide two different search filters: one is chromosomal_region and the other is biotype. And now let me actually set the syntax highlighting. Not sure why it didn't do the syntax highlighting. But now here I have to use the list function, because I could have asked for not one chromosomal region but five or ten, and I could have asked for not a single biotype but five different biotypes. Because of this, I now need to provide a list. This list has just two elements, because I'm only retrieving a single chromosomal region. But now I'm saying, give me this chromosomal region and only give me protein-coding genes. If I had added a second chromosomal region, then the first entry of the list would have been a vector. So imagine that we add another chromosomal region. Let's take the same region here, but now on chromosome 2. Now I have to use the c function, because the first entry of the list is a vector which has two entries, and the second entry of the list is a single entry. Let me show you guys how this looks in R, because lists are different from vectors. So this is how the list looks now. The first entry of the list has two regions. The second entry of the list only has one entry, protein_coding.
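The difference between the two `values` arguments can be seen without connecting to Ensembl at all. A small base R sketch, with placeholder region strings rather than the real myostatin coordinates:

```r
# The two filters, combined into one character vector with c().
filters <- c("chromosomal_region", "biotype")

# One region, one biotype: each list element holds a single value.
vals_one <- list("1:52900000:53306438", "protein_coding")

# Two regions, one biotype: the first list element is now a vector built with c().
vals_two <- list(c("1:52900000:53306438", "2:52900000:53306438"),
                 "protein_coding")

# str() shows the structure: a list of 2, the first element a character vector of 2.
str(vals_two)

# The list has one element per filter, in the same order as the filters.
length(vals_two) == length(filters)
```

Either list would then be passed as `values = ...` alongside `filters = filters` in the query call. A plain `c()` around everything would not work here, because it would flatten the regions and the biotype into one vector and lose the information about which value belongs to which filter.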
And it will search for all protein-coding genes in this region and protein-coding genes in that region. All right, but that was not the question, so let's go back to the question. Let's just do the query and add the additional filter for protein coding. Now it tells me that of the seven genes which are located near myostatin, only four code for a protein. The other ones, like this Gm24349 gene, are not coding for a protein. This gene codes for something else: it might be a long non-coding RNA, a microRNA or a tRNA, but it doesn't code for a protein. So that's nice. And generally, of course, we want to study protein-coding genes, unless you're interested in microRNAs. All right, so how many protein-coding genes are there? Well, there are four. And then there's an additional question. BioMart contains information on genes, so we need to connect to another database to retrieve single nucleotide polymorphisms. The answer is here. So how do I do that? Well, I have to switch to a different mart, the SNP mart in this case, and the dataset is going to be mmusculus_snp. Let me see if that is still correct. No. So actually the name has changed, because it now says that this is an incorrect BioMart name, use the listMarts function to see which are available. So let's do listMarts, and it will tell me all of the different BioMarts that I can connect to. I should have stored this because it might actually be a big list. It's not, okay. So they renamed the mart from snp to ENSEMBL_MART_SNP. Let's update that and connect again. I hope that the name of the dataset hasn't changed. So just useMart, say use the SNP mart, so the single nucleotide polymorphism mart, and then connect to mmusculus. Good. Solve the assignment.
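The switch to the SNP mart can be sketched as follows, together with a small SNP query. The mart and dataset names are the ones mentioned in the lecture. The attribute and filter names (`refsnp_id`, `allele`, `chr_name`, `chrom_start`, `start`, `end`) follow the examples in the biomaRt documentation and are assumed to match what was on screen, and the 10 kb slice of chromosome 3 is an arbitrary choice to keep the query quick.

```r
# Assumes biomaRt is installed and Ensembl is reachable.
library(biomaRt)

# Names change between releases, so check which marts are available first.
listMarts()

# Connect to the variation (SNP) mart and select the mouse SNP dataset.
snpmart <- useMart("ENSEMBL_MART_SNP", dataset = "mmusculus_snp")

# Positions are passed as strings so that R cannot turn them into 1.5e+07.
snps <- getBM(attributes = c("refsnp_id", "allele", "chr_name", "chrom_start"),
              filters    = c("chr_name", "start", "end"),
              values     = list("3", "15000000", "15010000"),
              mart       = snpmart)
head(snps)
```

In the `allele` column a reference allele followed by a dash, such as `TT/-`, is how a small deletion typically shows up.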
The assignment is the one we started in the lecture: for all protein-coding genes located on chromosome 3 between 15 megabases and 45 megabases in mouse, retrieve the number of protein-coding genes, the number of exons, and the number of SNPs per exon. That's a big, big, big question. I'm just going to do the SNP one, since we already did the other two: we did the exons and we already did the number of genes. If we want to retrieve the SNPs, then what we do is give it the chromosomal region that we're interested in. I don't know why I selected this region, but you can just query. The thing that I'm actually querying for is: give me the SNP ID, the allele, the chromosome name and the chromosome start. Then it gives you a big list of different SNPs, and it also gives you the position of each single nucleotide polymorphism. And you can see that this one is actually a small indel: the reference genome has TT and the alternative allele is missing, so it's a two-base-pair deletion, and here we see a one-base-pair deletion. Good. So those were the assignments. If there are no other questions about the assignments, then we can start with the lecture, for which we still have five minutes this hour. Good. So let's go back to the lecture. Genome annotation is going to be the name of the game today. So that is the goal: we want to annotate a new genome based on information that we might already have from other species. Imagine that we sequenced a new type of pufferfish or a new type of, what was that fish called? Cichlid. We sequenced a new type of cichlid. This species doesn't have a genome yet, or it has a genome but it doesn't have any annotation. Then we can use the homology trick. And then afterwards we will be talking about sequence alignment, DNA motifs, and then whole-genome sequencing and multiple sequence alignment in R. Right. So imagine that we have a newly sequenced genome.
How do we annotate genes on it? Imagine that I have this little sequence here on a new chromosome that I assembled using whole-genome sequencing. How do I annotate this? Well, the way you do this is to take the sequence and find the homologous sequence with a known function that is closest to it. So it's kind of like using BLAST. We take the sequence, we have a database containing sequences of known function, for example Ensembl, and we compare our sequence to all of the sequences in the database until at a certain point we find a match. For example, if we're annotating the whale genome, then of course a lot of these matches will be proteins in the cow genome which are annotated as having a certain function or a certain name. So that is the homology trick. The homology trick is very, very old, and people always ask, why does this work? Well, this works because of Charles Darwin. We know from On the Origin of Species that life is a singular event which happened once, and then we have life branching off in all kinds of different directions. Of course we have things like mutation modifying sequences, and all kinds of other processes which make sequences different in the course of evolution. But if we are looking at a new species, we can always look at a closely related species and use the information from that species to infer the function of sequences in our new species. So this works because there was a single origin of life on planet Earth, and every animal living today can be traced back to this single origin. All species are related to each other, and because of this we can use homology.
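As a toy illustration of the homology trick, the sketch below compares a query sequence against a tiny database and returns the name of the closest entry. It is deliberately not BLAST: similarity here is the fraction of shared 3-letter words (a Jaccard index on k-mers), which is a simplified stand-in chosen for brevity. The sequences, the names `gene_A` and `gene_B`, and the function names are all invented for this example.

```r
# All k-letter words (k-mers) that occur in a sequence.
kmers <- function(s, k = 3) {
  n <- nchar(s) - k + 1
  stopifnot(n >= 1)
  unique(substring(s, 1:n, k:nchar(s)))
}

# Similarity = shared k-mers divided by all k-mers seen in either sequence.
similarity <- function(a, b, k = 3) {
  ka <- kmers(a, k)
  kb <- kmers(b, k)
  length(intersect(ka, kb)) / length(union(ka, kb))
}

# Compare the query with every database entry and return the best match.
annotate <- function(query, db, k = 3) {
  scores <- vapply(db, similarity, numeric(1), b = query, k = k)
  names(scores)[which.max(scores)]
}

# Invented toy database of sequences with "known function".
db <- c(gene_A = "ATGCAAAAACTGCAACTCTGT",
        gene_B = "ATGGTGCACCTGACTCCTGAG")

# Invented query: gene_A with a single changed base.
query <- "ATGCAAAAACTGCAAATCTGT"
annotate(query, db)
```

The new sequence would then inherit the annotation of its best match, which is exactly the inference step the lecture describes for whale genes and cow proteins.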
So to find out how similar a sequence is in a species that we do not know much about, we can use species that we know a lot about; that is the trick. Like I told you guys, it works because of evolution. Sequences are of course changed in the course of evolution. There are things like mutations that occur. We have insertions, deletions, and we even have transposons, which are genes that jump around in the genome. And we have chromosomal rearrangements: a whole chromosome or part of a chromosome can be duplicated, a chromosome can break and part of it can be inverted, and we have translocations. So it is even possible for a chromosome to break and for the piece which breaks off to merge with another chromosome; that is called a translocation. So the correspondence between homologous sequences is never exact. It's never a 100% match, unless you're looking at very, very closely related species. If I'm comparing cows from Europe with cows from India, then of course the difference is going to be minor, only a couple of mutations. But to say that two sequences or two proteins are homologous, we have to have a method to match sequences to each other, and this way of matching should be inexact. It should allow for duplications, mutations, insertions and deletions. And this is called pairwise sequence alignment. Why is there a minus? Let me actually fix that. Pairwise sequence alignment. Good. So that is where pairwise sequence alignment comes from. Since Charles Darwin we know that we can use homology, but we have to have a mathematical definition of what is similar and how similar these things are. And of course, there's no single answer. Right.
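One possible mathematical definition of similarity is a global alignment score. The sketch below uses the classic Needleman-Wunsch dynamic programming scheme, which is named here as an illustration and has not been introduced in the lecture at this point. The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions, and different values give different answers, which is one reason there is no single answer.

```r
# Global alignment score of two sequences by dynamic programming.
nw_score <- function(a, b, match = 1, mismatch = -1, gap = -2) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)

  # S[i + 1, j + 1] holds the best score for aligning x[1..i] with y[1..j].
  S <- matrix(0, nrow = n + 1, ncol = m + 1)
  S[, 1] <- gap * (0:n)  # x aligned against nothing but gaps
  S[1, ] <- gap * (0:m)  # y aligned against nothing but gaps

  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      s <- if (x[i] == y[j]) match else mismatch
      S[i + 1, j + 1] <- max(S[i, j] + s,        # align x[i] with y[j]
                             S[i, j + 1] + gap,  # x[i] against a gap
                             S[i + 1, j] + gap)  # y[j] against a gap
    }
  }
  S[n + 1, m + 1]
}

nw_score("GATTACA", "GATTACA")  # 7: seven matches
nw_score("GATTACA", "GATACA")   # 4: six matches and one gap
```

The second call shows the inexact matching at work: the sequences differ by one deleted base, and the best alignment pays for a single gap instead of a run of mismatches.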
If you have two sequences, you can look at the number of exact matches, you can look at the number of gaps that you have to introduce, and all of these things. So there has to be a method for inexact pattern matching, and that is what we're going to talk about for the rest of the lecture. Good. So I'm going to stop here and take a short break of 10, 15 minutes. The first break for you guys, that's a difficult one, because I just downloaded a whole bunch of new animated GIFs, because last time someone complained that they had already seen the animated GIFs before, and I didn't want that to happen again. Of course, if anyone has any suggestions for animals, because I always show animated GIFs of animals in the break, if there is an animal that you really want to have, then throw it in chat and let me know. For me it takes only a couple of minutes to find animated GIFs of a certain species, and then you could more or less have your own break series. But I think the first break is going to be crocodiles, because we didn't have crocodiles before. So I will be back in 10 minutes. I will also put on a little bit of music, I think, if that is possible, because I am not fully set up because of the issues at the start of the stream. So we'll continue at 2:10, and see you then.