Welcome back, everyone. If you're watching this on YouTube or on Twitch, welcome to part number two. So the summary from the first part was: when you design a primer, make sure that it's unique and that it's about 17 to 28 base pairs long. It should have about the same amount of A's and T's as C's and G's. Don't make low-complexity primers, like ones that are just AAA. Make sure that you have high stability at the five prime end and low stability at the three prime end. Make sure that the melting temperature is between 55 and 80 degrees Celsius. Because you're always using two primers, one forward and one reverse, make sure that they have the same melting temperature and the same annealing temperature, and minimize secondary structures. Good.

So the rest is going to be advanced. And by advanced, I mean not so advanced nowadays, but it was really advanced in the 1990s, when people started designing primers. Multiplex PCR is still used a lot when you do sequencing, but it's generally easier nowadays because you can use a computer to do all of these things for you. So I wanted to cover four types of advanced primers: multiplex PCR, universal primers, semi-universal primers, and guessmers. Guessmers are my favorite to design because they take more brain power, in a way. They are more difficult to design because instead of having a DNA sequence at which you target your primers, in a guessmer you take a protein sequence and then try to guess what the DNA sequence encoding this protein is. So guessmers are more interesting.

So multiplex PCR means that you have multiple primer pairs added to the same tube when you do the PCR reaction. You use it when you want to amplify multiple sites in the genome. For example, you do this a lot with genome identification.
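As a side note to the recap at the start of this part, the rules of thumb can be sketched as a quick check in Python. This is a minimal sketch, not a real primer design tool: the function names, the 40 to 60 percent GC window, and the cut-off of four identical bases in a row are assumptions made for illustration, and the melting temperature uses the very rough Wallace rule (2 degrees per A/T, 4 degrees per G/C). Uniqueness, end stability, and secondary structures are not checked here.

```python
def gc_fraction(primer: str) -> float:
    """Fraction of bases that are G or C."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> int:
    """Rough melting temperature in degrees Celsius (Wallace rule)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def longest_run(primer: str) -> int:
    """Length of the longest stretch of one repeated base (4 for AAAA)."""
    primer = primer.upper()
    best = run = 1
    for prev, cur in zip(primer, primer[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


def check_primer(primer, min_len=17, max_len=28, tm_min=55, tm_max=80,
                 gc_min=0.4, gc_max=0.6, max_run=3):
    """Return a list of rule-of-thumb violations (empty list = looks fine).

    The length and Tm limits come from the recap; the GC window and
    max_run are illustrative assumptions.
    """
    problems = []
    if not min_len <= len(primer) <= max_len:
        problems.append("length")
    if not gc_min <= gc_fraction(primer) <= gc_max:
        problems.append("GC content")
    if not tm_min <= wallace_tm(primer) <= tm_max:
        problems.append("melting temperature")
    if longest_run(primer) > max_run:
        problems.append("low complexity")
    return problems


def tm_matched(forward: str, reverse: str, max_diff: int = 5) -> bool:
    """Forward and reverse primer should have similar melting temperatures."""
    return abs(wallace_tm(forward) - wallace_tm(reverse)) <= max_diff


print(check_primer("AGCGTACCTGATCGGATCCA"))  # made-up example primer
print(check_primer("A" * 20))                # the "just AAA" kind of primer
```

In practice you would let a dedicated tool compute the melting temperature with a nearest-neighbor model; the point here is only that each rule from the recap is a simple, checkable property of the sequence.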
So we used, for example, a multiplex PCR recently when we wanted to sequence the four different casein genes that are in goats. Goats have four casein genes, and instead of doing every sample four times, more or less in serial, one after the other, you can just target all four genes at the same time. For each gene you need one primer pair, so you need eight primers in total. The design difficulty that comes in is that the melting temperature should be similar for all eight primers that you are designing. And of course, no dimers may occur between any of them. So you're just a little bit more restricted in the primers that you can design. But multiplex PCR is actually used a lot. If, for example, you want to design an experiment where you can distinguish between six or seven different types of goat, then of course looking at a single region in the genome is not good enough. You need to target like five or six regions of the genome. And then, based on these regions together, you can say: oh, it's a Sudanese goat, or it's a goat from France, or it's a goat from the US. So for genome identification, you generally want to look at multiple points in the genome instead of just a single point.

Universal primers are a little bit more tricky. You can design a primer pair to amplify one product, but we can also design primers that amplify multiple products, and such primers are called universal primers. For example, if we look at the human papillomavirus, this virus has, I think, something like 32 different variants. So you have a single virus, but it comes in 32 different variants, like with COVID-19, where we nowadays have the Alpha, the Beta, the Delta, and the Omicron variants, and we want to amplify all of these, right?
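Coming back to the multiplex constraints for a moment (similar melting temperature for all primers, no dimers between any of them), they can be sketched as an all-against-all check. This is a simplified illustration under stated assumptions: the primer sequences and names are made up, the melting temperature again uses the rough Wallace rule, and a "dimer risk" is defined here as the last five bases of one primer being perfectly complementary to some stretch of another primer. Real tools score partial complementarity and thermodynamics instead.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def wallace_tm(primer: str) -> int:
    """Rough melting temperature (Wallace rule): 2 per A/T, 4 per G/C."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def three_prime_dimer(p1: str, p2: str, n: int = 5) -> bool:
    """True if the last n bases of p1 can pair perfectly somewhere on p2."""
    tail = p1.upper()[-n:]
    return reverse_complement(tail) in p2.upper()


def dimer_risks(primers: dict, n: int = 5) -> list:
    """Check every primer against every primer in the tube, itself included."""
    return [(a, b) for a in primers for b in primers
            if three_prime_dimer(primers[a], primers[b], n)]


def tm_spread(primers: dict) -> int:
    """Difference between the highest and lowest melting temperature."""
    temps = [wallace_tm(seq) for seq in primers.values()]
    return max(temps) - min(temps)


# Hypothetical primers: the 3' end of "fwd" (CTGAC) pairs with GTCAG in "rev".
tube = {
    "fwd": "AAGGTTCCAAGGTTCTGAC",
    "rev": "TTGTCAGAACCAACCAACC",
}
print(dimer_risks(tube))  # pairs that could form a dimer
print(tm_spread(tube))    # spread in melting temperature across the tube
```

The number of comparisons grows with the square of the number of primers, which is why adding one more pair to a large multiplex means re-checking it against everything already in the tube.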
So we don't want to know if you have Alpha specifically; we want to know if you have Alpha or Omicron or Beta or Delta. So we want to amplify not just one piece of DNA, but a whole family. Instead of having a single target, we're actually targeting four or five things at the same time, or sometimes up to 32. The strategy here is that you first need to align all of the sequences that you want to amplify. So you make an alignment, where you can use ClustalW or some other program, aligning your target DNA, so your template DNA, to find the most conserved regions at the five prime end and at the three prime end. You design a forward primer in the five prime conserved region, you design a reverse primer in the three prime conserved region, and then of course we have to match the forward and the reverse primer to find the best pair. Then we have to make sure that each primer is unique within all of the different template sequences, and of course that it does not also bind to any possible contamination sources.

So let me draw that for you guys. Normally I would draw on the board, but since we don't have a board, I'm just going to draw it on a slide. So let me get a new slide. Very good. Let me see if I can actually draw, because I didn't test the pen this morning.

So imagine that we have a virus. This is the standard virus. But now we have a different variant of the virus, and this variant just has a little gap here. So it's the same sequence, but it misses this part. So this is one, this is two, and then of course we can have a third variant of the virus, which has a little gap here, and it might also have a little insertion, so a piece of DNA which is unique to this one. So this is number three. Then we can look at number four.
And number four might be very similar to the original virus, but has an insertion here. And then we might have a fifth version of the virus, which is just like this, and the only thing that this one has is some mutated base pairs. So the mutated base pairs are the little x's. Now, after we've done the alignment, what we will see is that the alignment makes sure that the beginning of the sequence and the end of the sequence line up and are more or less the same. And now of course we want to find a region which is conserved in all of these five viruses. If we look at this, then we would say: okay, this part here is conserved in all viruses; in all five it's the same. And this goes up to here, where we have the single nucleotide polymorphism. Then here we can see that there's a little part which is also conserved, because it also occurs in all of the different viruses. And here at the end we have the same, because this is also more or less conserved in all of them. So now we want to find primers which fit in these regions. We want to design a forward primer here. Forward primers are generally drawn like this, so you have a forward primer which goes here somewhere. And now we could design a very small reverse primer here, or we could design a reverse primer here. In this case, the only pair that is really possible is this one: this would be your forward primer, and here you would design your reverse primer. So it's just taking all of the available sequences, and it doesn't matter how many there are. Like I told you guys, in human papillomavirus there's like 32 or 38 different variants.
So the first thing that you do is you take all of these 38 sequences, you align them, and then you try to figure out what the best region would be to design your primers on. In this case, the red regions are the regions which are conserved across all of the different types of the virus. So you want to put your primers there to make sure that you can target any of them. Now when you get an amplification, you know it was one of these five. You don't know which one, but you know that it was at least one of these five. So that's generally how these work, and that's the strategy, just simplified by a little drawing.

Semi-universal primers are more or less the same, but now you want to exclude some of the variants. So you say: okay, I know that HPV comes in 32 different types, but only the first six are very dangerous for humans, and the other ones we don't care too much about. If you have one of those, it's okay, nothing will happen. But types one to six are very dangerous. So what you do is again the same strategy. You align all the different types that you have. You identify a subset of the sequences that is more similar to each other than to the others; in this case we want to look at types one to six, which are the dangerous types. So they should match, but the other 25 or so should not have this piece. So it's again the same strategy, but now we find the region which is conserved among types one to six but which is not conserved in the other types. And this is really difficult, because generally it's very hard to find regions where exactly the group that you are interested in is different from the group that you're not interested in. But this is the strategy, and for the rest it is of course the same as for universal primers.

So guessmers are my favorite, like I told you guys.
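Before moving on, the universal and semi-universal strategy described above (align everything, then look for windows that are identical in all targets, and for semi-universal primers also different in the sequences you want to exclude) can be sketched as follows. This is a minimal sketch under stated assumptions: the input is an already computed alignment given as equal-length strings with `-` for gaps, the toy sequences are made up, and "different enough" is arbitrarily defined as at least two mismatches in the window.

```python
def conserved_windows(aligned, size):
    """Start positions of gap-free windows that are identical in all sequences."""
    hits = []
    for start in range(len(aligned[0]) - size + 1):
        window = aligned[0][start:start + size]
        if "-" in window:
            continue
        if all(seq[start:start + size] == window for seq in aligned):
            hits.append(start)
    return hits


def semi_universal_windows(targets, others, size, min_mismatches=2):
    """Windows conserved in all targets that mismatch every excluded sequence."""
    hits = []
    for start in conserved_windows(targets, size):
        window = targets[0][start:start + size]
        if all(
            sum(a != b for a, b in zip(window, seq[start:start + size]))
            >= min_mismatches
            for seq in others
        ):
            hits.append(start)
    return hits


# Toy alignment: variant 2 has a gap, variant 3 has one mutated base.
variants = [
    "ACGTACGTAAGGCCTT",
    "ACGTAC--AAGGCCTT",
    "ACGTACGTAAGGCATT",
]
print(conserved_windows(variants, 5))  # candidate universal primer sites

# Toy semi-universal case: amplify the two targets but not the third type.
dangerous = ["ACGTACGTAA", "ACGTACGTAA"]
harmless = ["ACGAACCTAA"]
print(semi_universal_windows(dangerous, harmless, 6))
```

With real data the window would be primer-sized, and you would usually tolerate a few mismatches in the targets rather than demand perfect identity; the exact-match rule here just keeps the idea visible.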
So guessmers are for when you do not have a DNA sequence. Imagine that I'm working on some weird animal which only occurs on a single tropical island, and no one has ever worked on that animal before, so there is no DNA sequence available. No one has spent the 10,000 euros to build a reference genome for your species of interest, but you do want to learn something about this animal. For example, you know that this animal is, well, let's say a goat. So you sail to some island, you find a goat there which no one has ever studied, so you're not sure what the genome of this goat looks like. You don't have a genomic sequence. What you do know is that probably a lot of the proteins that this animal has are very similar to proteins which occur in other goats. So what you can then do is say: okay, I'm going to take something which every other goat has, for example hemoglobin, and I'm going to design my primers based on the protein sequence of hemoglobin. So what you do is you take your amino acid wheel. Let me pull up an amino acid wheel for you guys. All right, let's go and look for one. No, that's not a single picture; I just want a single picture which shows relatively well on stream. All right, so let me show you guys my Firefox.
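The idea of going from a protein sequence back to a "best guess" DNA sequence can also be sketched in code. This is a minimal sketch, assuming the standard genetic code and writing the uncertain positions with IUPAC ambiguity letters (N for any base, Y for C or T, R for A or G) instead of question marks. The function names are made up for illustration, codon usage of the organism is ignored, and for amino acids with six codons (leucine, serine, arginine) collapsing position by position is more permissive than the true codon set.

```python
from itertools import product
from math import prod

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# IUPAC ambiguity codes: which set of bases each letter stands for.
IUPAC = {frozenset(bases): code for code, bases in {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}.items()}
SIZE = {code: len(bases) for bases, code in IUPAC.items()}


def back_translate(protein: str) -> str:
    """Degenerate DNA sequence that covers every codon of each amino acid."""
    dna = []
    for aa in protein.upper():
        codons = [c for c, a in CODON_TABLE.items() if a == aa]
        if not codons:
            raise ValueError(f"unknown amino acid: {aa}")
        for pos in range(3):
            dna.append(IUPAC[frozenset(c[pos] for c in codons)])
    return "".join(dna)


def degeneracy(dna: str) -> int:
    """How many concrete sequences a degenerate sequence stands for."""
    return prod(SIZE[base] for base in dna)


# Alanine, aspartate, glutamate, glycine (one-letter codes A, D, E, G).
guess = back_translate("ADEG")
print(guess, degeneracy(guess))
# Methionine and tryptophan have a single codon each, so no ambiguity.
print(back_translate("MW"), degeneracy(back_translate("MW")))
```

Scanning a protein for the stretch with the lowest degeneracy is then a direct way to pick the least ambiguous region for a guessmer.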
And here we have the amino acid codon wheel. So if we know that the protein sequence is going to be, for example, alanine, aspartate, glutamate, glycine, so just the first few here, then of course we can figure out what the possible DNA sequences are going to be. We know that if we have an alanine, we definitely have a G and then a C, but the third base we don't know. This is because of the wobble base, the third base of the codon, which is degenerate. But we can then write down on our piece of paper: okay, it's definitely a G, then it's a C, and then it's a question mark. The next one is aspartate, so again it's a G, it's an A, and then it's either a U or a C. So we can just write down the most likely DNA sequence for the protein and then use this most likely sequence to design our primers on. Of course, this back-translation is problematic, but it is feasible. You get a lot of these unknown positions in your DNA sequence, and different organisms also use different codons preferentially. But in theory you could write down the most likely sequence for the protein that you're interested in, in this goat that you just found on some random island. So you back-translate the protein sequence using the codon usage table. At the five prime and three prime regions, so the beginning and the end of the protein, you identify the regions which are least ambiguous, so where you have the highest confidence. You say: okay, there's like five question marks here, and in this region there's like seven, so I'm going to design my primer on the region where I only have five. Then you design and match forward and reverse primers like normal, but now you make your primers longer, because the longer your primer is, the more flexibility you have. A single mismatch in a primer that is 22 base pairs
long is very detrimental; it will reduce your PCR efficiency tremendously. But if your primer is 30 base pairs long, it doesn't care too much about a single mismatch or two. So you design primers that are like 30 to 35 base pairs long, so that you are fairly sure that it will bind even though there might be two or three mismatches in the region. And of course we use a slightly higher annealing temperature at that point, because we want to make the primer annealing more stringent: we want to make sure that when it binds, it binds strongly, because there can be a couple of mismatches.

So guessmers are really interesting. We don't use them a lot anymore, because it's so cheap nowadays to just sequence the animal, get the whole genomic sequence, and design primers on the DNA. But we used to do it a lot, especially for new organisms. For example, for Arabidopsis we have all these different ecotypes all across the world, and if you found a new ecotype, then you would not know the exact sequence of this plant, but you would know more or less: okay, it probably has the genes that are necessary for photosynthesis and these kinds of things. So you would design primers based on these kinds of assumptions that you had about the organism. And this is more fun, because it's not just automated clicking around in a computer. No, you really have to sit down, use the codon wheel, write down the protein sequence, then write down the possible DNA sequence, and then design your primers based on that. So it's more of a trial-and-error process and it requires more thinking.

All right, so that was the advanced primer part; let me summarize. You can make primers which have multiple purposes. You can do things like multiplex PCR, where you are amplifying not just one region of the genome but like five or ten regions at the same time. Nowadays multiplexes are actually quite
multiplex. For one of the studies that I'm currently looking at, we are actually using like 2,000 primers in one reaction, so we're targeting around 900 regions in the genome at the same time. So we have 2,000 primers, and all of these primers need to be more or less adjusted to each other. If you add a new primer, then you have to go through all of the old primers and see if they form dimers. So this is really where you need to use a computer to figure out if adding two new primers will upset the whole reaction or if it's perfectly fine to add them. Then we have universal and semi-universal primers, and we have guessmers. There are many other types of primers, but these are generally the ones that are used the most.

And there are a lot of fields where primer design skills are required. It doesn't matter what kind of master's you do, or if you want to do a PhD later on: generally, if you do a PhD and your PhD has a part in the lab, you will need to be able to design primers. So things like real-time PCR, so measuring the expression of genes, but also population polymorphism studies, where you look at different populations which are in different areas: there you also use primers. And you can also do internal probe design. For example, if you're designing a new microarray, you're also designing primers, because a microarray is nothing more than a glass plate with a lot of primers on it. But the very basic rule remains the same everywhere: achieve the appropriate hybridization specificity and stability. So make sure that you target one region and only one region, and make sure that when you have targeted that region, the reaction is stable, so that the two primers that you're using have the same melting temperature, they do not bind to each other, and they do not form any weird hairpins.

Good. So the next step will be me showing you how to design primers and how to use the different tools that are available. So how do I
use bioinformatics to do my own primer design? There's a little bit of blah blah about databases and these kinds of things, but I'm going to skip through that. So again, I hate this, I hate this. Microsoft Word is such a bad program, because it doesn't recognize that the database is called Ensembl, without the e. It just thinks that you're talking about an ensemble, so people playing the trumpet together, so it will always put the e at the end of Ensembl. Automatic text correction is horrible sometimes. All right, so I wanted to tell you guys a little bit about databases and about basic functions of Ensembl, and then about finding a genomic location, so how do we extract the sequence that we're interested in. "You should add it to your dictionary"? Yeah, I definitely should, because it goes wrong in papers all the time as well: I say, oh, we downloaded the sequence from Ensembl, and then the e is there again. But so: how do we find a genomic location that we want to amplify, how do we export sequences, and how do we then use the different tools that are available to do primer design?

So genome browsers are ubiquitous. If you go to Ensembl, we see a genome browser. Let me just pull up Firefox for you. So this is how the Ensembl site looks. If we go, for example, to mouse and we just search for a gene, so your favorite gene of interest, then we can click on the first search result, and then here we have the component which is still loading, which is the genome browser. It takes a long time to load, apparently. One of the things that I don't like about Ensembl is that it always defaults to the normal HTTP site and not the HTTPS one. But this is a genome browser. A genome browser is there to show you the whole genome or parts of the genome, and it shows you where the genes are, and it shows you all kinds of other
information, like where the variants are in the sequence, what kinds of phenotypes are associated, and where things like regulatory regions are. The nice thing is that you can add a lot more tracks, so here we see a little CTCF site and all of these things. But that is more or less what a genome browser is: it is there to annotate and visualize the whole genome or part of the genome. It allows you to use different scales and zoom in and zoom out, and it gives you a coordinate system, so a way to talk about where in the genome we are. A coordinate system means that we can say that our gene is located on chromosome 1 at 2 million base pairs. And of course it's integrating all kinds of different sources of information. Take the Ensembl browser that we were just looking at. Here we have the information about the sequence, somewhere all the way at the bottom. No, the sequence is not shown because I'm zoomed too far out. Here we see, for example, the genes, and the genes of course come from one database. The contigs, which are the building blocks of the genome assembly, come from a different database. Then here we have the genes, which again come from a different data source, and then we have the sequence variants, which generally come from dbSNP. We have the phenotypes, which again come from a different database. So a genome browser is there to integrate all of these different information sources and visualize them for you in an easy way, so that you can just say: okay, there's my gene, and there's a regulator, and there's a binding site for a histone or something like that. So in the end you are looking at web services. Oh, this is very poorly visible, but you're looking at a web server, and this web server is connected to a database, and of course there are many, many different databases that
get integrated into a single website when you're looking at it. And of course all of these databases are filled with data which is generated in a lab. So there are literally hundreds and hundreds, or perhaps even thousands, of biology groups around the world all doing experiments; they feed their data into these databases, and you have access to all of it at the click of a button.

So when you choose a database, there are some things that you have to look at, because not all databases are created equal. One of the things that you need to look at when you select a database to use in your research is the availability of the database and whether it is up to date. If you're using a database which has not been updated since 2010, then the information that you're looking at is not very up to date, and you want to have the latest information. So you always want to look at when the database was last updated. But you also want to see if the organism that you're interested in is actually included in the database. There are databases like OMIM, Online Mendelian Inheritance in Man, which is a really good database, but if you're working on mice it's a useless database, because they don't have any mouse information. You also always want to make sure that the database that you are using provides you with reproducible information. That means that the database should allow you to go back in time. That is one of the nice things about, for example, Ensembl. If we switch back again and I go all the way to the bottom, it says here: this is Ensembl release 105, and this release is from December 2021. But if I click here on the archive site link, it allows me to go back in time, so it allows me to look at the database as if it were December 2017. That means that Ensembl gives me the ability to redo an analysis like it was 2017. And of course papers which are
published in 2016 or 2014 used that version of the database. So if you want to reproduce their research, then often you can't do that in the newer database; you have to switch back to the old version to get the exact same results, because our knowledge about the genome is continuously improving, or at least continuously updating. It's the same with the reference genome for mice. We are now at GRCm39 (mm39), so that is the current version of the mouse genome, but a lot of people are still working on the previous version, GRCm38 (mm10), because they started writing their scripts or doing their analysis based on that version of the genome. But with the new update everything moved: every gene moved a hundred or a couple of thousand base pairs, because we have more information. We used to have a little gap in chromosome 1; this gap is now filled, so all of a sudden chromosome 1 became 10,000 or 20,000 base pairs longer. That means that if you write a paper, you should always mention not only the position of your gene but also the genome version that you are using, because the same gene can be located at a very different position when the database updates to a new genome version. And that is why Ensembl is such a good database: you can go all the way back in time, even back to 2009, and you can pretend that it is 2009 and do research. So if you have a paper published in 2010, it would have used this 2009 database, and you can redo the research. That's one of the things which is very important: that what you do is reproducible. So the old data sets and old databases should be available.

It used to be that the location of the database was very important as well. Nowadays it matters less, because everyone has broadband internet and all of these things, but still, transferring a lot of data from China to Europe is going to take time, and it's going to be difficult to transfer
like a couple of terabytes or a couple of gigabytes from China all the way to Berlin. The same thing holds for transferring stuff from Haiti: it's an island, they have very poor internet, and there's only limited bandwidth. So the location of the database used to be very, very important, and as a European researcher like 15 years ago you could not easily use Chinese or Japanese databases. Nowadays, with the cloud and databases being replicated in different Amazon Web Services zones and so on, it's relatively easy, but the location of the database used to be important.

You also want to look at what is available software-wise: does the database provide any analysis tools that you can easily use? And of course there's a lot of personal taste: what do you like about the database? Some people like Ensembl, some people like the UCSC Genome Browser. I'm a big Ensembl fan and almost never use UCSC. They have the same genome, just a different way of reporting the data, and I'm just used to Ensembl; I can work with Ensembl. So using a different database for my genomic information is possible, but it would take me a lot of time to get used to it. So these are three big databases which provide genomic information: this is Ensembl, here you see UCSC, and here we have the Map Viewer from NCBI. Same data, just a different way of presentation, and each database has its own advantages and disadvantages.

Ensembl we already saw. So drilling down a little bit into Ensembl, there are some things that you should know. If we look at a certain genome, the thing which is important to mention in your paper is the assembly. If we look at cows, and I don't know if this is the current build in cow, but when I last looked on Ensembl the genome build for cows was from October 2007, and the genome is called Btau 4.0, which means that there was also a version three,
there was also a version two, and there was also a version one. So if you write a paper, make sure that you mention which database version you used. There's some statistical information, like how many protein-coding genes have been detected, how many pseudogenes, and these kinds of things, but the most important thing is the version of the database that you are currently using.

When you search in Ensembl, you can search by things like gene symbol or database ID, but you can also search for a position in the genome. So if you're just interested in a certain region because you found a QTL there, like in the scan we did at the beginning, where we found two of these peaks, then of course we can just search by the position on the genome and see which genes are there. There are a lot of ways of combining these searches, but I think that nowadays, with Google, everyone knows how to use things like AND and OR, and how to quote things to make sure that you search for the whole phrase and not the three individual words.

The nice thing about gene symbols nowadays, and this is a very big difference from the old days, is that they are standardized. Up until like 2007, the same gene in humans could go by like 15 different names, and this is still a problem when you're looking in old literature, because some people would call the gene... well, let me actually get you an example. Let me search for a not so well-known gene. Let me show you guys my Firefox window. So if we go to mouse and we search for Ilr2 or something, I don't know if that's a gene, but let's limit ourselves to mice. Novel transcript; yeah, novel transcripts just have one name. Let's look at Rps3. No idea what kind of gene it is. It is really, really slow today. All right, there we go. So here we see that Rps3 is also called D7Ertd795e, but also Rs_3. So if you are interested in the literature about this gene, then you have to not only search for
ribosomal protein S3; you also have to include these two names in your literature search, because they are the same gene. And some genes actually have up to 15 different names. During the last 50 years of publishing about genes, the name of a gene changed, or other people called it differently because they thought that they were working on another gene, but it turned out to be the same gene that the other group was working on. And now you have two names for the same gene, and one group has been publishing for five years and the other for 10 years, so now you have a mismatch in the literature. When this came to light, and it became a much, much bigger problem over time, a committee was formed, which is called the HUGO Gene Nomenclature Committee. When you go to www.genenames.org, there is now a single approved name for each gene in the human genome. This is a massive advantage, because you now have a unique reference for a gene in scientific articles, and it is easier to search for genes in a database. And the nice thing is that this committee also made it so that gene families are now named consistently. So if you look at cytochrome P450 proteins, they are called CYP and then a number; if you look at the different homeoboxes, they are called HOX with a number. So you can directly identify: oh, these genes are all homeoboxes, so they all have a similar protein structure, they all have a similar function, and so they belong to one family. And because of this HUGO Gene Nomenclature Committee, we now in other species kind of follow their naming. So for orthologous genes, so genes which occur in humans and, for example, in pigs, we can now use a single gene name, because the gene name in pig is more or less standardized towards the human gene name. But it used to be a massive, massive mess, and
depending on which species you are working on, it still is. I told you guys about goats: goats have a very poorly annotated genome. Many genes are just completely missing from the annotation, since no one has ever investigated them, and the same gene can again have different names. There are sometimes different names for the same protein, and this is just a massively confusing mess. This HUGO Gene Nomenclature Committee, so the HGNC, was created because of this kind of confusion in the literature, to solve it in a single, consistent way.

All right, so I wanted to show you guys a little bit of an example. This is ABCG2. Instead of showing you the slides, let me just do this live. I wanted to search in cows, so just go here and select cow, which starts with a c, somewhere. Cat... oh crap, I just missed the cow. All right, no, don't go to bushbabies, go to cows. Cow, cow, cow... Chinese hamster, sea lion, cat, cow, there we are. All right, so this is the cow genome. The first thing that we can see here is that the cow genome has been updated, because the current assembly is called ARS-UCD1.2. So it's not called Btau 4.0 anymore; that's the previous one, or the one before that. We can look at the karyotype, so we see that cows have 29 autosomes, an X chromosome, and the mitochondrial genome. If we then look here, we see indeed that this is a relatively new genome, because it was constructed in April 2018, and we can see that it has 2.7 billion base pairs. So just some basic information about the genome. All right, so when we go back to cows, we want to search for ABCG2, and that's a cattle gene. Now one of the things that we can see is that I made my slides on the previous version of the genome database, so
this gene is located on chromosome 6 at 36,475,377 base pairs; that's where the gene starts. However, the gene used to be located, also on chromosome 6, but at 37.9 megabases. So that's well over a million base pairs away. From one version of the genome to another, this gene shifted by more than a million base pairs, and that is why it is so important to always mention which genome build you're working on. Otherwise people might say: but the gene that you're talking about is not located at this position anymore. All right, so let's go back. Here in Ensembl we see a whole bunch of click-out links. We have things like comparative genomics, the gene tree, and we can look at the orthologs and the paralogs, which we already talked about. The first thing that I wanted to show you guys is the gene tree, I think. Let me see... not the gene tree, I want to look at the orthologs. It is relatively slow today, actually. Paralogs, genomic alignments... where did I want to look? Because I wanted to show you guys something interesting about this gene. Let me see, we want to go to the gene tree, that's probably it. Yeah. So here we see the gene tree. We see that this ABCG2 gene in cows has this structure, but then when we look, we can see that the same gene occurs as well in Bos indicus, which is not the standard cow that we know but the cow from India. You see that American bison has the same gene as well, oxen have it as well, and Caprinae, which are goats and other goat-like animals with horns, also have the gene. But when we look further, we see something interesting, because a very similar gene also occurs in whales. In whales! How the hell are whales closely genetically related to cows? This is something that is interesting, because this database has a range of species, so it actually gives you kind of a tree of
life. So we can actually see here the genetic relationships between cows and whales and dolphins and so on, because we also have the cetaceans in here. We can also see that they are relatively closely related to hedgehogs, and based on the gene structure you can see that sperm whales and cows are actually highly related to each other, which is kind of an interesting finding. But if we want to learn a little bit more about this gene, then we can see here the transcript table. When we show the transcripts, we now see that this gene has seven different transcripts, so this single gene codes for seven different proteins. Again, let me show you the old version of the database: the old version only had two of them. The presentation is not from that long ago, but you can see that every time the database gets updated, new information gets added. So if you would use the new version of the database to answer the question "how many transcripts does this gene have?", you would say seven, while using the old database you would say two. Unfortunately, both of those transcripts code for a 658 amino acid protein. But in the new version of the database you can already see that it has a lot more different variants, and they are not that well defined, because here at the biotype you see this color, and the red color means that there's not a lot of certainty about each of these transcripts. But we know that at least one of them should be really true. So if we want to look at the transcripts, we have to go to the transcript page. If we go to transcript comparison, then... the transcripts selected... select transcripts... where's the select transcripts? They make it hard. "You must select transcripts", or by clicking here. All right, let's click there. Let's compare a couple of these, so just add all of them, then press go, and then it will
start comparing all of the different transcripts. This is not exactly what I wanted, but here you see the alignment. When we talked about designing primers for multiple human papilloma virus variants, this is the first thing that you need to do: you need to align these sequences to each other. So here you see the different transcripts, all seven of them, and you see that if I want to target transcripts one, two and four, then I can design my primer here. This will amplify only one, two and four, but it won't amplify the other ones. If we go down a little bit more, then at a certain point we should start getting more of these sequences... or not, because it might be... yeah. So for example here, if we want to detect the presence of not just one, two and four but also number three, then we can just design our primer here. The further we scroll down, the more overlap between the different sequences there will be. That's how you can use the alignment to figure out where you should put your primers. So this is what you can use the database for, because the database has a built-in aligner which can align different transcripts for you. If you're interested in amplifying a couple of them, or you want to amplify all of them, then you would say: no, we want to design our primer somewhere around 81 thousand base pairs into the gene, because here we see that all of the transcripts have this piece of DNA inside of them, so we can target all of them at the same time. All right, so of course Ensembl is a massive database, so definitely just click through it. You have things like the external references page, where you can go to literature or to UniProt or to WikiGenes. WikiGenes is also really nice because it's kind of an open-editing wiki platform per gene, so about every gene people can write
something and they can add information. The genomic alignments show you the location of the gene and the sequence, and the phenotype button shows you which phenotypes are associated with this gene. I think this is the reason why I chose this ABCG2 gene. Let me scroll all the way up in Firefox. So let's go to phenotypes... none found. Interesting, interesting. So there are some phenotypes which have been associated with this gene in mice, but I think this gene was very important for milk production in cows. It doesn't matter too much anyway. Generally what we want is to see single nucleotide polymorphisms, because these will allow us to do things like tracking origin. If we would just amplify a region of the genome, or a region of the gene, then it would be the same for goats from Sudan versus goats from Italy. But the single nucleotide polymorphisms allow us to PCR out a piece of the genome and then sequence it to see which letter is there. If we know that all Sudanese animals have a G while all Italian animals have a T, then we can use the results from sequencing to say: this animal, which we have never seen before, is probably of Sudanese origin, or it's probably of Italian origin. So if we want to do genome identification, we want to know SNPs. So let's go back to the database. We can go to the variant table, and the variant table will show us all of the known variants in this gene. Here we see that for this variant, at this position, the reference allele is a T and the alternative allele is a G. It is a SNP and it comes from dbSNP, so that's all perfectly fine. And you have the evidence, for example the frequency: for some of them there are known frequencies in different breeds. So here we can see that the minor allele frequency in the highest population is 0.6 and
we can actually go to genes and regulation, but also to population genetics. I don't know if this one is in multiple populations... no, so it's only measured in one population. But if you're interested in cows, for example, then some SNPs have very well characterized population genetics, saying: if this is a T, then it's highly likely that it comes from Asia; if this is a C, then it's highly likely that it comes from the US. But this is not the case for this one, because this one is only classified in one population. All right, so we can then look at all of the SNPs, and generally what you want to do is filter on the consequences, because we are generally only interested in SNPs which change the protein, which are called missense variants. So we can say: turn everything off, and then only give us the missense variants in this gene. These are missense variants, which change the protein: instead of having a certain amino acid, these mutations make it so that you get another amino acid, and this of course has an influence on the function. So in this case, let me go back to the presentation, we want to look at a certain one. A missense mutation means that it changes the amino acid, and we can use dbSNP to get more information. In this case we want to look up rs43702... 43702, can I just search for that? 43702, search. All right, so let me go back to Firefox. Here we see the SNP. The SNP is in there seven times, because it affects all seven transcripts of the gene. What we can see is that it's located at this position, and it's an A/T SNP, so the reference genome has an A and some animals have a T. There's a lot of evidence, so it's actually been cited in a lot of publications, and we probably also have multiple observations across multiple populations. So here we have one of these SNPs
which is relatively well studied, which we can use to figure out where our cow comes from. And it changes the Y amino acid to an F. I don't have the complete amino acid table in my head, but it changes the protein at position 581. So imagine that we want to PCR out this very well-known SNP, because there are publications about it and there are things like multiple observations. In the next part of the lecture, we will go through all of the steps that you need to do to extract this piece of the genome from the cow genome and then send it in for sequencing to find out which allele is there. All right, so that's it for this hour. Oh yeah, people on YouTube, thanks for staying until the end, and see you in part three of the lecture.
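A note on the transcript comparison from this part: the idea of putting a primer where all the transcripts you care about share the same piece of sequence can be sketched in a few lines of Python. This is only a toy sketch under assumptions. The function name `shared_primer_sites`, the transcript labels T1 to T3 and the short sequences are made up for illustration and are not taken from Ensembl; a real ABCG2 alignment is much longer. It only looks for gap-free windows that are identical across the chosen transcripts, and it ignores melting temperature, dimers and the other rules from part one.

```python
def shared_primer_sites(aligned, targets, primer_len=20):
    """Return 0-based alignment columns where a window of primer_len bases
    is gap-free and identical in every target transcript.

    aligned: dict mapping a (hypothetical) transcript label to its aligned
             sequence, with '-' marking alignment gaps.
    targets: labels of the transcripts the primer should amplify.
    """
    seqs = [aligned[name] for name in targets]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must all have the same length")

    sites = []
    for start in range(length - primer_len + 1):
        # Collect the window from every target; one distinct value means identical.
        windows = {s[start:start + primer_len] for s in seqs}
        if len(windows) == 1 and "-" not in next(iter(windows)):
            sites.append(start)
    return sites


# Toy alignment (invented sequences): T1 skips a short stretch,
# T3 lacks the beginning that T1 and T2 share.
aligned = {
    "T1": "ATGGCCTTAGCA----GGTACCGATTACA",
    "T2": "ATGGCCTTAGCATTTTGGTACCGATTACA",
    "T3": "--------AGCATTTTGGTACCGATTACA",
}

# Sites usable for a primer that should hit all three transcripts.
print(shared_primer_sites(aligned, ["T1", "T2", "T3"], primer_len=8))
# Sites usable when only T1 and T2 need to be amplified.
print(shared_primer_sites(aligned, ["T1", "T2"], primer_len=8))
```

Asking for fewer transcripts gives more candidate positions, which matches what you see when scrolling through the alignment: a primer at the start only catches T1 and T2, while further down there is a stretch that all three share.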