All right, welcome everyone. If you're watching this on YouTube or on Twitch, welcome to lecture number nine, I would say, off the top of my head. It's in the description, but I can't see the description myself at the moment. Today we will be talking about primer design. Primer design is one of those lectures that I love, especially because of the inventor. But first, let me pull up my slides and show you what we will be talking about. The first topic is the polymerase chain reaction. I will mostly be talking about what makes a good primer, when a primer is really a primer, and some advanced primer design techniques, which I really like. So we will cover things like multiplex PCR, where you amplify not one piece of the genome but multiple pieces, and things like universal primers and gasmers. After that, in part two, we will be talking about databases like genome browsers and the basic functions of Ensembl. Let me actually fix the typo. I hate the typo. Ensembl. That's it. All right. And then how to find a genomic location, export your sequence, and then use a computer to do the primer design, because of course you can design primers by hand, but it's much better to have a computer do it for you.
But first, like always, we will have the answers to the previous assignments. So let me pull up the assignments. They are, of course, from lecture number eight, the phenotypes and QTL mapping lecture. I actually forgot to put them on Moodle and on my website over the weekend, so I only uploaded the assignments on Monday. I hope everyone still had enough time to at least look a little bit into them. Again, we were using R, so it's kind of a follow-up to the R lecture, to have something to work on.
So let me open up the R window for you guys, and let me open up Notepad++ so that we can switch back and forth. Okay. During this lecture, we will be using two different files. One of the files we already saw during the phenotype lecture, I think. It has 24 different metabolites measured. Metabolites, of course, come in a wide range: you either have almost none of the metabolite detected or you have really high amounts, and this varies between individuals in the same population. So here we have a data set which has 162 individuals in total. Of course, some of them have nothing measured. And we have 24 different metabolites that were determined using mass spectrometry.
Besides that, we have our genotypes. As you can see here, our genotype file is nothing more than a whole bunch of markers. So PVV4 is the name of a marker, and GD86L is a marker as well. And this file, because it is a recombinant inbred line population, is filled with ones and twos. So a little bit more about the data set. I don't think that was in the assignment, because the assignment just assumes that you know what a genotype file is. This is a cross in Arabidopsis between Columbia, which is one of the ecotypes of Arabidopsis and comes from Columbia, of course, and Landsberg erecta, which is a different ecotype, and I am not exactly sure where it comes from. I should know, because I did a lot of Arabidopsis work, but I forgot. Anyway, here we have these two files. On the one hand, we have the phenotypes, so the metabolites that we measured in the different plants. And on the other hand we have the genotypes, which at every marker in the genome (we have a couple of hundred markers, I think) tell us whether, for this individual, the piece of DNA came from Landsberg erecta or from Columbia.
So I think one is Landsberg and two is Columbia. Individual one at the first marker on chromosome one has a one, which means that this individual inherited this part from Landsberg erecta, while individual three, for example, has a two, which means a Columbia origin for this marker.
First things first, I'm just going to read the question. For this exercise, we use the phenotypes which we used before. Additionally, we need the genotypes file. First, we want to find the amount of missing data per marker, because if a marker has a lot of missing data, then we probably don't want to do the association with this marker, so we just want to throw it out. We can use the is.na function to check whether a marker has NAs and how many there are. The is.na function returns a vector of TRUE and FALSE values, and we can sum up this logical vector to get the amount of missing data. Using a for loop, we can loop through the markers or individuals in our genotype data, and using the c function, we can create a vector of missing data. In this example, we need to add our own code to compute the amount of missing data. So that's just a standard little example for you guys to start working with.
The idea here is that we first load the genotypes and phenotypes. So let me do that. Of course, if you start an R file, always add a header. So again, mention that this file contains the answers to the assignments of lecture number eight and that it was written by me. Then the first thing that I do is move to the place on my hard drive where I have stored my files. As you can see, these files are on my OneDrive. And then I just load in the phenotypes file and the genotypes file. In this case, when I look at the file, I can see that it has a header, because every column has a column name. That means that I have to specify that in the call when loading it in.
Furthermore, if we look at the file, we see all of these little yellow arrows. Those little yellow arrows mean that the file is tab separated, and a tab is specified by backslash t. So let's do this and go to R. Let me open up the R window for you guys. I need to move it out of the way for myself. But let's just load them in. If we want to know how many genotypes we have, we can just ask for the dimension. The dimension of the genotypes is 162 rows, 117 columns. That means that there are 162 individuals and 117 markers that have been measured. I always look at a little piece of the file just to make sure that everything was loaded in correctly. So I generally just say: from this matrix, select the first 10 rows, select the first 10 columns, and print those to the screen for me. And that all seems fine. There are row names, there are column names, so R understood the file. Let's do the same thing for the phenotypes. If we look at the first 10 rows of the phenotypes, we see that they seem to be loaded in correctly. If we go back to Notepad++ and go to the phenotypes, then we see indeed that there's X3 there as well. So it all seems to be perfectly fine.
All right, so the first thing was to figure out the amount of missing data. Let's go back to Notepad++. Here we just say missing data, because we need a variable to store this. For every marker, so for every column of the genotype matrix, we want to figure out how much missing data there is.
The annoying helicopters are back. I hope it's not too loud. I don't have the window open, but they're flying really close over the building. So I hope it's not too bad in the back, but tell me if you can hear them. Good. So it's just me. There are just helicopters in my mind, probably. You hear nothing? Okay, okay, then that's good.
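The loading and checking steps described above can be sketched roughly as follows. This is a minimal, self-contained sketch and not the lecture's actual script: the temporary file, the marker names (`M1` to `M3`) and the values are made up so that the example runs without the real data files. The calls themselves (`read.table` with `header = TRUE` and `sep = "\t"`, `dim`, and indexing a corner of the table) are the ones named in the lecture.

```r
# Write a tiny tab-separated genotype file (invented values) so the example
# is self-contained; the real files from the lecture are not needed here.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "M1\tM2\tM3",            # header row: one name per marker
  "ind1\t1\t2\tNA",        # first field of each data row is the row name
  "ind2\t2\t2\t1",
  "ind3\t1\tNA\t1"
), tmp)

# header = TRUE because the columns are named, sep = "\t" because of the tabs
genotypes <- read.table(tmp, header = TRUE, sep = "\t")

dim(genotypes)        # number of rows (individuals) and columns (markers)
genotypes[1:3, 1:3]   # look at a corner of the data to check the import
```

Because the header has one field fewer than the data rows, `read.table` uses the first column as row names, which matches the "there are row names, there are column names" check in the lecture.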
That's good. All right, so what we do is we say missing data is NULL, because initially we just want to create a variable, and this variable holds nothing. And then we say: for x in one to the number of columns of the genotypes. We could also have used the column names, because every column has a column name, but in this case I'm just going to use the standard for x in one to the number of columns. So the first time that we go through the loop, x will have the value one, the second time it will have the value two.
What do we do inside the loop? We do a one-liner. The first part I gave you: I'm saying that the new missing data is a combination of the missing data that we already had, which is empty the first time that we go through the loop, and the percentage of missing data that we still need to calculate. That percentage is actually relatively easy to calculate. You say: take from genotypes column number x, and then ask is.na. This gives you back TRUE or FALSE for every individual, because we're looking across a single marker. If an individual is missing, it will say TRUE, because we're asking "is it missing?". For individuals that have a one or a two there, it will say FALSE. And since TRUE is one and FALSE is zero, we can just sum this up. Then we need to divide by the length, and multiply by 100, to get the percentage. The result is directly added to the missing data vector, which grows with each pass through the loop.
Then, of course, we want to plot this data to show how much missing data there is per marker. We need to give it a y label, an x label and a main title. So on the x label, we say marker, and on the y label, we say percentage.
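As a sketch of the loop just described, assuming a small made-up genotype matrix in place of the real 162 by 117 data set (the marker and individual names are hypothetical):

```r
# Toy genotype matrix (invented): 5 individuals (rows) x 4 markers (columns),
# coded 1 or 2, with NA where the genotype could not be determined.
genotypes <- matrix(c(1, 2, NA, 1, 2,
                      2, 2, 1,  1, 1,
                      NA, NA, 1, 2, 2,
                      1, 1, 2,  2, 1),
                    nrow = 5,
                    dimnames = list(paste0("ind", 1:5), paste0("M", 1:4)))

missingdata <- NULL                      # start with an empty variable
for (x in 1:ncol(genotypes)) {
  # is.na() gives TRUE/FALSE per individual; TRUE counts as 1 when summed
  pct <- sum(is.na(genotypes[, x])) / length(genotypes[, x]) * 100
  missingdata <- c(missingdata, pct)     # append this marker's percentage
}
missingdata
```

The plot from the lecture would then be something like `plot(missingdata, xlab = "Marker", ylab = "Percentage", main = "Missing data per marker")`. For what it's worth, `colMeans(is.na(genotypes)) * 100` should give the same numbers without a loop; the loop version is shown because that is what the assignment asks for.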
So let's just run the code and see what happens, so that you guys can see the nice figure. You can see that this data set was actually genotyped pretty well. We see here the roughly 120 markers, and we see that most markers are 100% genotyped, so there is 0% missing data. For some markers, there's like 3% missing data. But this is all perfectly acceptable. Generally, I put my threshold for missing data at 5% or 10%. So if more than 5% to 10% of the individuals are missing at a marker, then I throw out the marker because there's too much missing data. Of course, this really depends on what kind of data set you're working with. But for QTL mapping or for genotypes, because these genotyping arrays are really good, having more than 5% of the individuals not determined or missing is generally an indication that something is going wrong at that marker.
All right, so this is the missing data per marker. We also asked for the missing data per individual. And of course, this is the same thing. It's just that we now go not through the columns of the genotypes but through the rows, because the individuals are in the rows. The only thing that changes is this part here. Instead of taking the column from the genotypes, so comma x, we now take the row, so we say genotypes x comma. For the rest, the code is exactly the same. So we can just copy-paste the code in, and I'll show you guys what happens. It looks like this. Now we see that most individuals have between 0 and 3 or 4% missing data, and there are two individuals which have a larger amount of missing data. For this individual, 6 or 7% of the data is missing, and for this one, 7%. So we could throw those out. But when we throw them out, we have to remember to also throw them out of the phenotype file, right?
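A minimal sketch of the per-individual version, including the step of removing the same individuals from both tables. The data, the phenotype column name `met1` and the 50% cutoff are invented for this toy example; for real data the lecture mentions a cutoff of 5 to 10%.

```r
# Toy data (invented): 4 individuals x 5 markers, plus one phenotype column
genotypes <- matrix(c(1, 2, 1, 2,
                      2, NA, 1, 1,
                      1, NA, 2, 2,
                      2, NA, 2, 1,
                      1, 1, NA, 2),
                    nrow = 4,
                    dimnames = list(paste0("ind", 1:4), paste0("M", 1:5)))
phenotypes <- data.frame(met1 = c(10.5, 3.2, 8.1, 7.7),
                         row.names = rownames(genotypes))

missing_ind <- NULL
for (x in 1:nrow(genotypes)) {
  # Same calculation as per marker, but on row x instead of column x
  missing_ind <- c(missing_ind,
                   sum(is.na(genotypes[x, ])) / length(genotypes[x, ]) * 100)
}
names(missing_ind) <- rownames(genotypes)

# Hypothetical cutoff for this tiny example only
threshold <- 50
keep <- names(missing_ind)[missing_ind <= threshold]

# Drop the same individuals from both tables so they stay aligned
genotypes  <- genotypes[keep, , drop = FALSE]
phenotypes <- phenotypes[keep, , drop = FALSE]
```

Selecting by row name rather than by position is one way to make sure the genotype and phenotype tables keep describing the same individuals after filtering.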
Because otherwise the phenotype file has 162 individuals and the genotype file only has 160. So we have to keep these two consistent with each other.
All right, next: create two plots, use the main, xlab... okay. Calculate the average percentage of missing data in the genotype matrix. That's up to you guys to do. I didn't do that assignment. The average percentage, of course, is nothing more than... well, let's do it. Let's go back to R and run the code for the missing data per marker. It shouldn't matter that much which one we use. Now our missing data vector has, for each marker, the percentage of missing data. If we want the average of this, then we just say mean, and it calculates the mean. So on average, there's like 0.4% missing data, which is really good for genotyping, and kind of what you expect, because nowadays SNP chips are pretty good at determining that.
All right, so the next step is to do a basic effect mapping. And this is already really hard, because a basic effect mapping means that you have to take the genotypes, take the phenotypes, align them to each other, and then do a t-test between the two groups, or calculate the mean of each of the two groups. So it's a lot more code. I gave you a little bit of code already in the assignment, which was this part: the means one and means two. We want to calculate the mean of the first group, so the Columbia group, and we want to calculate the mean of the individuals that got the Landsberg erecta genotype, and for that we have to go through each of the markers. This part in the middle was the part that I wanted you guys to write. So how does this code work? Well, we go through each of the columns of the genotypes, right?
And then when we're looking at a single marker, I'm just going to ask which individuals at this marker, so which of genotypes comma x, are one, so have an origin of Columbia, and which of genotypes comma x are two, which means: take column x from the genotypes and then see which individuals are two. I store this in two variables. So I have a variable which is called ind in one, so the individuals who are one, and ind in two, the individuals who have origin two.
And then I'm just going to calculate the mean. So I say: take the mean of the phenotypes of the individuals in one, and then just do x, right? No, not x. Just the first phenotype. So from the phenotypes matrix, I take the first column, and then I ask for the individuals in one, so take only those individuals out of the matrix, and then calculate the mean. Make sure that you remove NAs, because you cannot calculate a mean when there are missing values: the mean will then also be a missing value. We didn't look at the missing values in the phenotypes, so because I don't know, I'm just going to make sure that when I calculate the mean, I ignore the missing values. And then of course I add this to my means one vector. I do the same thing for the second means, where I just use the individuals in two. I hope that's clear. It's just a basic selection from the first matrix and then using that selection in the second matrix.
So when we run this, it will give us means one and means two. Let me show you guys the R window again. When I run this, it's very, very quick. But now when I look at means one, for example the means of the first 100, no, not 100, the first 20 markers, right?
Then I see that, for the first of those markers, the average of the individuals who had genotype one, so the genotype from Columbia, is 3,543. If I look at means two for the first 20, then I see that, oh, that's interesting: the individuals who had a Landsberg erecta allele at the first marker actually have a mean of 5,000. So much, much higher. Good.
Now if we want to visualize this, we can do this in two ways, but the easiest way is just to plot the difference, because if the difference is big, then we expect there to be something in the genome which is controlling the difference in the phenotype. So let me make that plot for you guys. When we do the plot, it looks like this. We see at the beginning that there's like a minus 1,500 for the first marker, which we just computed. But you can see that across the roughly 120 markers that we have, there are two, perhaps three regions which are relatively interesting. At these regions, we see that the one group is almost 5,000 units higher than the other group. Here there's almost 6,000 units increase, and here there's like 2,000 units increase. But we still don't know exactly what is going to be significant, because we only know that there is a difference, but not how significant this difference is. So it's just a very basic plot.
We could actually add the values themselves. So I would say points, and then means one, if I wanted to add the means one values, and I would say the color is, for example, green, and then comma lty, for line type, is two. Then in the little dots you see the value of the first group. And of course we could do the same thing for the other group, with the second means, and then we could see if there's a big difference. So we can see here that there is indeed a difference of like minus 1,500. And here we can see indeed that the green group is much, much higher than the red group.
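A sketch of the effect mapping loop as described, under the assumption of simulated data: the three markers, the phenotype column `met1` and the built-in effect at marker `M2` are all invented so that the example has something to find. The variable names `ind_in_1` and `ind_in_2` are my spelling of the "individuals in one" and "individuals in two" variables mentioned above.

```r
set.seed(1)
# Toy data (invented): 40 individuals, 3 markers coded 1/2, one phenotype
n <- 40
genotypes <- matrix(sample(1:2, n * 3, replace = TRUE), nrow = n,
                    dimnames = list(NULL, c("M1", "M2", "M3")))
# Make the phenotype depend on marker M2 so that one marker shows an effect
phenotypes <- data.frame(met1 = 1000 * genotypes[, "M2"] + rnorm(n, sd = 100))
phenotypes[5, "met1"] <- NA              # one missing measurement

means1 <- NULL
means2 <- NULL
for (x in 1:ncol(genotypes)) {
  ind_in_1 <- which(genotypes[, x] == 1) # individuals with genotype 1
  ind_in_2 <- which(genotypes[, x] == 2) # individuals with genotype 2
  # Mean of the first phenotype in each group, ignoring missing values
  means1 <- c(means1, mean(phenotypes[ind_in_1, 1], na.rm = TRUE))
  means2 <- c(means2, mean(phenotypes[ind_in_2, 1], na.rm = TRUE))
}

difference <- means1 - means2            # large absolute value = candidate region
difference
```

Plotting `difference` against the marker index, and optionally overlaying `means1` and `means2` with `points`, gives the kind of picture discussed in the lecture. As noted there, this only shows that a difference exists, not whether it is significant.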
So there's something at this point which is segregating, and which is linked, or seems to be linked, to our phenotype. Good. Let me switch back to Notepad++.
All right. The next thing that we wanted to do is a basic QTL mapping. Since we have only two groups, we can get away with using very, very basic statistics: we can just do a t-test. If we have two groups, we can say, well, I do a t-test and I compare group number one with group number two. So again, I do the same thing as I did before. I go through each of the markers in the genotypes, and since the markers are in the columns, I'm going from one through the number of columns. And I define a new variable called pvals, which will store my p-values. It's just an empty vector at the beginning, and every time that I go through the loop, I add one p-value to it. The p-value is nothing more than saying: do a t-test between, from the phenotype matrix, the individuals who are one in the genotype matrix, taking the first phenotype, and compare this to the phenotype matrix where I select the individuals who are two, taking the first phenotype as well. And then do dollar p.value. This gives you the p-value associated with the difference.
Then, of course, when we make a plot, we don't plot the p-values themselves, and I will show you why. Let me just run this in R. If I just plot pvals, then I get a plot which looks like this. And of course, I cannot see the difference. Is this significant? I can say abline, give me a horizontal line at 0.05 divided by the number of markers, because that is my significance threshold. And then you see that there's a line being drawn here. But I can't see if these dots here are below the line or above the line. Right?
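The t-test loop described above might look roughly like this. Again the data are simulated (four hypothetical markers, with an effect deliberately placed at `M3`), so this is an illustration of the technique rather than the lecture's own script; `t.test` and its `p.value` element are standard R.

```r
set.seed(42)
# Toy data (invented): 60 individuals, 4 markers coded 1/2, one phenotype
n <- 60
genotypes <- matrix(sample(1:2, n * 4, replace = TRUE), nrow = n,
                    dimnames = list(NULL, paste0("M", 1:4)))
# The phenotype depends on marker M3 only
phenotypes <- data.frame(met1 = 500 * genotypes[, "M3"] + rnorm(n, sd = 100))

pvals <- NULL                            # empty at first, one p-value per marker
for (x in 1:ncol(genotypes)) {
  ind_in_1 <- which(genotypes[, x] == 1)
  ind_in_2 <- which(genotypes[, x] == 2)
  # Two-group comparison of the first phenotype at marker x
  res <- t.test(phenotypes[ind_in_1, 1], phenotypes[ind_in_2, 1])
  pvals <- c(pvals, res$p.value)
}
pvals
```

Note that `t.test` uses the Welch version (unequal variances) by default, and that it needs at least two observations in each group, which is one more reason to look at missing data first.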
So that is the reason why we use this minus log 10 transformation. By using the minus log 10 transformation, if you have 1 times 10 to the minus 5, it becomes 5, so it becomes really visible, and 10 to the minus 3 becomes 3. So you can see the difference between 10 to the minus 5 and 10 to the minus 3. In the untransformed plot you can see the difference between these values up here, because you know that this is like 1 times 10 to the minus 1, or 0.5 times 10 to the minus 1. But what these values down here are, you can't see, because they're all squished together near a p-value of zero.
So when I do the plot and, instead of taking the p-values, I now take the minus log 10 of the p-values, then all of a sudden I see this. I see that these values here, which used to be all the way at the bottom, are actually very significant. So it's 10 to the minus 7, 10 to the minus 15, 10 to the minus 16. Those are really, really significant differences. And of course, I can again add my line. If I want to add my significance line, which is 0.05 divided by the number of tests that I did, because I have about 120 markers, I can now just take the minus log 10 of that as well. And then I say that the color is green. Everything above the line is significant and everything below the line is not.
So what we see here is that these two regions that we now identify are significantly associated with the difference in our phenotype. We just found our first two QTLs ever. And now, of course, the next step would be to go to Ensembl and see where this peak crosses the line, because this is the interval where we expect the gene to be.
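To make the transformation and the threshold concrete, here is a small sketch with made-up p-values (in the lecture they come from the t-tests, one per marker). Dividing 0.05 by the number of tests is the Bonferroni correction.

```r
# Invented p-values for 6 markers
pvals <- c(0.8, 0.04, 1e-7, 1e-15, 0.3, 0.002)

neglog10p <- -log10(pvals)               # 1e-7 becomes 7, 1e-15 becomes 15
threshold <- 0.05 / length(pvals)        # 0.05 divided by the number of tests

# Markers whose transformed p-value lies above the transformed threshold
significant <- which(neglog10p > -log10(threshold))

round(neglog10p, 2)
significant
```

On a plot this corresponds to `plot(-log10(pvals))` followed by `abline(h = -log10(0.05 / length(pvals)), col = "green")`. Note that `0.04` would pass an uncorrected 0.05 cutoff but falls below the corrected line here.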
And then we would go to Ensembl, download all of the genes in that region, and then either use literature to filter down which genes there are, or use some other strategy, like looking at the expression of these genes to see if we can find a gene whose expression correlates with the metabolite that we're interested in.
Of course, this is a very basic plot, and it can be improved a lot. There's no header at all. Here it just says minus log 10 pvals, where it might be better to say LOD score. And this axis here is a little bit weird because the numbers are turned. But these are all little things that you can easily fix in R, and that just takes a little bit more time. The results won't change because of that. That's why I like R. It's really good for very quick, very basic data analysis and exploration.
All right. I think those were all of the assignments. So not a lot, but I hope everyone was able to at least get it working. I hope everyone found the assignments on time, and that people didn't look on Friday morning and think, oh, there are no assignments, so that's all fine, when I just forgot to upload them. So in case you are looking at Moodle on Friday morning and see that there are no assignments, send me an email so that I know that I should upload them. Sometimes I forget. After the lecture I generally go home and eat pizza, and then I tend to forget, when I come in on Friday morning, to upload the assignments for you guys. So do send me an email if you don't find the assignments. If there really are no assignments, you get an email back saying that there are no assignments, so just relax. And otherwise, I will upload them directly. Good. All right. So that was it for the answers to the previous assignments. Let's go back to the overview, because I wanted to do a shout-out.
And that's a little bit interesting, because you guys might remember me doing this drawing last week or the week before. I think last week, right? So last week, we had Pastor Sauras redeem some of his channel points and ask me to draw a puffer fish. Yesterday, when I was sitting in an office, one of my colleagues came walking in and she said, oh, my daughter watched your video and she got inspired and also wanted to draw a puffer fish. So this is the image that I got from her. I definitely got my ass handed to me by a 12-year-old who just made a much, much better drawing than me. I have it over here, so I can show it on cam as well. I'm really, really happy about it. So yeah, if you feel inspired by me drawing and think, I want to draw as well, then do. Just send it to the Humboldt University under my name. It'll be fun. I love getting these drawings. I was pleasantly surprised. I wasn't expecting anyone to be motivated by my drawing, but apparently someone was, and I really love it. It looks a lot better. The eyes are really good, and it's a lot better than my horrible, horrible puffer fish. So if you think you can do a better job, then definitely do, and definitely send it to me, because I love getting those things.
All right, back to the lecture. Polymerase chain reaction. So, Kary Banks Mullis, one of my all-time, all-time, all-time favorite scientists. And the reason for this is just his life story. He invented the polymerase chain reaction. I think everyone knows what PCR is for. If you feel sick at the moment, you do a home test, and the home test says, well, you might have COVID-19. Then of course you go to the Gesundheitsamt, or whatever it's called in your country, like the RIVM or the GGD, and then you do a PCR test, right?
And PCR tests are more or less one of the most sensitive tests that we have to determine whether a certain sequence of DNA or RNA base pairs occurs in a sample that you took. It is one of the most valuable techniques. It was invented by Kary Banks Mullis in 1983. And, like I said, it is one of those techniques that, after its invention, quickly revolutionized the world of molecular biology and also the world of bioinformatics, because using PCR we are able to determine things like genetic markers. For QTL mapping, which we just did, we need genetic markers, and so we need a way to determine them. Up until like 2010, I think, when we got KASP, so competitive allele-specific PCR, which is a kind of update of the PCR technique, the only real way of doing that was PCR. It is used in cloning, in phylogenetics, in gene analysis, and in genetic fingerprinting, like when police officers find a blood trace at a crime scene. And in 1993, Kary Banks Mullis got the Nobel Prize in Chemistry.
But the invention story is really interesting, because Kary Mullis was a big fan of LSD, having grown up in the 1960s and 70s with the big sexual revolution and all of the drugs that came with it. He was an avid user of LSD. And there's actually this really, really good book: he wrote Dancing Naked in the Mind Field, and this book is his autobiography, in which he explains how he came up with the idea for PCR. He has always claimed that PCR was given to him by aliens. So on LSD he had like a bad trip, but his belief is that he encountered extraterrestrials, and these extraterrestrials, according to him, looked like fluorescent raccoons, and they told him how to do PCR. So he doesn't claim any part of the invention for himself.
He just says, no, one night I was just sitting at home and encountered some extraterrestrials looking like fluorescent raccoons, and they told me to do this, and that it would be a revolutionary new technique. Very, very interesting guy. He is also one of those people who has a lot of controversy to his name. Take the first paper that he ever wrote. If you guys are interested in this paper, I have a copy of it, and I can put it on Moodle or on my website. Actually, I don't think I can put it on my website, because Nature is quite stringent about it, since it's one of those pivotal papers. But the first paper that he ever wrote was published in Nature in 1968, and it is called Cosmological Significance of Time Reversal. It is a paper in which he outlines his ideas about time and how time can flow in multiple directions. So not just forward, like we all experience every day; he believed that time travel was possible and that you could reverse time in several different ways. Besides that, he wrote like three papers about PCR, and then he wrote his book, Dancing Naked in the Mind Field. So if you are looking for a good book to read over the Christmas holidays, definitely pick up this book. It is an amazing book with all kinds of LSD references.
The paper is still on Nature? Yeah, but you have to have an account to get it, so if you're not at a university, you have to pay for it. So if anyone's interested, then I can just put it somewhere for you guys to download. It's a very short article. It's like a one and a half page read where he explains his ideas about time reversal. He's just a very, very interesting guy, and his whole life story is super interesting. He is generally credited as being the guy who invented PCR, although I think in 1999 courts overthrew his patent. The nice thing is, he got screwed massively by the company that he worked for.
The company that he worked for was called, let me look it up so that I don't say it wrong, Cetus. When he was working at Cetus he invented PCR, or not invented, because aliens told him that he should do this and that it was a good idea. And then he got a bonus. For inventing this massively important molecular technology he got a $10,000 bonus, which of course in 1983 was not a bad thing. But a couple of years later the company sold the patent to Roche, and they made 300 million just from selling the patent on PCR. So he got screwed over royally, until 1993, when he got a million dollars from the Nobel Prize committee, of course. But still, it's really interesting, including all the lawsuits that followed, because another company also wanted to buy the patent. They were unsuccessful, so then they sued Roche, saying that the patent wasn't valid because there was previous work. But he is generally considered to be the inventor. There are some other people who did a lot of preliminary work, but he came up with the thermal cycling for PCR. So a really, really interesting person, and definitely worth reading. I actually love Nobel Prize winners, because most Nobel Prize winners go a little bit crazy after they get a Nobel Prize, but he was already way out there in left field long before he started publishing and doing his big inventions.
All right, so polymerase chain reaction, let's get into it. What do we need? We need template DNA. We need a sample, so you need to take a paper cup from somewhere, steal a paper cup, and then see if it has COVID-19 on it or not. So you need template DNA, which is the DNA from which we want to amplify a piece. We need water. Water is an essential part of PCR. And we need to be able to do precise thermal cycling.
It used to be that the way people would do PCR was to have just three buckets of water, and these three buckets of water had different temperatures. You would take your cup with your substances in there, put it into the first bucket, and hold it there for like 30 to 40 seconds. Then you would take it out and put it in the second bucket, and wait again for like a minute. Then you would take it out and put it in the third bucket, and wait a little bit. And then you would take it out, and because we did not have heat-stable polymerase in 1983 yet, you had to add polymerase to all of your samples again, and then you started over. And that was the whole technique.
So we need heat-stable polymerase. Nowadays we always use Taq polymerase from Thermus aquaticus, which is one of those extremophiles that live in hot springs. We need nucleotides, so just free nucleotides, A, C, T and G, because the polymerase needs these to extend the DNA, so you have to have nucleotides in excess. And we need something called oligonucleotides, which we call primers. The whole lecture is about how to design primers and what a proper primer is. A primer is nothing more than an oligonucleotide, which is a little piece of DNA, around 20 to 22 base pairs long, which has a certain sequence, and this sequence matches the edge of the region that you want to amplify.
StockFishO22, thank you for following.
And of course, for PCR to work, you need an unlucky student, because as soon as you've got anything more than a master's degree, you're not going to want to do PCR. You just have someone else do it for you, because you're not going to wait around for it for three hours. Just have a student do it. Students can pipette. All right, so how does it work?
So I told you that you need three buckets of water, right, and the first bucket of water is supposed to be at 90 degrees Celsius. So you take your mixture, you mix all of these components together, your template DNA, your H2O, your Taq polymerase, your nucleotides and your primers, you just pipette all of this into a cup, and then you keep the cup at 90 degrees Celsius for around one minute. So what happens when you bring DNA to 90 degrees Celsius? Well, since DNA is a double-stranded molecule, when you heat it up the hydrogen bonds that hold the two DNA strands together break and the DNA opens up, so you get two single strands of DNA instead of a double helix. That takes around a minute, so if you have a cup with DNA and you heat it up to 90 degrees Celsius, then a minute later you have single-stranded DNA. The next step happens at around 54 degrees Celsius. This of course depends on the length of your primers, but generally 54 degrees Celsius is where it happens. Because you have all of these things in the mixture, when you cool it down to 54 degrees Celsius the DNA starts to rebind, but the primers are much smaller than the whole template DNA strand, so the primers will quickly bind to the DNA because they are just more efficient at binding. So the primers start binding to the DNA, and then the next step is the extension, which is the part where the polymerase binds the double-stranded DNA. Question: with RNA, do you skip step one? No, with RNA you do an additional step. You first do a cDNA synthesis step, so you take your RNA, then you use reverse transcriptase, which turns RNA into DNA, and then you just proceed like you would with DNA. That's generally the way that we work with RNA: just turn it into DNA and then
you can use all of the molecular techniques available for DNA. All right, so the third step. We've opened up the DNA, we then cooled it down a little bit so that the primers have the ability to bind, and then we heat it up to 72 degrees. What happens is that the polymerase binds the DNA and starts extending it base pair by base pair. It binds here and then it goes forward, and every time it moves a little bit forward it finds the right nucleotide and incorporates it, so it makes double-stranded DNA again. So instead of having one copy you end up with two copies. Well, you had two strands to begin with of course, but instead of having two at the end you actually have four. So PCR gives us an exponential amplification of the target region. Imagine that we have this little piece of DNA of interest, that's the red part. Here we have the whole chromosome and this is the part that we're interested in. After one cycle, instead of having one copy you get two copies, so two to the power of one. The second cycle gives you two to the power of two, which is four copies, and so on and so on. Generally when we do PCR we amplify using 35 cycles, and 35 cycles means that instead of having one copy of the piece of DNA that we want we have two to the power of 35, which is like 34 billion copies. This is a lot, and it is something that you can see on a gel. When you have a single copy you can't see anything, but when you have 34 billion copies of a piece of DNA and you run it through an agarose gel, then you can really see the DNA on there. Of course this is not exactly what happens. When we look at the first four cycles in detail: we have our template DNA, we make it single stranded, and then what happens is the primer binds, the
amplification starts, but in the first cycle we don't get any of the product, because the product is still too long. We go from five prime to three prime, DNA is always synthesized five prime to three prime, and the same thing happens with the other strand of DNA. But you can imagine that the polymerase doesn't know where to stop, so it'll just continue to copy and copy and copy. So in the first cycle we have no real product of interest. In the second cycle the same thing happens, because we now have this longer piece here, the green piece containing the little piece that we want, which was just amplified from left to right. Only in the third cycle, because now the other primer binds to it, and you need a primer for the forward and for the reverse direction, do we get the first two products which have exactly the length that we want. Then here in the fourth cycle we go up to eight, and from then on it is exponential again. So you need at least three to four cycles to amplify DNA using PCR. This is how we generally think about it schematically, but what happens in the real world is that since the polymerase doesn't know where to stop, the first fragment just continues copying until the end or until it cannot copy anymore, and then we have to wait for the next cycle for the reverse primer to go in. So it's a little bit more complex in detail. So what makes a good primer? These oligonucleotides, the 20 base pair long pieces of DNA, are what we use to target the sequence we want. Because a polymerase can only bind to double-stranded DNA, we need this little piece of like 20 to 25 base pairs which targets the DNA that we are interested in. So what is a good primer? Well, a good primer lacks secondary priming sites, which means that if I have these 25 base pairs of DNA they can only fit in one position in the genome and not multiple. It
also needs to have a melting temperature between 52 and 65 degrees, so we cannot make this piece too long, because the longer the piece, the higher the melting temperature, so the higher the temperature at which it binds to the DNA. If we took a primer which is 70 base pairs long, then the melting temperature would be around like 80 to 90 degrees Celsius. We have these three temperature points at which we do PCR: we start off with the denaturation, then we have the annealing, and if the annealing temperature is too high, say around 72 degrees, then of course the annealing doesn't happen properly and the extension already starts before the annealing part. So hey, we always need to make sure that when we design a primer it has a suitable melting temperature, and we will talk about what melting temperature is in detail, but it needs to be between 52 and 65 degrees. There also needs to be an absence of dimerization capability. That means that the sequence of a primer is not allowed to fold back on itself and make a little hairpin, and two primers are not allowed to stick together or stick to themselves. And then we always want to have slightly weaker binding at the three prime end. A primer goes from five prime to three prime, and the three prime end needs to be a little bit looser than the rest. So we generally want the primer sequence to end in an A, or AT, or TA, and not in GGG, because an A and a T are held together by two hydrogen bonds while a G and a C are held together by three, so a G binds much more strongly to a C than an A binds to a T. So when we design a primer we generally want it to have a little bit more A's and T's at the three prime end than it has
G's and C's, and this just has to do with the way polymerase works. When polymerase starts extending the DNA it needs to incorporate the nucleotides, and if you have a DNA strand which is tightly bound together then it cannot incorporate the first base pair. So hey, we have to have slightly weaker binding at the three prime end. So let's go through all of these. The first thing is uniqueness: there shall be only one target site in the template DNA where the primer binds, because otherwise this whole technique doesn't work, since we want to amplify a very specific part of the DNA. So the primer sequence shall be unique in the template. We cannot design a primer which occurs like 10 times in the genome. No, we are interested in a single gene, and we target that gene by making the primer long enough to bind only at this position in the genome. Furthermore, we don't have to consider just the genome that we're interested in. If we do amplification of, for example, mouse DNA, then me as a human researcher is working with this DNA, so when I'm just scratching my head a little bit, some of my own cells fall off. That means I have to make sure that the primer binds to mouse DNA specifically, so that the sequence I'm targeting is the sequence in the mouse genome and the primer is not accidentally also binding in my own DNA or in DNA of contaminants, because bacteria can get into your sample too. So you have to make sure that your primer is specific for the species that you are working on and that it has no binding site in any possible contamination source, like humans. And hey, you can use BLAST for this. We will talk about BLAST in a later lecture, but that's just an alignment
tool where you can take a little piece of DNA and see if it occurs once, twice or a hundred times in a certain genome. So the length of the primer has an effect on the uniqueness and on the melting temperature. The longer you make the primer, the higher the chance that it's unique. If I'm looking for 12 base pairs in the genome, then that probably occurs at multiple sites, but if I'm looking for 50 base pairs, then the chance of those 50 base pairs mapping to five or six different positions is very small. So hey, the longer the primer, the higher the chance that it's unique, but the longer the primer, the higher the annealing temperature. The annealing temperature needs to stay below 72 degrees, because 72 degrees is the temperature at which the polymerase starts extending, and before that temperature is reached the primer needs to have bound to the DNA. So we cannot make primers arbitrarily long. Generally speaking the length of a primer has to be at least 15 base pairs to ensure uniqueness, because if it's shorter than 15 it will start binding all over the place in the genome, and usually we pick primers which are 17 to 28 bases long. I would say that most commonly the primers that we design here in the lab are 20, 21 or 22 base pairs, so that's kind of the range. But hey, the longer it is, the higher the chance that it's unique, but also the higher the annealing temperature, so the higher the temperature at which it starts binding. The base composition of a primer affects this as well. Like I told you guys, if you have an AT pair then you have two hydrogen bonds, and if you have a GC pair you have three hydrogen bonds. So you can imagine that a primer which is like 80% GC binds much more strongly than a primer which is 80% AT, because it just has more hydrogen bonds. So hey, the best way is that you want to avoid AT
and GC-rich regions. If you design a primer, you want it to target a region where there's around a 50/50 AT versus GC content. The average GC content in mammals is around 50 to 60 percent, so hey, that will generally give you the right annealing temperature for ordinary PCR reactions and the appropriate stability. Of course when you're working with certain species of bacteria they might have 70% GC genomes, so the GC content of their genome might be much higher, and then you can always adjust the melting and annealing temperature to make sure that you can target the right area. So we generally want to keep the GC content between 50 and 60 percent, so 50% GC, 50% AT, but we can change this a little bit, and when we do we also have to change the temperature of the annealing step of the reaction. This is why we have something called the melting temperature. The melting temperature is defined as the temperature at which half of the DNA is single stranded and half of it is double stranded. It's like an LD50, the dosage at which 50% of the bacteria die and 50% survive. So the melting temperature is the temperature at which half of the DNA, genomic DNA, template DNA, is open, so single stranded, and half of it is still in a helical shape. The Tm is characteristic of the DNA composition: the more GCs you have, the higher the melting temperature, because there are more hydrogen bonds. If I want to calculate the Tm, that's actually relatively easy. If your piece of DNA is shorter than 13 base pairs, the melting temperature is just the number of A's plus the number of T's, times two, plus the number of G's plus the number of C's, times four, so Tm = 2 × (A + T) + 4 × (G + C). If you want we can do a little example, hey, just come up with a random DNA sequence which is
like 10 base pairs long, and then we can determine what the Tm is. So wA stands for the number of A's, xT stands for the number of T's, yG is the number of G's, and so on. When your piece of DNA is longer than 13 base pairs, you use this formula: 64.9 plus 41 times the number of G's plus the number of C's minus 16.4, divided by the total number of base pairs in your DNA strand, so Tm = 64.9 + 41 × (G + C − 16.4) / N. Now we can actually figure out where this 90 degrees comes from. The 90 degrees is the temperature at which template DNA is half open and half closed. So why is it 90 degrees? Well, very basically, if you look at a whole genome you have billions and billions of base pairs, so this term becomes massively big. It is around three billion base pairs for a human genome. In the human genome roughly half of the bases are G plus C, so it's like 50 percent. So what we get is like one and a half billion minus 16, divided by three billion, and that is around a half. That means that this whole formula for genomic DNA collapses into 65 plus half of 41, and 65 plus half of 41 is roughly 90 degrees, right, so it's 85.
something degrees, but that's where the 90 degree temperature comes from. And hey, if it's 22 base pairs then of course this formula doesn't collapse so easily, but if you're thinking about genomic DNA, then it very basically collapses to 65 plus half of 41, and that is the temperature at which genomic DNA is melting, so half of it is single stranded and half of it is double stranded. The annealing temperature is the temperature at which the primers anneal to the template DNA, and if I know the melting temperature of a primer, then I also know the annealing temperature. I can calculate the melting temperature using the formula, and then I can just say: the annealing temperature, so the temperature at which the primer binds to the genomic DNA, is the melting temperature minus four degrees Celsius. Four degrees below the Tm stuff starts binding together, because at the Tm half of the DNA is single stranded, so a little bit lower it starts becoming double stranded. So as a rule of thumb the annealing temperature is the melting temperature minus four degrees Celsius. All right, so if we want to design a good primer we have to make sure that no secondary structures can develop which prevent the primer from binding to the genomic DNA. If primers can anneal to themselves or to each other, then the efficiency will be dramatically reduced, because the primer, instead of binding to the template that we want to amplify, starts binding to itself, and of course that doesn't work. A hairpin is when a primer is able to fold back on itself. Here we see a hairpin: here we see a little piece of DNA, and then here we see that this GGAA is complementary with CCT, so the primer is able to bind back on itself, and of course at that point it
starts annealing with itself, so it cannot anneal to the template. We can also have a self-dimer, for example the forward primer binding to another forward primer in another configuration, because these pieces of DNA are just floating around in your cup, so we don't want them to be able to bind to themselves. We also don't want dimers between the two primers. We always need two primers, to specify from where until where we want to copy, so here we also want to make sure that the same sequence of like five or six base pairs doesn't occur in the forward primer and then its reverse complement occurs in the reverse primer. So when you design primers you have to make sure that these things don't occur, unless of course they only occur at like 30 degrees Celsius. The lowest temperature in PCR is around 58 to 60 degrees Celsius, so if the self-dimers occur at 42 degrees then that's not an issue. But generally you want to design primers which are not affected by secondary structure, so which don't bind to themselves, make hairpins or bind to other primers in the reaction. And of course these dimers become more and more important when you start doing multiplex PCR, where you're not targeting a single region but, for example, 10 regions in the genome at the same time. If you're targeting 10 regions you're using 20 different primers, so you then have to check that none of these 20 primers matches any of the other 19, and at that point you're not doing it yourself on paper, you're having a computer figure this out for you. So primers work in pairs: you have a forward and a reverse. Since they are used in the same PCR reaction, you have to ensure that the PCR conditions are suitable for both. That means that if I design a forward primer which is 22 base pairs long, I cannot have a reverse primer which is 60 base pairs long,
because then of course the annealing temperatures would be way too far apart. So one critical feature is their annealing temperatures, which shall be compatible with each other. There is a maximum difference, and this is not a rule set in stone but more or less a guideline: if you want your PCR to work well, then you make sure that the difference in annealing temperature between the two primers you are using is not more than three degrees Celsius, and the closer together they are, the better. So for the first hour I actually went through it perfectly, because it's 1:58. So the summary of the primer design criteria. One: make sure that your primer is unique, so that it only targets a single part of the genome, only a single binding site for each primer. The length should be between 17 and 28 base pairs. This range varies a little bit, but generally this is the range in which you want to design your primers. The base composition should be around 50 to 60 percent GC, so the number of G's plus C's should be similar to the number of A's plus T's, and you have to avoid long AT and GC stretches, so you cannot have a primer that is just AAAA. Question from a first time chatter. Hey, first time chatter. No one has to go, how long have you been coding? That would mean that I have to reveal my age, but I've been coding since I was four or five years old. When we got our first Commodore 64, when I was really, really young, I started coding directly, because at that point you just had little programs, so you had to modify the programs to do anything with them. But real coding, I think I started when I was like 10 or 11 years old, when I really became interested in writing stuff for computers. But let's finish the list. So number three is optimized base pairing to minimize false priming, right, so
like I told you guys, because of the way that polymerase works the primer needs to bind very tightly at the beginning, at the five prime end, but it needs to be a little bit less tightly bound at the three prime end, because when the polymerase binds and starts extending it needs to be able to add the base pairs, which means that this part of the primer should be a little bit loose. The melting temperature of the primer should be between 55 and 80 degrees Celsius. This is of course a long range, but generally you would say that it has to be like 56 or 57 degrees. Primer sets should have annealing temperatures within two to three degrees of each other. So if I have one primer, then the other primer which I use in the same reaction should have the same kind of temperature characteristics. We can't have one primer binding at 40 degrees and the other binding only at like 90 degrees, because then they won't work well together. And of course we want to minimize internal secondary structures like hairpins and dimers and these kinds of things. So that's more or less the theory. If you follow these steps, so if you take your piece of paper and your genomic sequence and you say I want to target this part, then you want to make sure that all seven of these criteria are fulfilled. For a human this is relatively hard, but a computer can easily optimize these things. So in the next part of the lecture we will look at some more advanced primer design, so what happens when we want to target not a single region but, for example, multiple regions, and then we will have a little introduction on how to use Primer3 to design a primer to test a certain hypothesis. So we will just go through all of the steps needed for primer design. Good, so that's it for the first hour. So everyone watching on YouTube, thank you for watching and we'll see you in the
next hour
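As a recap of the temperature rules from this hour, here is a minimal sketch in Python of how you could compute them. It uses the two melting temperature formulas from the lecture, the rule that annealing happens about four degrees below the Tm, and the three degree guideline for a primer pair. The function names and the example sequences are made up for illustration, and treating a primer of exactly 13 base pairs with the short formula is an assumption, since the lecture only says "shorter than 13" and "longer than 13". Real design tools use more refined models, so take this as a back-of-the-envelope check only.

```python
def count_bases(primer: str) -> dict:
    """Count how often A, C, G and T occur in a primer sequence."""
    seq = primer.upper()
    return {base: seq.count(base) for base in "ACGT"}


def gc_content(primer: str) -> float:
    """Fraction of the primer that is G or C (0.5 means 50% GC)."""
    counts = count_bases(primer)
    return (counts["G"] + counts["C"]) / len(primer)


def melting_temperature(primer: str) -> float:
    """Melting temperature in degrees Celsius, using the lecture's two formulas."""
    counts = count_bases(primer)
    n = len(primer)
    at = counts["A"] + counts["T"]
    gc = counts["G"] + counts["C"]
    if n <= 13:  # assumption: exactly 13 base pairs is treated as "short"
        return 2.0 * at + 4.0 * gc
    return 64.9 + 41.0 * (gc - 16.4) / n


def annealing_temperature(primer: str) -> float:
    """Rule of thumb from the lecture: annealing happens at Tm minus 4 degrees."""
    return melting_temperature(primer) - 4.0


def compatible(forward: str, reverse: str, max_diff: float = 3.0) -> bool:
    """Guideline: annealing temperatures of a pair differ by at most 3 degrees."""
    diff = abs(annealing_temperature(forward) - annealing_temperature(reverse))
    return diff <= max_diff


def copies_after(cycles: int) -> int:
    """Idealized doubling: one template copy becomes 2**cycles copies."""
    return 2 ** cycles


if __name__ == "__main__":
    # Hypothetical example primers, not taken from any real gene.
    forward = "AGCTTGACCTGAGTCATGCA"
    reverse = "TGCATGACTCAGGTCAAGCT"
    for name, primer in (("forward", forward), ("reverse", reverse)):
        print(name, len(primer), "bp",
              "GC:", round(gc_content(primer), 2),
              "Tm:", round(melting_temperature(primer), 2),
              "Ta:", round(annealing_temperature(primer), 2))
    print("pair compatible:", compatible(forward, reverse))
    print("copies after 35 cycles:", copies_after(35))
```

This only covers the temperature criteria. Uniqueness, hairpins and dimers need sequence comparisons against the genome and against the other primers, which is exactly the part you want a dedicated tool to do for you.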