All right, I'll get started right away. I have roughly 40 slides today. Let's see how we're doing at about nine. Sorry, about ten, or a little before ten, because if we're doing well on time I might very well skip the break and finish already at 11. But that depends on you too; let me know if you're too tired and want a break. I'm going to speak a bit about nucleic acids, and possibly about sequencing too. I didn't find any great slides on sequencing, but I figure I might want to try to explain how these modern sequencers work.

But before we do that: we had a bunch of these study questions from yesterday, and I also have a couple of slides about statistics that I prepared yesterday evening. So let's get started with the home stretch here. This is the last set of study questions I'm going to take you through.

Silent as the grave unless somebody starts. So I have this trick that I learned: at that point, I can wait forever. This is the trick that teachers use. What feels like an eternity for the teacher is roughly 20 seconds. One of the hardest parts as a teacher is simply to wait. I will wait 10 minutes if I have to, and eventually some student will speak.

Right. So if we think as physicists, we can start to talk in terms of statistics. We understand the distribution of states. But the important thing is that when you translate that to chemistry, you translate that to the delta G; you translate that to whether things happen or not. Of course, the way the physicist would say it is: how likely is it for things to happen or not? But in chemistry you very quickly get to the point where either it happens completely or not. If you had 0.1 percent of a molecule in a population, you wouldn't see it.
You would say that a hundred percent has reacted. So the really, really important part with free energies is that they are the connection between physics and chemistry. Things that happen in the lab happen because they lead to low free energy. If you're looking at a binding site and you have 10 different molecules, which molecule is going to be the one that binds? The one that has the lowest binding free energy.

This is also how one molecule can displace another. If you have a protein and you already have molecule A bound here, and then you have molecule B coming along, then unless you have an infinitely low binding free energy, which you never have, you're going to have some sort of equilibrium. At some point molecule A will release. That might just be once in a blue moon, but once in a blue moon on a molecular scale is like once per second. So once per second molecule A is going to dissociate, and then if molecule B has a better binding affinity, molecule B will bind instead. So it's all about equilibrium and statistics. But because this equilibrium happens on scales of micro- to milliseconds, possibly seconds, when you measure this in the lab it's going to appear as if, whenever molecule B is better, it always displaces molecule A.

Sure.
No. Yes, right: it's the law of mass action and everything applies, but at equal concentrations. If, say, molecule B is a couple of kcal better, you're going to need a lot of molecule A to compensate for that. And that's related to the stuff we talked about when it comes to drug design, right? If molecule A doesn't have a very high binding affinity, the problem is that you're going to need to eat five kilos of it to compensate for something that has better binding affinity. It's another one of those things: in practice you can't compensate by having more molecules, because you would need to add too much.

The other thing that I intentionally did here: I'm using a different concept, partly to confuse you and partly to get you used to it. I use something called affinity. What is affinity?

Right. High affinity means that something binds very well. Affinity is the way this is described in the lab, and you frequently measure affinity as a concentration, actually. So you might say a micromolar binder or something. And here's where it gets even more complicated: a picomolar binder has higher affinity than a millimolar binder, because you only need picomolar concentrations of it to push the reaction to the side where you have it bound. And then, to make things even more complex, when it comes to free energy the good free energy is the low one. It's not rocket science, but one needs to be a bit careful about words. Which of course no scientist ever is, because once you've spent a year in this, if you're sitting in a scientific talk and people talk about good binders, high affinity, low free energy, everybody kind of knows what you're talking about.
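As a rough numerical sketch of the vocabulary above: the standard textbook relation between a dissociation constant Kd and the binding free energy is delta G = RT ln(Kd / c0). This relation is not spelled out in the lecture; the temperature of 298.15 K, the 1 M standard concentration c0, and the function name are assumptions chosen for illustration.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)

def binding_free_energy(kd_molar, temperature=298.15):
    """Standard binding free energy (kcal/mol) from a dissociation constant.

    Assumes a 1 M standard concentration, so Kd is simply given in mol/L.
    """
    return R_KCAL * temperature * math.log(kd_molar)

picomolar = binding_free_energy(1e-12)   # high affinity, low Kd
millimolar = binding_free_energy(1e-3)   # low affinity, high Kd

# Lower (more negative) free energy corresponds to higher affinity.
print(f"picomolar binder:  {picomolar:.1f} kcal/mol")
print(f"millimolar binder: {millimolar:.1f} kcal/mol")
```

Under these assumptions the picomolar binder comes out around 12 kcal/mol more favorable than the millimolar one, which is the "high affinity, low concentration, low free energy" bookkeeping in one place.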
So it's okay to be sloppy with words. But in particular if you read papers, it's very easy to make mistakes here. High affinity means low concentration required and low free energy, and vice versa.

There is another concept that you might want to have heard of, and that's efficacy. What is that?

That's actually good. When it comes to efficacy we frequently don't talk about concentration. We probably should. So what is the difference between affinity and efficacy?

No. Efficacy, I would say, is the biological action it has. Remember the ion channels I showed you. A great example is what was done with mice: you make a mutation so that they're no longer sensitive to an anesthetic. What might happen is that you have a mutation where the molecule still binds, but for some reason there is no effect anymore, because you changed something else in the protein. So in some cases you might have good affinity, but affinity just means that molecule A binds to the receptor. It doesn't mean that molecule A influences the receptor. The efficacy is rather the actual effect you're having.

This has long been a problem. We have been very sloppy with this the last 10 years, because we have all kind of assumed that efficacy must be proportional to affinity: if you have something that binds better, it's also going to have a stronger effect. In general, that's not true.
Remember that we talked about these agonists and inverse agonists. And we are learning more and more, the ion channels being one example, that you have plenty of molecules where affinity and efficacy are not at all proportional to each other. So it's not enough to just use computational methods to optimize affinity; you would like computational methods to optimize efficacy, the biological effect. This very week there was a company that announced a new type of machine-learning-based screen where they optimize small molecules. I think the company is called Atomwise. They use machine learning networks to optimize the efficacy of small drugs rather than trying to optimize the binding affinity.

Yeah. So the free energy is always related to the affinity, but here's where things get complicated. Yes and no, actually. It's a great question; it's actually not quite that easy. I'll use my ion channels. If this is my ion channel, the affinity is always how strongly I bind to that site, say minus three kcal per mole or whatever. Only how strongly I bind to the site, period. That might be because there is a hydrogen bond being formed, or any small interaction that makes it favorable to bind there. But the next question is of course: what does this molecule actually do? Why does this molecule open the channel? If we take the same channel and look at it from the top, then we have these five subunits, right?
I drew them, and I tried to hint that in some cases the molecules tend to bind right here, between the subunits. So what now if you have two molecules: one molecule there with one small hydrogen bond partner, and then a slightly larger molecule with a hydrogen bond partner? In principle larger molecules usually have better binding free energies, but if we ignore that for a second: if you put something larger between these subunits, you will probably force the channel to open more, right? So it's not just that hydrogen bond, or whatever you have here, that's going to matter, but also how this will influence the two subunits. That's a secondary effect, but of course if you're interested in opening the channels, it's not just the hydrogen bond there that matters. It's the real biological effect that we care about.

But that is also a free energy. Normally, if you have, say, a closed state there and an open state there, the closed state has the lowest free energy and the open state has a higher free energy, right? But when you're binding something, you're now not talking about the binding free energy. This is the free energy of opening or closing the channel. The question is how much I am changing the free energy of the two states here. Did I make the open state a little better, or did I make the open state a lot better? This is also a free energy, but now you're not talking about the free energy of binding of your molecule. You're asking: when the entire channel moves between two states, how much did your small molecule change this free energy?
So efficacy is also a free energy, but this is a much, much more difficult free energy to measure. Here you can't just do a docking. You would actually need a simulation or a model of how the entire protein moves. So it doesn't mean that all our knowledge about free energy and physics and states is irrelevant; on the contrary. But this is of course why everybody, me included, has tended to sweep efficacy under the rug. Efficacy is orders of magnitude harder to understand. But long term you're going to be way more successful at designing drugs if you go for efficacy rather than just affinity.

If you have a bad affinity, you're not going to bind in the first place, so then efficacy is completely irrelevant. You need good affinity, and that's of course why all these computational screens are efficient in a way: unless you have really good affinity, you're lost anyway. So what you do in practice is this: some drugs may well have outstanding affinity, but if they don't have a biological effect, we will simply discard them in the screen. It creates a bit of extra work, but it's easier to do that extra work than to have a really good model of the protein.

Sorry, that was a long detour on number one. What is easy or hard to get in the lab versus, not necessarily simulations, but computer studies in general? So is that easy or hard?

Yes and no. In principle it's easy, but it requires something. What do you need? You're going to need some sort of assay, some sort of way of testing it. This is why we love working with ion channels, because as Reba showed you in the lab, you just see: do I have a current or not? If you don't have a current, the channel is closed. If you have a current, the channel is open. It's almost trivial. But in general, if you have a GPCR, how do you find out if one molecule is bound to another?
Well, you can probably devise an experiment for one molecule. But if you need to do this at very low concentrations for any arbitrary molecule, it's not entirely easy. The good thing is that lots of people have thought about this. There are standard so-called calcium screens and so on for lots of receptors, so there are dozens of fairly efficient techniques. The first thing you need to do is find some sort of screen to determine whether your channel or your receptor works or not. The second you can measure how much your receptor is working, it's just a matter of gradually adding more of the small chemical and measuring how the response changes, and then it's trivial to get the free energy. So when it comes to simulations, or computer stuff in general, what is easy or hard to get there?

Sorry, let's say that again. The hard part in the lab is that you're going to need some way to measure some effect on your receptor, right? In principle, getting the actual binding free energy is not entirely trivial, but normally we tend to approximate affinity with efficacy. As long as you have some way of measuring the effect, whether you're conducting ions or whether your receptor works, you can approximate whether things bind well or not.

Simulations, and in simulations here I include bioinformatics and everything: we can measure interactions really well. Getting entropy is hard. But we compensate for this by being sloppy and fast, and fast is of course the keyword there, because we are sloppy and because computers are very fast. You can never, in general...
You can't get close to the accuracy you have in the lab with computers. There are some exceptions, but the advantage with simulations is that you can afford to test millions of things. So the key thing is: go for throughput. And this is really the way modern experimental research is going too. You don't necessarily try to find the most accurate method, but you compensate by having tons of information. I'm going to show you later how you can use that. If there is one large trend in experimental science over the last 15 years, it is that we're going for high throughput.

So what are the two conceptual ways that you can calculate or estimate a free energy?

If we start with the second one, the point is that the free energy only depends on states. If you can find a way in a computer to mimic how you move the system from one state to another, and "move" could be that you pull a molecule, or that you change one ligand into another ligand, then as long as you can describe the change in the system in some way, if you just gradually and slowly make that change, you are doing the work in the computer, right? And the difference in free energy is just the amount of work you need to put into or pull out of the system to move from one state to another. The neat thing with computers is of course that you can cheat. The only thing that matters is the beginning state and the end state. In nature, all the intermediate states have to be realistic. In the computer...
They don't have to be. So you can think of this as alchemy in the computer: I can really turn lead into gold and calculate how much that cost me. Which is of course irrelevant if you do it in isolation, but imagine that you needed to find the way for a complicated protein to fold or something. I can go through states that are impossible on the way, if that makes it easier for me to do the calculation.

The other way is just based on counting, and that's really the same thing as with the dice: if you want to find out if the die is fair, just keep throwing it and tally the outcomes, right? This might not seem super smart, but there are some really cool things that we are hoping to do here in the future. Remember the cryo-EM facility I showed you. With X-ray crystallography you in principle only get one state, because you're going to get one crystal, and you're going to have billions of molecules in exactly the same conformation. Unless you have all those molecules in the same conformation, you're not going to have a crystal, and you're not going to see a beautiful diffraction pattern. But in the cryo-EM microscope we might in principle have tens of thousands of different molecules. So imagine I take an ion channel under conditions where it's going to be roughly 50% open and 50% closed. The channels are not going to be halfway open. You will have some channels that are open and other channels that are closed. So in cryo-EM we in principle have the entire Boltzmann distribution in all those micrographs.
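The counting idea can be sketched in a few lines. This is only an illustration under stated assumptions: the counts are made up, they are assumed to reflect equilibrium populations at a known temperature, and the function name is hypothetical. It uses the standard Boltzmann relation delta G = -RT ln(N_b / N_a).

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)

def free_energy_from_counts(n_a, n_b, temperature=298.15):
    """Free energy of state B relative to state A (kcal/mol) from populations.

    Assumes the counts sample an equilibrium (Boltzmann) distribution.
    """
    return -R_KCAL * temperature * math.log(n_b / n_a)

# Hypothetical particle counts: closed vs open channels.
equal_populations = free_energy_from_counts(n_a=5000, n_b=5000)
mostly_open = free_energy_from_counts(n_a=1000, n_b=9000)

print(f"50/50 closed/open: {equal_populations:.2f} kcal/mol")
print(f"10/90 closed/open: {mostly_open:.2f} kcal/mol")
```

Equal populations give zero free energy difference; a 9:1 ratio in favor of the open state makes the open state a bit more than 1 kcal/mol lower, under the assumptions above.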
So you can likely start to just count samples. If you have protein-protein interactions, with protein A, protein B, and then some AB complexes, then long term we will likely be able to find the free energy just by putting this under a microscope and counting how much A we have, how much B, and how much AB. Today we can't do it, but the technology develops so fast. So you can think of a cryo-EM microscope as a kind of Boltzmann detector. The structural biologists would cringe if they were here, but I like the picture.

So, one way of doing this work: if I have something bound to a protein, I can gradually pull it out of that binding pocket. That's probably the simplest way I can imagine, and then I just measure how much force I have to apply when I pull it out. What is the drawback with doing things like that?

It takes time, and it takes time because you have to pull extremely slowly. No matter how slowly you pull in the simulation, it's going to be very, very fast on a macroscopic scale. If I pull one nanometer per nanosecond, that's going to take an hour or so in a computer, but then you're really pulling this out at one meter per second, which is an extremely fast way of pulling something out. So you very quickly end up just generating friction. Take anything in water and pull it at one meter per second, right? You're going to feel that the water resists. So you're going to create friction and heat unless you pull extremely slowly. And that's why we end up with these free energy cycles instead. Did you understand them, or should we go through them again?

I'm going to do this in terms of the delta delta G values; it's the same thing. Actually, do you know what? Let me wait two questions. It's easier to understand with the delta delta G values, I think. So what's the free energy of solvation?
That's probably the simplest free energy you can calculate.

Right, and here that will correspond to an energy, whether it's advantageous or disadvantageous, just as we spoke about for the amino acids in the beginning of the course. The reason why this is important goes into the second question: any time you want to design a drug or something. It goes back to your good question at the start of the lecture: it depends on the concentration you have of A. If this molecule A should go into, say, a lipid binding pocket in a membrane, and the molecule is not hydrophobic enough, you might have a huge concentration of it in the cell, but that's not going to help you unless you have a huge concentration in the membrane, where it's actually supposed to go, right? But that's easy. How can you get a huge concentration of it in the membrane? Make it super hydrophobic. So what happens when you eat this chemical? It won't dissolve at all, right? So now it's not going to go into the blood in the first place. This is a problem. You can't have it hydrophilic, because then it won't go into the membrane. You can't have it hydrophobic, because then it won't go into the blood.

One way out is that some of these molecules are so amazingly potent that even if they are super hydrophobic, there will be a tiny amount dissolved in the blood, and that might just be enough. You have some fat in the body too. But the other thing you can do if you have an extremely hydrophobic drug is to create artificial micelles. So if you have your small drug, we create a micelle of small lipids around it. I'll draw the head groups first. Before you eat this we will build it into a micelle, so I will embed it in lipids. Because the lipid head groups are polar,
This is going to dissolve in your blood. At some point, when this goes into your cell, the micelle will likely fuse with your cellular membranes, and when it fuses, the drug will be delivered into your membrane. There is a particular company in Copenhagen... oh sorry, in Odense in Denmark, specializing in cancer therapeutics delivery this way, because you can make these micelles even fancier. If you have special properties on the surface here, you can ideally decide what type of cell you would like to target this micelle to, and then you can have a bunch of super poisonous chemicals in here that you would like to deliver only to a cancer cell. It's like liposomes. Liposomes are a slightly fancier version of it, where you actually have a full bilayer and everything. Most good things nature already uses. I won't have time to go into the liposomes, but if you want to do something really smart, in general start by thinking: what would nature do? Nature probably already does something when it comes to targeting or anything, and then try to borrow that idea.

So that brings us to these delta delta G values. I would say delta delta G is by far the easiest part, so let's start with that. What is a delta delta G? Why are there two deltas?

It's a change in a change in free energy. The first thing is: what is zero kcal per mole? We haven't really spoken that much about absolute free energies, right? It's difficult to talk about absolute free energies. So for almost all the free energies we have talked about, I have started by saying that you have a state A and a state B, and what is the change in free energy when you move from state A to state B? So the first delta is really that we are talking about changes between two states. And this change between the two states can be, for example, a binding free energy. Let's call it affinity.
So when you measure this in the lab, it will just be one number, like minus 4.3 kcal per mole or something. But it's pretty rare that we're interested in the absolute number, right? The typical thing we're interested in is comparing two things that bind. And now I'm going to erase that again, sorry. If I want to compare how much it costs to bind molecule X versus how much it costs to bind molecule Y, I can of course say that X has a delta G of, say, minus 3.4 kcal per mole and Y has a binding free energy of, say, minus 4.5 kcal per mole. Those absolute numbers are certainly great. We like to have them if we can, and they can tell us some extra things. But if you can't get the absolute numbers, knowing that Y has a lower free energy than X by 1.1 kcal per mole is really useful. And that's where you can say that the delta delta G from X to Y is minus 1.1 kcal per mole. So conceptually this is not a hard number; it's a very easy number. The point is that you're looking at relative changes in binding free energy when you're comparing different molecules, or different solvation free energies, or something.

And this turns out to be a blessing when you want to calculate things too. I'm going to use a different example. Let's assume that these are not small hydrophobic molecules, but ions, charges. The absolute solvation energy of something with lots of charges can be gigantic. Let's say that X has a binding energy of minus 550 kcal per mole. Actually, let's do that even better; I'm going to be proper here: minus 550 plus or minus 10 kcal per mole. And Y has a delta G of minus 555 plus or minus, let's say, 12 kcal per mole. What is the difference between X and Y?
Well, we can't say, right? We have no idea, because the standard error here is larger than the difference between them. So you have no idea which one is the better binder. In a second we're going to talk a bit about these standard errors. Of course, experimentally we might know this, but this is what you would get in a simulation if you had two very large free energies. The problem here really is that you end up calculating a small difference between two very large numbers. And actually the relative error you make here is great, right? That's about 2%. That's a very small error.

So what if you could calculate the difference from X to Y directly? The difference from X to Y would be in the ballpark of five kcal per mole, so 100 times smaller. With the same relative error, your error would then be in the ballpark of 0.1 kcal per mole, and we would have a super accurate difference between them. The reason we get ourselves into trouble is that I first tried to calculate the absolute number for X and then the absolute number for Y. If I could just go straight between them, without paying and getting 500 kcal back, I would get a much better estimate. So rather than first going down and then going up, I would like to go directly between them, and that's where we use these so-called free energy cycles.

So let's say I have just molecule X there and just molecule Y there, and then I can also have a large receptor where molecule X is bound and a large receptor where molecule Y is bound. Actually, I have the receptor in both of the first cases too, but since the receptor is not interacting with the molecule there, we can kind of ignore it. So this is one case, right?
How much does it cost or gain us when we bind X? And that is the other case: how much does it cost or gain us when we bind Y? The difference we want to calculate is the difference between those two legs. But here you're really going to need to pull X away from the receptor, or pull Y away from the receptor, or grow X or Y in the receptor. And X or Y could be a large protein. Imagine that this is a 300-residue protein that has one small charge that changes. You can't grow a 300-residue protein. It's impossible. But if the difference between X and Y is just one amino acid or something, making that change is going to be trivial. I can calculate that in a couple of hours. And the same thing here: here I have the protein out in water, and changing one amino acid in a protein is trivial. I can do it on my laptop. So that is trivial and that is trivial; that is hard and that is hard. But we know that if you go the entire lap around here, the total is zero. So the difference between the top and bottom legs must be the same as the difference between the left and right legs.

Okay, let's do the math. The reason why I don't give you the equations here is that it's easy to make a mistake, because it depends on the direction of the arrows, right? So rather than memorizing that it is one minus two equals three minus four or something, if you end up with this, draw the arrows yourself and then decide. So let's call this arrow delta G1, and let's call this one delta G2, and this one is delta G3, and this one is delta G4. If I go in the direction of the arrow the sign is plus; if I go in the opposite direction the sign is minus. Agreed? Good. Then we start up in the top left corner. If I go along the top leg, the change in free energy is delta G1, right?
I just go there. Then I go down, so I go in the opposite direction of delta G2, and it's going to be minus delta G2. Then I go in the opposite direction of delta G3, so it's again minus delta G3. And then I go in the direction of delta G4, so plus delta G4. Now I'm back where I started, and since the free energy only depends on the state, this sum by definition has to be zero. It can't be anything but zero. If it was negative, I could just keep going around this circle and I would gain an infinite amount of energy. If it was positive, I would just go in the other direction and gain an infinite amount of energy. That would be a perpetuum mobile, which is against the laws of thermodynamics.

So now I can just decide to move some things over. You can play around with these and move them anywhere you want. I will keep delta G1 and delta G3 on one side, so I have delta G1 minus delta G3, and then I add delta G2 and subtract delta G4 on both sides. So the difference between delta G1 and delta G3 is the same as the difference between delta G2 and delta G4.

And again, I can't repeat this enough: it would be great if there was a standard for how we draw these arrows, right? There isn't. So don't for a second try to learn this formula by heart, because that's super dangerous. Suddenly somebody is going to do a simulation in the other direction and all your numbers are going to be completely off. Every single time I do this, even after 15 years, I sit down, I draw the cycle, and I solve that small equation to make sure that I get the arrows right.

If you pull it away? Yes. You can define this any way you want, right? Free energy is reversible, so this is just a matter of how you define the process. If you use the arrow in this way, it would correspond to gradually growing the molecule.
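The bookkeeping for the cycle, with the arrow convention used here (plus along an arrow, minus against it), can be written down as a small sketch. The numbers for the two cheap legs are hypothetical, as is the assumption that delta G2 and delta G4 are the cheap legs and that their errors are independent; only the 550 plus or minus 10 and 555 plus or minus 12 values come from the example in the lecture.

```python
import math

def cycle_sum(dg1, dg2, dg3, dg4):
    """Walk the cycle: along dG1, against dG2, against dG3, along dG4."""
    return dg1 - dg2 - dg3 + dg4

def difference_with_error(a, err_a, b, err_b):
    """a - b, with independent standard errors added in quadrature."""
    return a - b, math.hypot(err_a, err_b)

# Route 1: subtract two large absolute numbers (values from the lecture).
abs_diff, abs_err = difference_with_error(-555.0, 12.0, -550.0, 10.0)

# Route 2: use the cycle, dG1 - dG3 = dG2 - dG4, with two small legs.
# These leg values and errors are made up for illustration.
cyc_diff, cyc_err = difference_with_error(-2.0, 0.1, 3.0, 0.1)

print(f"absolute route: {abs_diff:.1f} +/- {abs_err:.1f} kcal/mol")
print(f"cycle route:    {cyc_diff:.1f} +/- {cyc_err:.2f} kcal/mol")

# Any consistent set of legs must close the cycle (sum to zero).
print(cycle_sum(dg1=4.0, dg2=-2.0, dg3=9.0, dg4=3.0))
```

With a different arrow convention the signs in `cycle_sum` change, which is exactly why the cycle should be redrawn and re-solved each time rather than memorized.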
Normally, I would never try to pull it away. I would probably try to grow it inside the binding site or something, but that's not entirely easy either. Of course you can just reverse the sign there and reverse the sign there; you just reverse the sign on those two, and that works fine. So, any time you're unsure: don't be afraid of equations. Define your own equations. The absolutely worst thing that can happen if you make a really stupid definition is that you end up with some extra minus signs here and there. That's not the end of the world, as long as you stay careful and don't forget those negative signs. This is not limited to free energy; there are lots of examples in experiments and everything. When you're done with this set, define an entire cycle and solve it some other way instead.

We spoke a bit about potentials of mean force. You might just want to have heard about them, because it's related to the lab you did yesterday. The potential of mean force is something that you could get roughly from this: if I gradually pull this away and just integrate the force as I'm pulling, I would get a free energy as a function of the pull coordinate. That would describe how much energy I gain or lose when I'm pulling it away. That's also related to the lab you did yesterday: when you're pulling this small protein through the membrane, you could calculate a potential of mean force that shows how the free energy changes along the path. We're not really interested in that in this case, but suppose you would like to know the free energy as a particular amino acid is coming out of the ribosome. If you would like a free energy as a function of some coordinate, then a potential of mean force is a pretty neat way of getting it. And the last two things...
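As a minimal sketch of "integrate the force as I'm pulling": given the average applied force at a series of positions along the pull coordinate, a cumulative trapezoidal sum gives a free energy profile. The positions, force values, units, and function name below are all made up for illustration, and the sign convention (the force is the externally applied pulling force, so the integral is the work done on the system) is an assumption.

```python
def pmf_from_mean_force(positions, mean_forces):
    """Cumulative trapezoidal integral of the mean applied force.

    positions:   pull coordinate values (e.g. nm), increasing
    mean_forces: average applied force at each position (e.g. kcal/mol/nm)
    Returns the free energy profile relative to the first position.
    """
    profile = [0.0]
    for i in range(1, len(positions)):
        step = positions[i] - positions[i - 1]
        avg_force = 0.5 * (mean_forces[i] + mean_forces[i - 1])
        profile.append(profile[-1] + avg_force * step)
    return profile

# Hypothetical data: force rises, then falls back to zero once unbound.
positions = [0.0, 0.5, 1.0, 1.5, 2.0]
mean_forces = [0.0, 10.0, 20.0, 10.0, 0.0]

profile = pmf_from_mean_force(positions, mean_forces)
print(profile)
```

Each value in `mean_forces` would itself have to be an average over many simulation steps at that position; a single instantaneous force is far too noisy to integrate.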
Yes. So what's going to happen? Let's assume that we simulate this, and here on this axis you have the coordinate x, or, well, time might be better, and this is the force. If I have a simulation and I attach a small probe to my protein, what's going to happen? This builds on statistical mechanics, right? In addition to the force that I'm pulling the protein with, the protein will constantly be hit by lots of water molecules. Things are going to move. There is noise; you have finite temperature, which just means that atoms will bump into each other all the time. So even if I don't move my probe, if I have it completely fixed and haven't even started to pull yet, the instantaneous force... sorry, the average force should be zero then, right? On average the molecule should not move. But if you look at this in a simulation, the force is going to look something like that. You have lots of values up and down. On average it's going to be zero, but if you just pick one of those values, it can certainly deviate.

If I now start to gradually pull this, I might know that the average force should maybe go something like this as a function of the coordinate. But since it is going to fluctuate so much, you need to average over tens of thousands of steps in a simulation to get these average values at each point. Imagine you're pulling a protein: all the side chains will move, right? Depending on how the side chains move, things will bump into each other. Occasionally the protein might even move in your direction for a couple of steps. So this is just your way to average. You can either think of this as averaging out the noise, or as averaging over all the possible microstates of the protein, so this is very much related to the entropy.

Yep. So the block averaging is more general.
I'll come back to that; I have some special slides about block averaging. It has to do with any stochastic property. The particular reason why you need to average this in a simulation is that if the protein was a completely rigid ball, if you didn't have any water, if you didn't have any interactions with anything else, and I just started to pull it, there wouldn't be any deviation in the force, right? If you had it at zero kelvin, nothing else would move. I would only need one single step, and I would know what the force is. But in a real system, because all the atoms move, they move between different microstates. You're going to have fluctuations all the time because you're sampling different states, and we need to include that sampling of the entropy. That's why we need an actual simulation to calculate the force. Remember the simulation you did yesterday. Did you look at the structure of the lipids when you pulled the protein through? It was pretty chaotic, right? Things all over the place. So if you now just measure the force on that protein, it's not going to be a fixed value; it will fluctuate wildly. But if you keep the protein fixed at some position and average this over 10,000 steps, you would likely be able to calculate a pretty good force. Now, the way you do that has to do with block averaging, but block averaging is just a method we use to calculate both the mean and the statistical errors. I'm not going to answer the last two questions here. I prepared a bunch of separate slides on the statistics, because this is something that is important for you for the rest of your careers. Errors. Hopefully you're familiar with averages, right? You all know what the average or the mean value is. So what's the difference between the mean value and the average?
We use them interchangeably; the mean value is the same as the average. If you go into really hardcore statistical mechanics, you end up with the concept of whether a system is ergodic or not, and that has to do with an average over time versus an average over an ensemble. An average over an ensemble is this: if you're putting a sample in an NMR machine, you have billions of molecules, so even if you measure just once, that's going to be an average over billions of molecules. In a computer simulation like the one you did yesterday, you only have one molecule, but you average over time, right? In general, if you have enough molecules and enough time, these averages should be the same, and then the system is said to be ergodic. That's not related to this course. There are some very esoteric examples in physics where that's not the case, but in general these are the same, and averages and mean values are exactly the same. Are you also familiar with standard deviations? Okay, maybe, maybe not. I'll go into the standard deviation in a second. The reason why these things get more complicated is that there are many very similar concepts that we all tend to confuse a bit when we start talking about this, and that's why I'm going to take this from the beginning. So imagine you have a perfect coin. If you flip this coin enough times, you know that eventually 50% of the flips will be heads and 50% will be tails. You will go towards that value, right? But the problem is, if you flip this 19 times, there is no way you can get 9.5 heads. It's impossible. So there are two kinds of averages here. One is the average that we hope to reach; in this case, if we know that the coin is balanced, we know that this should eventually be 0.5. But we also know that we can't get exactly that. Assuming that you get, say, 10 out of 19, that's a bit above 0.5, call it 0.55 or so. That's also some sort of average, right?
So 0.55 is the value we got. That's the average we got when we flipped this 19 times; 0.5 is kind of the long-term average that we expect to get to. So 0.5 is the property of the coin, and 0.55 is kind of an outcome of our measurement. Both of them are important numbers. They're not the same. Initially you tend to use them interchangeably, but they're not. The easy part here is the average, or mean. The average or mean is the second of these, the 0.55. If you have a set of data points, whether that is one, three, or five hundred fifty billion nine hundred sixty-five thousand nine hundred forty-nine, the average is just the mean value of all of them. Whether you calculate that with Excel, a computer, or manually, I don't care, but if you just have those data points, you can calculate an average. That might or might not be representative of the coin. If you throw a coin a thousand times, in theory, even if it's a perfect coin, with a probability of 0.5 raised to the power of one thousand you're only going to get heads.
That's a very small number, but it's not zero, so that's a valid outcome. The average is always the stuff you see in the lab. Any time you make a measurement, you have an average. Even if it's just a single number, that's probably just because it's an average the machine calculated for you. In any experiment or simulation you're going to be calculating averages; if you don't calculate averages, you're not really observing anything. The extreme example: if you just flip the coin once and calculate the probability of heads, the average is going to be either 0 or 1.0. It can't be 0.5. It's an average of one sample, which is kind of meaningless to talk about. So, if we have agreed that the average is always what we use to describe the outcome of an experiment, then for the other stuff I had on the previous slide, the property of the coin itself, the property of the random process, we're going to need some other way to describe that 0.5 that we should eventually get to. That's what you call, in mathematical statistics, the expectation value. We expect that the probability of heads should be 0.5. So if you take an infinite number of samples, 50% of them should be heads. That's a property of the process. That's a property of the coin: if the coin is perfectly balanced, it should be heads 50% of the time. It doesn't mean that you will get that outcome, but if the coin is perfect, that is a property of the coin.
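The distinction between the average you measure and the expectation value of the coin can be tried out with a few lines of Python. This is only a sketch under simple assumptions: the coin is modelled as a pseudo-random number generator with probability 0.5 of heads, and the function name and the flip counts are made up for illustration.

```python
import random

def heads_fraction(n_flips, rng):
    """Average of n_flips fair-coin outcomes (1 = heads, 0 = tails)."""
    return sum(rng.random() < 0.5 for _ in range(n_flips)) / n_flips

rng = random.Random(1)
for n in (1, 19, 1_000, 100_000):
    average = heads_fraction(n, rng)
    print(f"{n:>7} flips: average = {average:.4f} (expectation value: 0.5)")
```

With one flip the average can only be 0 or 1, and with 19 flips it can never be exactly 0.5; only with many flips does the measured average settle close to the property of the coin.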
Yep? But the point is that we ignore that, because the coin is 50/50, right? It's a model. Remember, models are always wrong but useful. Expectation: the other way you can describe this... yes, we'll get back to that in a second. You can certainly calculate the median too; actually, I don't have a slide on that. Let me first describe what a random variable is. Think of sine: sine is a function that, given an argument between zero and, well, anything, gives you a function value from minus one to plus one, right? So you put in something and then you get a result. These processes are really different, but it's actually fairly useful to think of a coin as some sort of random process or random function or even random variable. This coin is essentially a random variable that is either zero or one, and again, the probability of getting heads is 0.5 and the probability of getting tails is 0.5. So any time you have a sort of stochastic or random process, you tend to talk about that as a random variable or as a random function, and in this case X is kind of my random process. So E(X) is the expectation value of this random process, and if we ask for the expectation value of heads here, that would be 0.5. This is just a way of describing it. I don't care whether it's a coin or a die or anything, right? It's just something that I'm getting a number or a property from. This is usually the way you describe random processes in mathematics. I haven't forgotten your questions. I'll introduce the standard deviation first, and then it makes more sense. What do you normally do? You do this all the time.
So in an experiment you do not in general know the expectation value. In how many cases do you have a coin that you know is perfectly balanced? You're going to start flipping it to test whether it's balanced or not. There are not a whole lot of experiments you do where you know the answer; you're doing the experiment because you want to get to the answer. So in general, we don't know the expectation value. That's kind of what we are interested in. And in all cases, I can't even think of any obvious exception in the lab (sociologists are different), we tend to approximate the expectation value with the average. That's an approximation. Is this a good approximation? It depends, right? If you flip your coin once, it's not a good estimate. So the quality of this estimate is going to depend, for instance, on the number of samples. Let's assume that we're going to calculate the average income of people in a room. Let's pretend that this axis is income, and it goes from zero to, let's say, one billion dollars. Normally, most of us would have an income down here, right? But what if you have, say, one person here, two people there, one person there, three people there, and then you have one venture capitalist in the room? So you're going to say: all the students are so great. We have one, two, three, four, five, six, seven, eight people, and the average here might be, say, 150 million dollars. They are filthy rich, all of them. So that's... hmm. Yeah, the expectation value.
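The room example is easy to check numerically. The incomes below are hypothetical: seven modest values plus one venture capitalist whose income is picked so that the average lands near the 150 million dollars quoted above. The median, which is discussed next, is included for comparison.

```python
import statistics

# Hypothetical yearly incomes in dollars for the eight people in the room.
incomes = [10_000, 20_000, 20_000, 30_000, 40_000, 40_000, 40_000, 1_200_000_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(f"mean income:   {mean_income:,.0f} dollars")
print(f"median income: {median_income:,.0f} dollars")
```

The mean is dominated by the single extreme value, while the median sits between the fourth and fifth person and barely notices the venture capitalist.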
So there are two problems with this. The mean is probably not a particularly good property here. One alternative is to try to ignore billionaires, but the problem is that you can't go in and cherry-pick your samples after you've seen them, because that leads to other biases. If you just ignore that part, what if there are nine billionaires in the room? At some point it starts getting important. So what you tend to do is pick the median instead. Rather than taking averages of the monetary value, you would say that there are eight people, so in this case it would be between person four and person five; the median income would be there. So the median is a way to try to avoid the influence of extreme values. All this has to do with the shape of the distributions, and that's something I don't have on the slides. How is income divided in the world? It's actually a really interesting sociological question. The historical way income was divided is that you had a bimodal distribution: you had a lot of people who made close to zero dollars and then a small fraction that made a lot. This is the 1%. This is still very popular politically. This is actually changing; income is moving much more towards a unimodal distribution. In most of the world, you're starting to have the middle class being the most important part. But of course, if I now draw another distribution there, the average of this is probably roughly the same, but they're completely different distributions. So at some point you would like to describe this. You can think about more physical things too. We talked about protein folding, right? And then you had a probability that went something like that: the probability of not being folded yet. That comes from a Poisson process, and radioactive decay follows the same pattern. You could also have a uniform distribution.
That is just flat. So in general there's an infinite number of different statistical distributions. When it comes to flipping a coin, you have a flat distribution: it's equal probability of any outcome. If you really want to understand what you're doing, you have to understand the underlying process so that you can say what the statistical distribution is. But there is a very nice property in statistics called the central limit theorem. Do you know what that is? If you just add enough samples, so as the number of samples goes to infinity, any distribution, if you just sum the samples up, will go towards a normal distribution, a Gaussian. That is a very important result; I might have mentioned this before. So how many samples do you need before we can approximate this? And how much is a lot? Yes? Skip the thousand: ten. Of course, in principle, for this to be strictly true the number is infinity, right? But if you take most ordinary distributions, by the time you get to ten you will start to see the shape. This happens way sooner than you think, and that holds whenever I measure something. Yes? If you have some sort of complicated data and you have no idea what the underlying distribution is, cut it into ten pieces and calculate the standard deviations; you're going to do pretty well. So infinity comes much sooner than you think. Sorry, if you have...? No, but again, it's not going to be perfect, right?
I'm certainly not saying that it's perfect. But by the time you start adding things up, you will get to this much quicker than you think. Of course, you can always have very special distributions, but the point is that most distributions in biology are not special; most distributions tend to be fairly flat. I'm sure that both you and I can find examples where you might need a hundred or something, but in general there is no way you need ten thousand. Because of this property, we tend to use all these laws that are formally only valid in the limit of infinity. We virtually always apply them. The worst thing that can happen is that we over- or underestimate our standard error bars ever so slightly. Is that good or bad? Well, it comes down to the model thing, right? If you only have ten samples, of course it would have been nicer if you had ten thousand structures to calculate your average from, but we can't do ten thousand structures. That's impossible. So now we have a choice. We have ten structures. Do we try to provide some sort of statistical estimate of the accuracy of our results, or do we ignore it completely? Providing something is a hell of a lot better than ignoring it completely. So it's a model. It's wrong, but it's useful. The first thing we're going to need to decide, and that very much relates to what we just spoke about, is: what is really the spread in this? Here the die is really a bad example, but if you're measuring, say, the height of people, do we only get numbers between 180 centimeters and 182 centimeters, so you have a very narrow distribution, or do we get numbers between
150 centimeters and, say, 2.3 meters or something? The average is going to be the same, but the width of the distribution is going to be extremely different. The way you measure this is with the standard deviation. Here it gets a bit tricky, because formally there are slightly different definitions of the standard deviation of a distribution versus the standard deviation of a sample. I'm going to stick to the sample from now on; I'll tell you what the difference is in a second. The standard deviation just quantifies the amount of variation of the individual data values in a set. Whether you're measuring the height of one person or a billion persons, it's going to be the same, because this is some sort of property of the underlying distribution, right? If I take the height of one man, he might be there, there, there, or there. If I then pick the height of another one, again ignoring any systematic deviations, say measuring men versus women or something, we expect any individual sample to follow this distribution. So if I just look at one sample at a time, it doesn't matter if it's the first sample or the 954th sample; the property is going to be the same, and the fact that I might get something out down there, up there, or down there is the same. The way you define this is... well, when we talk about the deviation from the average, we don't really care about what the average is. You calculate the root-mean-square difference around the average. So we take each sample... oh, sorry, there should be a sum there; I forgot that. For each sample, we subtract the average and then we square the deviation from the average. I do that for each sample.
I sum them, and then I divide by the number of samples minus one, and then I take the square root of it. If you don't take the square root, you're going to end up with the variance. The variance is quite useful mathematically, but if you're talking about heights, the variance is going to have units of length squared, and it's a bit conceptually difficult to think about what that means. By taking the square root, I now get something that actually tells you in centimeters roughly how much I would expect it to vary. So instinctively it's much, much nicer to think about the variation in the same units as the measurement. Sorry, say that again? The sum should be outside there. I do this for each sample: I subtract the average, and then I square it, and then I sum that over every sample. For the formula, I'll write it like that: sigma is the square root of the sum over all samples i of x_i minus the average, squared, and again, this is all inside the sum, and then I divide by the number of samples minus one. You might have seen this without the minus one, right? Hopefully you don't want to spend the next three days with me going through all the math here. If you only have n here, you get the standard deviation of the distribution, and this has to do with the fact that... no, I won't even go there. If it's n minus one, it's the standard deviation of the sample; if it's n, it's the standard deviation of the distribution. If you have a million samples here, it's not going to matter, right, in the limit of an infinite number of samples.
They're going to be the same. But when you only have very few samples, it matters. This has to do with the number of degrees of freedom: we are also calculating the mean from the actual samples. So if you have ten samples, we calculate the mean from the ten samples, and then we really only have nine degrees of freedom left. So for small numbers of samples this starts to be a tiny effect. But this is going to be the difference between something divided by the square root of nine and something divided by the square root of ten. If you look at any numbers, and I'll show you some standard errors, it's not going to matter in typical plots. So in this case you can just ignore it for now. I promise that I won't grade you down if you use n instead of n minus one. Here is an example of, I think, actual measurement data: height in inches for a sample of a hundred adult males. Do you see the shape? Okay, it's not ten. I cheated, it's a hundred, but it's certainly not ten thousand, and you can all recognize, I hope, that this is starting to be a Gaussian shape. You don't need a whole lot of samples before you start seeing the Gaussian of the central limit theorem. But of course, if you only have a small number of samples... technically we would have something like 0.01 samples out here. You can't have that, because these are discrete samples: either I have a person or I don't have a person. But with each of these measurements I make, this will not become narrower. If I go to ten thousand adult males instead, this distribution would not get more narrow; it would just become smoother, right?
And I would get more data. But it's not going to become narrower, because you still have some very short and some very tall males. So no matter how many people I add, you would have a smoother curve, but there is still an inherent variation in the sample and the distribution, and that's what we describe with the standard deviation. You might be able to calculate the standard deviation more accurately with more samples, but there's not going to be a trend that it shrinks or grows. It's a property of the process. It's a property of the random process, of the random variable: how much is it natural for an individual sample to deviate from the expectation value? In this case we don't know the expectation value. It is going to be somewhere here, right? But we don't know exactly where it is, so here it's probably going to be pretty decent to approximate it with the average. If you're really concerned about the standard deviation, as in your example with the billionaires, and you really would like to use the median, you can use the median instead of the average there. If we're doing this for the distribution, you should calculate this around the expectation value; if you're doing this... I won't go there, but basically you can make this as complicated as you want. There is a reason why there are entire university programs in mathematical statistics. Yes? So, the Gaussian: if you take any random variable, no matter what its shape is, if you add up enough samples it will eventually be Gaussian, and eventually means infinity. But my point is that infinity happens relatively early; in this case, at around 100 samples, we start to see it.
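The claim that infinity comes early can be tried out in a small sketch. The setup is an assumption chosen for illustration: draws come from a flat distribution on [0, 1), and the check used is the fraction of values that fall within one standard deviation of the mean, which is about 0.683 for a Gaussian and about 0.577 for the flat distribution itself. The function names are made up.

```python
import random
import statistics

def averages_of_flat_draws(n_added, n_trials, rng):
    """Return n_trials averages, each taken over n_added flat draws on [0, 1)."""
    return [statistics.fmean([rng.random() for _ in range(n_added)])
            for _ in range(n_trials)]

def fraction_within_one_sd(values):
    """Fraction of values within one standard deviation of their mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return sum(abs(v - mean) <= sd for v in values) / len(values)

rng = random.Random(7)
for n_added in (1, 2, 10):
    averages = averages_of_flat_draws(n_added, 20_000, rng)
    fraction = fraction_within_one_sd(averages)
    print(f"averaging {n_added:>2} draws: {fraction:.3f} within one SD (Gaussian: about 0.683)")
```

Already when ten draws are added up, the fraction is close to the Gaussian value, even though each individual draw is still flat.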
This is the beauty of the central limit theorem, and this is why we care so much about normal distributions. It doesn't matter what the underlying random variable is; if you just add up enough samples, it will eventually be Gaussian. Well, but if you're looking at an individual, say... exactly, so the distribution itself won't become Gaussian. Sorry: if you measure samples and the samples come from other distributions, say I'm calculating an average of how many atoms have decayed at a certain point in time or something, then the average of the samples will eventually become Gaussian, but of course I'm not changing the property; the underlying distribution is the same. In the interest of time, I won't go too much into the other distributions, but this brings us to another point. Sorry, I'll go back to this one. You can all calculate the average here and use that to estimate the mean, right? And this is what you're going to do all the time in the lab. But the question is: how accurate is that estimate? In this case you might say that it's, well, 68 to 72 inches; you probably know the average to within four or five inches here, right? But I wouldn't say that you can tell to within one inch what the expectation value of this process should be. So we can estimate it, but there is some sort of inherent built-in error in our estimate. If you only measure one individual, that error is going to be infinitely high, because we don't know whether this was an outlier there or an outlier there from one sample.
We can't say anything. But of course, if you had a billion samples here, you could probably make a very accurate estimate of the expectation value. And this is kind of where we want to get in the lab. You're estimating expectation values any time you're measuring something. Ideally you would like to have a good estimate, but you would at least like to know how good or bad your estimate is. So any time you do a measurement, or throw a die, or measure the height of people or something, if you now calculate an average from, say, the 10 times you threw the die or the 10 or 100 times you measured the heights of adult males, that's kind of a random outcome, right? It depends a bit on chance. If you're flipping a coin a thousand times, you will get some sort of average. I don't know what the average is; sometimes it will be lower, sometimes it will be higher. If you're flipping it five times, there's going to be more variation. But this estimate of the average is in itself a random outcome, so we can start to think of that estimate as a random variable. But in this case it's a bit different: the more samples I take, the better the estimate, and the narrower the distribution of that estimate will be. So the more data points you have, the smaller you expect the spread of the estimates to be, because they're going to be closer and closer to each other. That is perfectly fine, because it's no longer the same process. Flipping a coin a thousand times is not the same thing as flipping it five times. Measuring the height of a thousand males is not the same thing as measuring the height of five males. So the fact that those two have different standard deviations is perfectly reasonable. The way you measure this, and you can actually prove this by defining exactly what these random processes are and everything, and that's when you do have to worry about these standard deviations of the
distributions versus the standard deviations of the sample, etc. The beautiful thing is that the final result is super simple. The standard error estimate is the previous standard deviation divided by the number of samples. Oh no, square root again, sorry. This is what happens when you start preparing slides close to midnight, my bad: s equals sigma divided by the square root of n. And it's on the slide, so I can't change it right away. At least I found it. I'll go back to the previous slide. What this tells you is that as you are adding more and more samples here, your estimate here will become narrower and narrower. You're going to get a better and better estimate of what the true average here is. Now, there's not necessarily any such thing as a free lunch. If you would like s to be a factor of two smaller, if you would like your estimate to be twice as good, you're going to need a factor of four more data points. If you would like this to be a factor of ten better, you're going to need a hundred times more data points. This quickly becomes prohibitively expensive, and that's why we have to accept that we can't get those beautiful, infinitely great samples with tens of thousands of data points. You have to make do with a very small number of data points in practice. Yep? Well, so the standard deviation, this is something I can measure from my samples, right? Can we have the formula here? This is just the average, which is my estimate of the expectation value.
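Written out, the three quantities discussed so far look as follows. The symbols are a reconstruction from the spoken description, not a copy of the slide: x-bar for the average, sigma for the standard deviation of the sample (with n minus one), and s for the standard error. The last line is the "factor of four for a factor of two" rule.

```latex
\begin{align}
  \bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i \\
  \sigma  &= \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2} \\
  s       &= \frac{\sigma}{\sqrt{n}}
           = \sqrt{\frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}{n\,(n-1)}} \\
  \frac{s(4n)}{s(n)} &= \frac{\sigma/\sqrt{4n}}{\sigma/\sqrt{n}} = \frac{1}{2}
\end{align}
```

The last line treats sigma as unchanged when more data is collected, which is the point made about the standard deviation being a property of the process.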
If you have five numbers, I can calculate what their average is. I can calculate the standard deviation of those five samples. So if I have those five samples, the standard error would then be the standard deviation divided by the square root of n. So that's going to be the square root of the sum over all i of x_i minus the average, squared, divided by n times n minus one. Whether you have five or five hundred samples, that sum goes over the samples. First you need to go over them once and calculate what the average is, and then you go through them again and calculate the sum of the squares of the deviations from the average. Once you've done that, we can calculate the standard error. This is what you see in virtually every single plot in a biology textbook or paper when you see standard error bars. The way you call this is either SE or SEM. This is so common that people are not going to waste the space in their precious article to write it out.
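The two passes over the data described above can be sketched in Python as follows. The five heights are hypothetical numbers, and the function name is made up; the calculation itself is just the n minus one standard deviation divided by the square root of n.

```python
import math

def mean_sd_sem(samples):
    """Return (average, sample standard deviation, standard error of the mean)."""
    n = len(samples)
    mean = sum(samples) / n                                     # first pass: the average
    squared_deviations = sum((x - mean) ** 2 for x in samples)  # second pass
    sd = math.sqrt(squared_deviations / (n - 1))                # n - 1: sample standard deviation
    sem = sd / math.sqrt(n)                                     # standard error of the mean
    return mean, sd, sem

heights_cm = [172.0, 181.0, 168.0, 177.0, 185.0]  # hypothetical measurements
mean, sd, sem = mean_sd_sem(heights_cm)
print(f"mean = {mean:.1f} cm, SD = {sd:.1f} cm, SEM = {sem:.1f} cm (n = {len(heights_cm)})")
```

In a figure legend this would be reported as the mean plus or minus the SEM together with n, so that the reader can judge how much the error bar can be trusted.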
Authors will just write X ± SEM right next to the plot. That tells you that in this case the thing that they have marked is the standard error of the mean. So the key difference here is: no matter how much data you collect, the standard deviation is a property of the sample or the distribution that does not change whether you have 10 or a billion samples. s, on the other hand, is your estimate of how good your estimate of the true expectation value is. The more data you collect, the more this will go down, and it will go down as one over the square root of the number of samples. When it comes to these formulas, I think you should know them. At least you definitely need to know the pattern that s shrinks with the square root of the number of samples. And the other reason for knowing them is that there are so many papers where this comes into play, and you need to understand it. Yep? No, not the standard deviation; the standard error estimate s. If you have more samples... sorry? Oh. So no, it's right. It's correct in this case. When we try to estimate the expectation value, you don't know what the expectation value is, right? So we estimate the expectation value by taking many data points. If we are not happy with that estimate, we can collect more data points. And each time we calculate this average, that is a random process. If I flip a coin a thousand times, I can calculate the average, but I don't know in advance what the average will be; this will depend a little bit on the coin flips, right? Sometimes it will be higher.
Sometimes it will be lower. So any time I flip a coin 10 times, the average there is going to be a random outcome, roughly with a normal distribution. I'm going to get roughly five heads, but it's not exactly five. As long as I keep this to ten flips, it's the same random process. If I go to 25 flips instead, it's a different random process, because I expect the average there to have a smaller variation, since I'm flipping the coin more times. Do you agree with that? So in one case I will have a large variation, and in the second case I will have much less variation, because I'm going to get a better average there. So this is one random process, A, and this is another random process, B. For each such random process, if we now start to think of not the individual samples but the average as the random value, in this case I have one standard deviation, and in that case I have another standard deviation. But this is a different standard deviation. This is the standard deviation of the average I'm calculating, right? The reason why this is complicated is that, under the hood, I got these averages from flipping a coin, and flipping a coin has an underlying standard deviation. But that's a lower-level standard deviation. And you can imagine how complicated it gets if we all start to say we have standard deviations and standard deviations, and they are completely different. It's not a coincidence that you're confused. But I had to call it standard deviation here, because it is a standard deviation. So instead of being confused all the time, we say: let's invent a slightly different name for the second one. When you just flip a coin once, the process of flipping a coin once has an inherent variation. We'll keep the old name for that; that's what we call the standard deviation. This new property, where we describe how accurate my estimate is?
Let's invent a different name for that, and that's why we call it the standard error. From now on, forget the fact that it really is a standard deviation; let's call the second one the standard error. Sorry, if I go back to this one again: for each person here, there's a standard deviation of an individual's height. The standard error is how accurately we have estimated the expectation value. Actually, I'll stay with that plot. Of course, if you go to a billion people here, the distribution will be roughly the same, but we're going to get a much, much better estimate of the expectation value. So the standard deviation will be the same, but the standard error is going to be better the more samples you have. That's where you can get better results by adding more samples. It's not going to change the standard deviation, but it will improve your standard error a lot. But that doesn't mean that it's trivial. What you're going to see is plots like this, right? These standard error bars; that's what they typically are. So you see mean ± SEM, n = 5. That's their approximation of infinity. So the value 10 I told you about: you should be really happy if you find data sets that have 10. Here they have assumed that they have done all their statistics, and believe me, I've done this for three. We all do this. If you were to correctly characterize the underlying statistical distribution of this particular sample, Jesus, you would need a PhD just to try to derive one such standard error. You can't do that. And of course, this could be wrong. There might very well be a factor-of-two error in that error bar. But we're talking about orders of magnitude here, right?
So knowing that it is roughly this versus roughly that, that's what is important. We all do this: we assume that even if you just have three, four or five samples, we can apply all this math, which is strictly speaking not correct. What you should worry about is not the people who do this. What you should worry about is the people who don't even understand what this means, because the really dangerous part is when they assume that, oh, the left plot is obviously higher than the right one, and then you have no standard errors whatsoever. That's the problem. So in this case, is this statistically significant? The treated one, I don't know, this is an enzyme activity. The treated one here has lower enzyme activity, and then you have a standard error bar there. This has to do with probabilities, right? I would say that most experimentalists, and me too, would probably say: oh wow, yeah, that looks statistically significant, the error bars don't overlap. And this depends a bit on what you do. If this is a small supporting experiment, where you wanted to check some other hypothesis that says that the treated one should be lower, you might very well be happy with it. Again, I'm not saying that you should be sloppy, but you can't invest your entire career in every single plot you're making, because if you spent a thousand hours on that plot you wouldn't get anywhere. You have to prioritize the things you do. If this is a small supporting plot, I would be perfectly happy with this. But you can translate this, assuming that you have normal distributions, which we don't have, but now we pretend that we do again. If things are normal, for the mean plus minus one standard error of the mean you would expect 68 percent of the samples to fall there, and then you would expect 68 percent of the samples to fall there. So you can certainly calculate what the likelihood is that things overlap here and that they are not statistically significantly different. It takes a bit more mathematics to calculate that
when you have two samples, but you can certainly do it. My gut feeling is that somewhere here I would start to believe it, but it's not like I would bet my life on it. If you would like the first level, where a statistician would say that something is significant, one star (you see these stars in lots of plots), that corresponds to a 95 percent confidence interval. Then you're going to need plus minus almost two standard deviations. Within two standard deviations, something is significant with one star. That's the lowest level where a mathematical statistician is not ashamed of himself. This is what we do in biology, because biology is more complicated than mathematical statistics. If you want to get to very high accuracy, like three-star accuracy, you're going to need plus minus three standard deviations. Yep? Oh, sorry. So in that case I was talking about the distribution: how likely it is for samples to fall within the distribution. Yes. Well, in this case what you're saying is that it's a standard error. If you're looking at the right bar here, this is our estimate of the true value, right? So with 68 percent certainty the true value falls somewhere between that and that, and with 68 percent certainty the true value for the control falls somewhere between that level and that level. But then of course, if you now want to calculate whether these are significantly different from each other, it gets a bit more complicated. For one sample it's super easy: the likelihood that the true value is between those two is 68 percent. So in this case, once we moved over here, we're talking about a standard error, right?
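The 68 and 95 percent figures quoted above follow directly from the normal distribution. A short sketch, assuming an exactly normal distribution (which, as stressed above, real data with a handful of samples rarely is); the function name is just an illustration:

```python
import math


def coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))


for k in (1.0, 1.96, 2.0, 3.0):
    print(f"+/- {k} sigma: {100 * coverage(k):.2f} %")
```

The value 1.96 is the "almost two" standard deviations needed for a 95 percent interval.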
So now, and this can be 500 samples, I'm not talking about how much an individual sample varies. I'm talking about the variation in the actual value, the average here, which I don't know. We don't know what the underlying value is; this is our estimate of the underlying value. So then it's true: with 68 percent probability it falls between those two. To complicate things even further for you, it would be great if there were a standard for how to do things. There isn't. So suddenly you're going to see a figure of, I'm sorry, I'm well aware that this is more to keep track of, say, means with 95 percent confidence intervals, CIs. So here they have plotted confidence intervals with 95 percent; that is roughly two sigmas they have used. This is why you need to understand these things: initially it might look like that is a much, much better experiment than this one. The proportion of the overlap is like 0.6 here or something, so this looks like a sloppy experiment. It is the more careful experiment, because these people actually realized that you probably want 95 percent confidence intervals rather than 68. So it's dangerous just to look at error bars and say, oh, the smaller they are, the better they are. It's frequently the opposite: the ones that have the large ones are the ones who did their statistical homework. So if I had to invest here in a pharmaceutical study, if I was a venture capitalist and had a billion dollars, I probably wouldn't invest a billion dollars on the claim that this drug is better than that drug. There's quite a large likelihood that they're not statistically different. The reason why this is important is that in science in general we are increasingly using computers not just to measure things but to prove things. Do you remember the Higgs boson a couple of years ago? How do you know that you have found a new elementary particle? You use computers everywhere.
That's just statistics. How do you know that there really is a new particle with an energy of whatever giga-electron volts? I should know that. You calculate errors, right, and then you calculate the standard error of the mean, and at some point you have to decide: oh, this is so good that physicists say it is not just likely, it must be true. But this is ultimately just a matter of picking a threshold. There is no absolute truth here. It's just a matter of picking some standard error level that we think is good enough. When do you think a physicist says that something is so statistically significant that you approximate it by 100.0 percent? Usually around five sigma. Five sigma is a lot. You see that as you're going up here, this increases quite rapidly. At five sigma there are going to be lots of nines there. But at five sigma, physicists consider something absolute truth, which I guess most mathematicians would cry about, but that's the way it works in reality. That's all I'm going to say about statistical properties. I have some nucleic acids, but we're already at 10:25, so I suggest we take a break unless you have any questions about this. Yep? Sorry, say that again. So I did confuse you.
Well, the reason for that is that there are kind of two statistical processes, and one is the result of the other. When I want to calculate the standard error, that is really just the standard deviation of the statistical process of estimating the expectation value. Now, that process in turn relies on the other process of just calculating averages, which I get from flipping the coin or measuring the heights. The important thing here is that you don't need to be able to derive this; I wouldn't necessarily know how to derive these things. You need to understand the difference between them, and if you get a formula, you can't use them interchangeably. There is a world of difference between them, and you need to understand when you're after a standard deviation versus when you're after the standard error. So in general, again, the point with formulas is not to know them by heart. There are large formula books, and all this is of course available on Google and everywhere. But you need to know when you're going to use one or the other. So when should you use one or the other? Can you think of a case where we would like to use the standard deviation? Well, let's be a little bit more concrete. Say that you're Ikea and you're designing beds. Why would you like to know the standard deviation of something? Well, there are some very short people. We don't care about them, because they will fit in any bed. But there are also going to be some very tall people. Of course, one solution to that is to make sure that every bed is three meters long. That's going to be very expensive, and those beds won't fit anywhere.
Nobody will want to buy them. That's bad. But of course, if you start making your beds too short, there are going to be lots of Ikea customers who say: your beds are too short, I won't buy them. So you need to know how likely it is that there are some tall people in the population. And of course, nowadays we've standardized on a bed being two meters or two meters ten centimeters long or something. But that's based on how likely it is that you have some very large or very small values. Then you don't care about the average. I don't care what the average height is in the population; I want to know how likely it is that somebody's very tall. So when is it that you would like to know the standard error? In most cases, but certainly not all, this is what you're after: if you want to calculate an average and you want to say how likely it really is that this drug is better than the second one, if you want to present something to somebody else and you basically want to show how good a job you did. Then you usually want the standard error, but not always. It's 10:30. Let's meet at 10 minutes to 11, and then I'll spend some slides, 30 slides or so, talking about nucleic acids. So, we get started again. Lots of good questions here, but you should get some lunch, too. There were lots of good questions here in the break. Feel free to ask; you're not going to have any lab today, and Monday, Tuesday, Wednesday next week I'm going to be traveling, so as I mentioned, this is the last lecture. Spend Monday, Tuesday, Wednesday studying, because, as you remember, that's what you asked for in the beginning of the course, to have a couple of days to study before the exam, and now you got them. The second part is that, in principle,
I'm available almost 24/7 on email. I have no idea whether I'm going to have Wi-Fi during the day next week, when I'm in Oxford in particular, but basically you can mail me any time and I will try to get back to you as quickly as I can, with the caveat that if I get 200 emails there's going to be a bit of turnaround time. And then next Thursday I'm going to have a Q&A session. I think we're going to be here; I'll send around the room. We'll start at 9, and I'll be around until 1 p.m. at least, and then I'll come back after my next meeting if you need to talk to me longer. I promised, I promised myself in particular, to talk a little bit about nucleic acids, because you're going to hit them later on with the KTH courses and everything. We spoke a little bit about both RNA and DNA earlier in the course, and just to test that: what molecules are these? Yes, I wasn't quite thinking of it that simply. What type of RNA is it actually? It's a good answer, but it's tRNA. It's kind of the most famous RNA molecule you can imagine. What form of DNA is this? It's B-DNA, and how do you see that it's B-DNA? First, even if I didn't show you a slide, if I just told you to pick a common form of DNA, the most common one is B-DNA, so that would have been a good guess anyway. I'll tell you why in a second. How much do you know about nucleic acids? Are you familiar with them? So you can instantly tell what all these things are? So what's the difference, without looking at the next slide? When you say everything except the phosphate, that's correct, that is without the phosphate. And so what's everything without the phosphate? Sugar and the base.
Yeah, good, so you know a bit of this. The reason I bring these up is that the terms are occasionally used almost interchangeably, and you should know the difference, because occasionally it matters. The entire unit is a nucleotide. Remove the phosphate that links them, and you have the sugar and the base, which is the nucleoside. And if you remove the sugar too, you just have the base. Hopefully you all know those bases; I won't go through the details. Sorry, that's a bit of a busy slide. And you also have uracil versus thymine. That's the difference: uracil for RNA and thymine for DNA. And I'm sure you all know how you pair these up with hydrogen bonds, right? So what's the difference between a purine and a pyrimidine? Sorry? Well, so which is which? Let's go back then and have a look. Purines: adenine and guanine. Pyrimidines: cytosine, uracil and thymine. In principle, these are the hydrogen bonds that you're used to always seeing, right? Are these the only ways they can hydrogen bond? No. How common are these hydrogen bonds? What fraction of bases would you expect to have these hydrogen bonds? I would even be a bit more generous, say 99 percent. They're by far the most common ones. This is true both for DNA and RNA. We're going to start by looking just a little bit more into DNA first, and then we'll talk about RNA. So when it comes to DNA: how many of you have read Watson's The Double Helix? If you haven't, do it. It's a very small book, but it's a really amazing book about scientific discovery. James Watson is a very special person, nowadays in particular, but, well, let's not talk about Watson right now, because I'm not necessarily his biggest fan. But this is a great book.
It's a great book about scientific discovery. DNA exists in three forms: A, B and Z. Z you will virtually never see. A and B have to do with the hydration in particular. By far the most common form is B-DNA, which is also the one Watson and Crick first discovered. B-DNA is characterized, and this is how you could have seen it on the previous slide, by having two grooves: a large separation between strands and then a small separation between strands. That just has to do with how the double helix turns. So this is called the minor groove and this is called the major groove. You can probably guess why this is major and this is minor, right? So what are those used for? Are they different? Well, they're obviously different in size, but do they have any functional difference? They decide where things bind. You're going to have proteins and RNA and lots of other stuff binding. Some of them will only bind in the major one, others will only bind in the minor one. In general, the bases are obviously going to be more accessible in the major one, so you would almost imagine that the major groove should be the most common binding site. But it turns out that one of the most important proteins binds in the minor groove: the TATA box binding protein. What does the TATA box binding protein do? Do you know it? The promoter is basically the start of a gene, right? You start to recognize: here's a gene, we should start reading this gene and turn it into RNA and produce proteins. So it's one of the most important proteins that bind DNA. We talked about proteins before, and I might even have had something on this: the TATA box binding protein contains a bunch of beta sheets in particular, and you have the beta sheets pointing into this groove. Why do you think the beta sheets point into the groove?
So they make hydrogen bonds with all those unpaired hydrogen bond partners, and it's going to bind really beautifully. Nowadays we have lots of examples of studies of this, in eukaryotes, archaea and everything, showing different types of TATA box binding protein complexes, how they vary between the different kingdoms and how they twist the DNA. We know a lot about this today. Could you imagine doing something here, either in nature or artificially? If you would like to interfere with this process, what should you do? What would happen if you interfered with this process? You would silence the gene, right? So you wouldn't have it transcribed, which might or might not be good. Well, I assume that if you want to do it, it's something you want for a good reason. So how would you silence this gene? So how would you interfere with the sequence of the protein? Yes. So rather than doing this 50 times, how are you going to interfere in the first place? You're basically saying that at some point you're going to need to do gene therapy, to interfere with the sequence of your protein or inject a different type of protein. Well, the body of course does it, and it happens naturally. In general, if this is a natural process that you would like to turn on or off, say fetal hemoglobin that you would like to turn off or something, then basically, if you could put some sort of patch or sticky plaster over this region, the protein would not be able to bind. And that's what the body does. It's called a lambda repressor. The lambda repressor is a pretty neat, cool protein. It's a helix-turn-helix motif here that binds in the major groove, and then an exactly identical helix-turn-helix motif that binds in the major groove here, and then you have a dimer. So this sits like a patch just over the minor groove.
There is no way the TATA box binding protein can access this. Sorry, I didn't catch that. Yeah, so it depends, actually. It's a good question. So what will decide whether it comes off or not? Free energy of binding. I have no idea what the free energy of binding is. Obviously it's going to have a decent free energy of binding; otherwise it would never work, right? Otherwise it would just go off instantly. But there is far more than the lambda repressor. There are tons of repressors in your cells. Some of them, like the one for fetal hemoglobin: we pretty much turn it off the day we're born, and then it never turns on again. So that's going to be some very strong process. And then there are other processes: there are genes that you might want to express when your cells are dividing, and that's going to happen every few hours or something. So there's a wide variation in whether you want to turn them on or off. Could you imagine some other reason why you would like to turn a gene off? Cancer. What basically happens with a tumor cell is that the cells start to just divide, and you start producing more proteins. As I mentioned, the entire metabolism is going into overdrive. So what if you start to randomly repress all transcription in your entire body? Well, if I did it completely you would die. But if you started doing it a bit, what would happen? Would it be good for you? No. What cells would it be worst for? Rapidly dividing cells, which are which ones? The cancer cells. So it would be far worse for the cancer cells. So what if you add some small drug? This is actually a real drug. It's called cisplatin. It's a platinum ion with two chlorides and two ammine (ammonia) groups. It binds to nitrogens on two bases here.
It's going to be a super strong bond here. It's a covalent bond, and it's so strong that you will start to bend the entire DNA. You can imagine that this DNA will not be able to be transcribed. This is chemotherapy; cisplatin is used in chemotherapy. So you're basically adding things that bind to your DNA to prevent transcription. Cisplatin is kind of amazing, because how this drug enters the cells, we have no idea. Just how it goes through the membrane is unknown, but it's a very important drug. Sorry? No, actually, there is a nitrogen on each base here. I think it's atom N7. So the chlorides here are released, and then you form covalent bonds directly between the platinum and N7, I think. So there are quite a few things you can do when you interact with DNA. The other thing I mentioned is that although the Watson-Crick base pairs are by far the most common ones, there are a couple of different ones. You can have some base pairs that are wobbled; you see here that you have changed the orientation of the adenine or the cytosine. They're not the most important ones, actually. These are still normal base pairs in the sense that they are paired. But you can create some other, really complicated pairs, which are called Hoogsteen base pairing, and then you can even end up with interactions between three nucleotides. Hoogsteen base pairs are also based on changing some of these bonds between the sugar and the base, rotating those around. Can you imagine which pairs are going to be best, Watson-Crick or Hoogsteen? Why is Watson-Crick better than Hoogsteen? Because you see them, because you see 99 percent of them. This is the Boltzmann distribution, right?
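As a rough sketch of that Boltzmann argument: if two states are populated 99 to 1 at equilibrium, the free energy gap is RT times the logarithm of the population ratio. The 99 percent figure is the lecture's ballpark; the temperature of 298.15 K and the two-state assumption are illustrative choices, not measured values.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol K)


def free_energy_difference(fraction_major, temperature_k=298.15):
    """Free energy gap in kJ/mol between two states, from the majority population.

    Assumes a simple two-state equilibrium: ratio = exp(dG / RT).
    """
    ratio = fraction_major / (1.0 - fraction_major)
    return R_KJ_PER_MOL_K * temperature_k * math.log(ratio)


# 99 % Watson-Crick versus 1 % everything else (illustrative two-state model).
print(round(free_energy_difference(0.99), 1), "kJ/mol")
```

So a 99 to 1 preference corresponds to a gap of only a few times RT, which is the sense in which "you see 99 percent of them" already tells you which pairing is lower in free energy.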
If there's 99 percent of them, you can pretty much derive, to first approximation, what the difference in free energy between them is, because it's less than 1 percent that are Hoogsteen base pairs. So why on earth am I telling you about Hoogsteen base pairs? When I was your age, nobody cared about Hoogsteen base pairs. Yes, you would mention them in this sort of structural biology course, but they were completely ridiculous and irrelevant for everybody, except Hoogsteen, I guess. What happened just a handful of years ago is this. Remember those base pairs where I said that in some cases you can have interactions between three nucleotides? If each nucleotide can interact with the next one, you can form quadruples of nucleotides, where you have one, two, three, four, in particular if they're all guanine. They're all going to have oxygens on the inside here, and these oxygens are slightly negatively charged. If you now put a metal ion, say potassium or something, in here, then that potassium is going to interact with all those oxygens, and you're going to get a fairly nice, stable structure. And this happens. Because this only works really well for guanine, if you have small stretches of DNA with lots of guanine, they can form these quadruple strands, where it's not that they interact pairwise and then the pairs interact. For each of these layers you have four bases all bound to each other. So they're quadruples instead of pairs, and that's why they're called G-quadruplexes: G for guanine, and then there are four of them. And I think, yes, I even have a picture of how the structure looks. All the green parts here are the metal ions, in this case three or four of them, and then you have layers of these. The reason why these complexes are important is that we've suddenly started to find them in genomes, and they're very common towards the ends of your chromosomes, the telomeric regions. And there are some indications that they're probably very important to regulate oncogenes
and everything. And any time somebody says oncogene, it's related to cancer, right? So there's quite a lot of research, both in the US and at Umeå University in the north of Sweden, on (a) trying to find genomic ways of identifying G-quadruplexes just from the sequences of your DNA, and (b) can you identify the parts of DNA that actually have formed G-quadruplexes, and can you somehow try to use this in therapy, mostly to treat cancers? This is super fresh research; everything has happened in the last three or four years. Again, until 10 years ago I had never heard of a G-quadruplex, and today it's some of the hottest stuff you can imagine in DNA. Yep? So what likely happens here is that you have, you see, one, two, three, four oxygens in each layer, right? So yes, the cations will certainly repel each other. There might actually be some water between pairs of cations; I don't know that. But in general it's going to be all these oxygens that help stabilize them, and the good interaction with four oxygens is obviously better than the negative interaction of the cations repelling each other a bit. And I would also guess, I don't remember exactly what it is, but the distance between the layers here is so large that you might very well be able to fit a water or so between each pair of cations. I have no idea. This is one of those things that frequently happen in science. There is likely one of two things that's going to happen here. Either, in five years, people realize that they didn't find out anything, that we couldn't learn anything about cancer, that it doesn't help us, and then it will gradually fade out, and the next generation won't have heard of quadruplexes again, because it was just a parenthesis. Or this takes off and becomes something super hot.
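One simple way to look for candidate G-quadruplex sites "just from the sequence" is a pattern search for four runs of guanines separated by short loops. The sketch below uses a widely used heuristic pattern (runs of at least three G, loops of one to seven bases); it is an assumption for illustration, not necessarily what the research groups mentioned above do, and a match only marks a putative site, not a structure that has actually formed. It also scans only the given strand.

```python
import re

# Heuristic: four runs of >= 3 G separated by loops of 1-7 bases (assumed pattern).
G4_PATTERN = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")


def find_putative_g4(sequence):
    """Return (start, end, match) for each non-overlapping putative G4 motif."""
    seq = sequence.upper()
    return [(m.start(), m.end(), m.group()) for m in G4_PATTERN.finditer(seq)]


telomere_like = "TTAGGG" * 4  # four copies of the human telomeric repeat
print(find_putative_g4(telomere_like))
print(find_putative_g4("ATATATATCGCGATAT"))  # no G runs, so no hits
```

The telomeric repeat is a natural test case, since telomeric regions are where these structures are reported to be common.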
I have no idea. But keep an eye open for it, because you're going to see lots of papers on this in Nature and Science in the next few years. Just a quick question: thinking about hemoglobin, where the binding of one influences the other, is it the same here, that if one of those binds, the next one does too? How would these form in the cell? Yeah, I have no idea. But remember, this is more complicated than normal DNA folding, right? Normal DNA folding is just like a zipper: you're zipping up two strands. Here you're going to need to have four strands involved. I have no idea how they fold. I'm not sure if anybody knows. All we know is that you probably don't even need the metal ions for them to form. We know that the metal ions stabilize them, due to all these oxygens. So I would guess that you probably have the DNA forming these quadruplexes first, and then later on they are bound by metal ions. It's probably a research project for somebody. So it appears that DNA is not just that pure, simple double helix structure that continues to infinity. There is more structure in DNA than you think. You can start imagining this as some sort of secondary structure for DNA, or a motif or something, although this is a very special one. It's not like proteins, where you're going to have hundreds of them. But we know much more about DNA today than we did just a few decades ago. So there is kind of a secondary structure to DNA.
Well, you could argue that the helix itself is kind of the secondary structure. But once you have a helix, these helices will not just float around in liquid in the cell, because in each cell you have roughly two meters of DNA. With that molecule just floating around, first, it would break, and it would interfere with way too many things. So what DNA in your cell does is that it tends to be wound up around small helical proteins called histones. The combination of DNA and this histone protein is called chromatin. It's not a separate molecule or anything; it's just that we call it chromatin. So why on earth do we have a separate name for DNA plus a protein? Because we discovered it the other way around. You start looking at your chromosomes; you can find the chromosomes in the microscope, right? And your chromosomes consist of something. You don't know exactly what it consists of, so you call that matter chromatin. And then, of course, decades later we found out that chromatin is the combination of DNA and these proteins. We didn't know what the protein structure was at the time. But you have more than one protein here: you have one, two, three, four proteins here, and if you have multiple turns of this you're actually going to have eight proteins. So the real chromatin structure is that the DNA comes in and is wound around roughly four histones, then you have a second turn with another four histones, and then you have a small linker region to the next bead of eight histone proteins. Or, sorry, wrong direction. The size of each of these beads is in the ballpark of a few nanometers, say 10 nanometers or so. This is the way the DNA exists in all your cells right now. Forget about that stretched-out DNA; it doesn't exist, because pretty much none of your cells are two meters long. And this continues on many different levels. So you end up having
these chromatin bundles of histone proteins. They tend to be packed in some sort of fiber structures that might be 30 or 40 nanometers across. Then you have these extended fibers, and these are in turn condensed into some sort of super-helical structure, which is your chromosomes. So your chromosome consists only of DNA and the histone proteins. You can imagine that it's kind of compact: you have wound it up into the entire chromosome, right? Can you see any problems with that? How many base pairs are there in the DNA in your body, roughly? Three billion. And of course that's spread over multiple chromosomes, but three billion is a large number. So let's say about a billion per chromosome or so; well, a tenth of a billion, maybe 100 million per chromosome. Can you see any problem with that? You're going to need to find that TATA box. I'm not sure about you, but imagine you're in a library with 100 million books on the shelf, you need to find four of them in order, and they're all wrapped up into one big coil. How on earth do you find the TATA box? And we're not just talking about finding it there, right? First you need to go there, you need to unwind this, you need to open the condensed section, you need to open this section, you need to open the chromatin, you need to open the beads, and then you need to get into the DNA. And this happens tens of thousands of times per second in your cells. The second problem is that in this so-called library of yours, with 100 million books, all your books are connected together on one string. And now you are also going to have roughly one million students in the library at the same time, pulling out books, and they're all connected with the string. Can you imagine what would happen? Things would be completely tangled up and the string would wrap across itself. You would have knots, and nothing would work after a while. Yep? Well, how do you think it works?
Well, how do you decide what's packed and not packed? Well, the histone isn't smart. That's the very thing: we don't know. We had no idea. When we talk about the three-dimensional structure of DNA, you tend to see that picture, right? We don't see these pictures, and the problem is that this is not like a protein. Even the chromosome you can see in a microscope, of course, but on all these levels it's not as well-ordered as, say, helix secondary structures. So you can't determine it; there's not one unique structure. It's flexible, and it will move all the time. Of course, just as you have order in a lipid bilayer, which you saw in the lab yesterday, you can't determine the exact conformation of each lipid, but there is definitely order anyway. It's kind of the same problem here. There is order here, but we can't determine what it is, so it's a black box for us. There is way more order in the three-dimensional structure of DNA than you think, but because we haven't determined it, we literally don't know what it is. So the problem is that, to a first approximation, you can think of this chromosome as everything wrapped up in a ball. And I don't think it's going to be easier if it's stretched out; it will probably just become more entangled. It's not an easy problem to find this TATA box in a gigantic folded chromosome. Even if you could do this once, even if it was true that your particular gene would be unfolded: imagine having this piece of yarn, and I start to pull on the red fiber, and then you start to pull on the yellow one and you start to pull on the green one. What's going to happen? They're going to be completely interlocked, right? I won't be able to pull out the red one. And I'm going to scream at you: stop pulling on the yellow one.
You're disturbing my transcription here! And in the meantime, you're going to say the same thing about the blue one, the light blue one and so on. So it's not going to work. You can't have a structure like this, because everything would be interfering with everything else. And yet for a long time that's how we thought of chromosomes, as kind of random. It's not random. So the problem is that at some point you need to determine the 3D structure of DNA. How do you determine a 3D structure? You can't do this with cryo-EM, because it's not ordered enough. You can't do it with X-ray; it's not ordered enough. Well, if it's not ordered enough, you're going to need to go with some sort of lower-resolution approach, right? We can't determine the exact coordinates of DNA, but can I create some sort of map so I know what parts of the DNA are close to other parts of the DNA? There was a really cool technique developed a few years ago by Eric Lander and Erez Lieberman Aiden called Hi-C. And I'm going to describe this technique to you not because you need to know it, but because I think it's a beautiful illustration of how modern high-throughput research works. So they decided very early on to completely give up on structure. They're not going to do this with normal structure methods. So what they did is that they took DNA and introduced chemical crosslinks at the places where two stretches sit next to each other, so that they bind to each other. Basically, I crosslink the DNA. Now, once I've created a bunch of crosslinks in my DNA, I can also use restriction enzymes to cut the DNA into pieces. I'm completely killing the cell of course, but that's intentional. So I now have really, really short strands of DNA, maybe 20 to 100 base pairs, and then I can fill in the ends to make sure that they don't keep extending, and then I get these to pair: I ligate the two ends together.
So I now have very, very small closed vectors of DNA; think of this as maybe 50 to 200 base pairs. Once I have that, I can use next-generation sequencing techniques. This is kind of similar to the libraries you build up in next-generation sequencing; I can tell you a little bit about how that works. So once you have this, I can basically use PCR and create a billion copies of it, and then I sequence it. Sequencing is cheap and fast, super cheap and super fast. But since I know both the red and the blue part, I know where the red part was in my DNA and I know where the blue part was. I know that, okay, these two parts were close to each other. Not just close to each other in sequence, but close to each other in structure, because this would only happen if they were reasonably close to each other in space. So now I can suddenly say that, you know, this part of chromosome 14 was close to that part of chromosome 14. I have no idea what the structure of chromosome 14 is. Yes, but the point is that red here means that things are close to each other in space. And there are of course lots of things along the diagonal. That's not so interesting. But the interesting thing is that the end part of chromosome 14 here is apparently, sorry, no, the start part of chromosome 14 there is relatively close to that end part. So there are lots of things outside the diagonal here that are red. So I'm basically starting to create a very low-resolution map of what parts of the chromosome are close to other parts of the chromosome. The resolution of these spatial proximity maps is roughly one million base pairs. And if you just do the math from one million to three billion, it doesn't sound like that much, but just diagonalizing and working with these matrices requires a huge amount of computational power. And you need to do it for every chromosome independently. So why on earth are they doing this with sequencing?
I mean, apart from the fact that it's a good idea and worked well. If the resolution here is roughly one million base pairs, then three billion base pairs gives me 3000 bins, so I basically have 3000 squared data points. So I have nine million data points, because even the white points are data points too. The white points tell me that these two parts are not close to each other. So within roughly one high-throughput sequencing experiment here, I get nine million structured data points. And I think it's a beautiful example. One of the reasons why sequencing is so important is not necessarily that you determine genome sequences. That sounds stupid, I know, but sequencing is cheaper and faster than any other method in natural science when it comes to producing information about biological systems. And this is an example: any other problem that you can convert into a sequencing problem, you can solve. So remember our original problem here, that we wanted to understand the spatial proximity of DNA structures, right? This is a really difficult problem. But because they found a way to turn this into a sequencing problem, they could solve it cheaply. And I think this is a way you should try to think of things. Don't necessarily think of sequencing as just determining genomes. God knows there's going to be more of that: once we've determined 100,000 genomes, people are going to go on to determining one million genomes, and eventually they're going to determine one billion genomes and everything. There are certainly things you can learn just by determining more vanilla genomes. But think of using sequencing to solve problems that are not necessarily intrinsically sequencing problems. Yep? No, maybe, but remember that the resolution we have here is roughly one million base pairs, right? So we're talking about a relatively coarse resolution. I don't worry about the next loop of DNA.
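As a rough sketch of the bookkeeping just described, the snippet below redoes the three-billion-over-one-million arithmetic and shows how sequenced red/blue pairs could be binned into a contact map. The input format (a list of position pairs on one chromosome) and the names `n_bins` and `contact_counts` are assumptions made for illustration only; this is not the published Hi-C pipeline.

```python
from collections import Counter

GENOME_BP = 3_000_000_000  # rough genome size quoted in the lecture
BIN_BP = 1_000_000         # rough map resolution quoted in the lecture


def n_bins(length_bp, bin_bp=BIN_BP):
    """Number of bins needed to cover length_bp (ceiling division)."""
    return -(-length_bp // bin_bp)


def contact_counts(read_pairs, bin_bp=BIN_BP):
    """Count ligated pairs per (bin_i, bin_j) cell, with bin_i <= bin_j.

    read_pairs is a hypothetical input: (pos_a, pos_b) positions, in base
    pairs, of the two halves of each sequenced fragment.
    """
    counts = Counter()
    for pos_a, pos_b in read_pairs:
        i, j = sorted((pos_a // bin_bp, pos_b // bin_bp))
        counts[(i, j)] += 1
    return counts


bins = n_bins(GENOME_BP)
print(bins, bins * bins)  # 3000 9000000

# Three made-up pairs: one near the diagonal, two far off the diagonal.
pairs = [(1_200_000, 1_900_000), (1_500_000, 98_700_000), (2_100_000, 98_200_000)]
counts = contact_counts(pairs)
print(counts[(1, 1)], counts[(1, 98)], counts[(0, 50)])  # 1 1 0
```

A cell with a count of zero is the white point of the plot: it still says that those two bins were not seen close together.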
We're talking about something one million base pairs away. So it's a first approximation. Yep? So I'll get to that, what they were able to do with this information. They were able to build very coarse models of what parts of DNA interact with other parts. And in particular they were able to prove that the naive picture, where you just take a chain, imagine having a chain like water that you freeze, is not how these things work. DNA structures end up being very well ordered, in something you call a fractal globule. Do you see that you have the red part in one area, the green part in one area? This is a bit of a curiosity, but there is a mathematical or physical concept that is called the Hilbert curve. It's a simple one-dimensional string that you start painting not just in two but even in three or more dimensions, and it never crosses itself. We don't know why DNA is ordered this way, but we know from the previous experiments that you have these large patches of DNA that tend to be lumped together, and that's almost what we see. This is not very high resolution, but you start having patches that are very close, and then they mostly only have local interactions. So what they were able to deduce from this is that, because we have these fairly well-ordered globules, what likely happens when you try to transcribe is that when you start pulling here, do you see, suddenly I can pull out just the green part without necessarily influencing the red part or the yellow part. So this took the world by storm a few years ago. I think it was in 2009 that they first published it, and you have a bunch of physicists who are now going after this. So the two real superstars in this area: one is Erez Lieberman Aiden, who is continuing with that, and the other one is Leonid Mirny. And I like to have modern information in the course, but it's kind of cool because this was published yesterday in Nature. It's
actually not an original article. You remember that I talked about these reviews, or News and Views? So this is one of their staff writers, Elie Dolgin, who wrote a little bit about Leonid Mirny's work. And what they have now started investigating is exactly what you asked: how on earth do we fold DNA this way? We can say that DNA must be ordered this way, but what determines the ordering, just as what determines protein folding? And this is so not resolved. But they have these fancy new ideas where you have small cohesin proteins that basically help form these long loops, but at some point you're going to have some other proteins that stop the loops so that they don't really grow more, and then it depends on which direction these proteins are sitting. So you're basically going to create some sort of local loops that are large enough to pack locally, so that you don't have the entire 2 meters of protein, sorry, 2 meters of DNA packing at the same time. I've uploaded that paper to the Mondo server if you want to look at it. It's just two pages, and it's not something we're going to ask about at the exam. But I think it's a beautiful example of the very easily accessible papers in Nature. It's not an original research paper. It's a super easy paper to read. I tend to read these every week. And if you want the original data, the original data is not so much older, because in February this year there were some of the first experimental confirmations that these models are likely correct. So I think that was published on February 3. I bet within five years all this will have changed, because this is really where the research front is when it comes to DNA structure today. So these cohesins are small proteins, but we have literally no idea whether it's a single or double loop and so on; people are fighting about this. We don't know. This is so not established. This is only the model some people believe in, in this case.
This is, I think, Leonid Mirny's model. Erez Lieberman Aiden doesn't agree. And sorry, that's the problem with the research front: the closer you get to the research front, the more people disagree. But the cool thing is that we're finally starting to learn something about three-dimensional DNA structure. And there is much more structure and order in DNA than you might think from those simple chromosome pictures. The other way to think about this is that science is difficult because real science happens on the white spots on the map, right? You might think that some of the things we talked about, the alpha helices in proteins and everything, are obvious, that it's obvious that a helix should look that way. But of course in the 1950s even simple helices and sheets were like this: people were fighting. They didn't agree on exactly what the secondary structure should look like, and experiments were pointing in different ways. So this is really how knowledge forms. Knowledge forms where we don't agree, and then eventually at some point we do agree, and that's when it goes into the textbooks. But this is not going to go into textbooks until five or 10 years from now. And then again, this model might very well not be correct. But it's useful. So the other part is RNA. RNA also has secondary structure, and in fact RNA has way more secondary structure than DNA. Have you seen some of these plots with RNA secondary structure? Why does RNA have so much secondary structure? Partly, but also... That's of course correct. Well, that's the functional reason why it needs it. But the structural reason why it has it is that it's a single strand, right?
So depending on how it pairs up, it will create different secondary structures. It is surprisingly easy to predict RNA structure, and that's because virtually all these interactions are close range. The partner might be 10 residues away, but you never have any of those long-range interactions that you have in proteins, virtually never at least. So when it comes to RNA structure, even though it is floppy and flexible, this is way easier to predict than protein structure. This is something we do computationally all the time. Well, I don't, because I don't work with RNA, but it's not hard to predict RNA structure. The other difference with RNA is that RNA is super fragile. If you work with RNA in the lab, what do you do with it? You keep it on ice. And that's because it's so fragile that even room temperature will start to degrade RNA. That's incidentally why, when we do this, we tend to work with DNA. DNA you can ship all over the world, you can leave it on the counter, it doesn't matter if somebody forgets to put it in the fridge over the weekend. DNA will stand absolutely anything. The only problem with DNA is when we have our cells. If we work with RNA, you just inject a little bit of it into the cell, and then it will find its own way. If you're working with DNA, we need to inject it all the way into the nucleus for it to work. But it's far easier to learn to inject things into the nucleus than to work with RNA.
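Going back to the point that RNA pairing is mostly short range and therefore tractable to predict: the sketch below uses the classic Nussinov dynamic program, which simply maximizes the number of nested base pairs. This is a swapped-in textbook illustration, not the method the lecture refers to; practical predictors minimize a free energy instead of counting pairs. The function names and the `min_loop=3` hairpin limit are assumptions of this sketch.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(a, b):
    """Watson-Crick pairs plus the G-U wobble pair."""
    return (a, b) in PAIRS


def max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (Nussinov recursion).

    min_loop is the minimum number of unpaired bases enclosed by a pair,
    so that a hairpin cannot bend back on itself too sharply.
    """
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with k
                if can_pair(seq[k], seq[j]):
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


print(max_pairs("GGGAAAUCCC"))  # 3
```

The cost grows roughly with the cube of the sequence length, which is one way to see why this is routinely done on a computer.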
So that's why we normally work with DNA. The other part is that if you just look at the difference in the sugar here, the RNA has this hydroxyl group here, and that hydroxyl group is fairly reactive too. And if that wasn't enough, you have ribonucleases. This is a constant pain in the lab. Again, if you've ever worked with RNA, you know this, but for the ones who haven't: your body has RNases, ribonucleases, to degrade RNA, because we occasionally need to degrade it in the cells once you've used it. The problem is that these are virtually indestructible. RNA is easily destructible, but the RNases are not. If you autoclave it, it survives. Alcohol, anything: it survives virtually anything. So what you do in most labs is that you have one lab bench that we mark with yellow tape and say: RNA work only. And that means that if you as much as touch that lab bench with your filthy fingers full of RNases, the person responsible for the bench will kill you. Keep that lab bench clean from anything but RNA work. And when you work with RNA, you should wear gloves and everything. You might think that you're wearing the gloves to protect you. You're wearing the gloves to protect your precious RNA. There are always more fingers, but the RNA is precious. So any time you work with RNA, go into super careful mode. DNA will stand anything; RNA is pretty much the opposite: if you even think about RNA, you will start to degrade it. Then we have multiple types of RNA. That's the other part of the thing. In principle, the units are the same, right? But you've probably heard of messenger, transfer and ribosomal RNA. Let's start close to the protein. The main thing to understand about messenger RNA is that it's on the messenger RNA level that we get rid of all the introns. So when we're reading the DNA, in particular in a eukaryote, we're literally making a copy of the DNA with all the introns, the so-called junk. It's not junk.
They regulate things. But in the sense of synthesizing a protein, the introns are junk. So on the messenger RNA level we have the pre-messenger RNA, from which we cut out all the introns to get the actual protein-coding region. And that's the reason why we need a separate molecule for this. So what do you know about the structure of messenger RNA? This is the one structure you don't see, because on the messenger RNA level this is just a small floppy chain. I don't even remember ever seeing any messenger RNA structure. Sorry. The messenger RNA will then move things over, in particular to the transfer RNA. And the transfer RNA is what you have in the ribosome. The transfer RNA will bind an amino acid. Actually, let's look at the molecule up here. The transfer RNA, up here in the yellow or gray part, will bind an amino acid, a single amino acid. And then in this entire carriage down here you have one, two, three bases, which is the anticodon. So that's the anticodon. That's the complementary part. Again, if I have AAA for lysine or something, I should have UUU here, the opposite, and that has to match my amino acid up there. So each such transfer RNA molecule will come with one amino acid molecule bound, and then it has a particular pattern here that binds to the messenger RNA in the ribosome here. And then they polymerize, and then the tRNA is empty and it leaves the ribosome and goes out. And this is an insane machinery. You also have to make this happen and make the messenger RNA stable in the ribosome and gradually pull it through the ribosome. The ribosome is not really a protein by definition. The ribosome is more RNA than protein. So you have huge amounts of RNA. And nowadays we have pretty good secondary structure maps. So this is just E. coli, which is a dirt-simple cell, and just the large subunit, the one on the left. All this is the ribosomal RNA in it, pretty complicated. And for a long time we were pretty uninterested in RNA, to tell the truth.
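The codon and anticodon matching described above (AAA for lysine pairing with UUU on the tRNA) can be sketched in a few lines. The tiny `CODON_TABLE` below is deliberately only a fragment of the standard genetic code, and the function names are assumptions of this sketch; wobble pairing at the third position is ignored.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Fragment of the standard genetic code, enough for the example.
CODON_TABLE = {"AUG": "Met", "AAA": "Lys", "UUU": "Phe", "GGC": "Gly", "UAA": None}


def anticodon(codon):
    """Anticodon written 5'->3': the reverse complement of the codon."""
    return "".join(COMPLEMENT[base] for base in reversed(codon))


def translate(mrna):
    """Read the mRNA three bases at a time until a stop codon (None)."""
    peptide = []
    for start in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[start:start + 3]]
        if amino_acid is None:  # stop codon: release the chain
            break
        peptide.append(amino_acid)
    return peptide


print(anticodon("AAA"))          # UUU
print(translate("AUGAAAUUUUAA")) # ['Met', 'Lys', 'Phe']
```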
It was just a passive messenger. If anybody ever asked you about the chicken and egg problem, we would say that DNA obviously came first and then RNA just delivered material. This has changed completely over the last three or four decades. So today I would even say there's probably close to consensus that RNA came much earlier than DNA. So in the very first organisms, the very earliest ones, you had a pure RNA world, because RNA is capable both of carrying genetic information and of reproducing it. DNA can only store it; it can't really do anything with it. But of course, for all these other reasons, because RNA is fragile and everything, I guess nature over billions of years eventually evolved DNA as a better molecule to store genetic information, because if you store your genetic information in something that's very fragile, the organism is going to be very fragile. But when it comes to these very early experiments, where you think that you can have lightning discharges and spontaneously create amino acids and everything from simple organic compounds, in those experiments you basically rely on RNA forming first. I think that a large part of this you can probably do with secondary structure prediction, and the second part you could do with crosslinks and everything. Actually, I don't know how this particular data was determined, but remember, when it comes to RNA I would largely trust secondary structure prediction. It's really good. But it's also... For a protein it makes sense to understand the structure, right? If you look at this, can you understand this structure?
No. And that's the reason why I don't know. Determining the structure is interesting as a scientific exercise, but you can't understand what this structure does the way that you can understand that a couple of helices create a binding site for oxygen in hemoglobin. And another thing: because RNA can both store information and interact with things, RNA can also start to interact with genes, and this became very popular a few years ago. We realized that small RNA fragments can silence gene expression by binding specifically to some messenger RNA molecules. So what this image is an example of is that you have a small plant here, and first you have one type of gene that you've silenced, so all the white areas here are no longer expressing color. And then the same plant when you have silenced another gene. So by targeting the silencing to different genes here, you can change which genes are expressed in the plant. The way this works is that you have a piece of DNA, and then you get a very small piece of this DNA that is turned into RNA, which you introduce into the cell. And then at some point this RNA can bind to messenger RNA molecules. But that messenger RNA molecule existed for a reason, right? That messenger RNA molecule was carrying some signal that was supposed to go to the ribosome. So you're not silencing the gene in the DNA itself; you're intercepting the messenger and preventing it from going to the ribosome. The reason why I mention this is that it was fairly recent work. It was published in 1998 by Andrew Fire and Craig Mello, and it is what you call RNA interference. This was super hot in the early to mid 2000s, because what you can basically do with this is that, rather than having the cells do this themselves, you could do this in the lab, right?
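The interception idea can be sketched as plain string matching: a short RNA can bind a messenger wherever the messenger contains the reverse complement of that short RNA. The sequences below are made up, and the function names are assumptions of this sketch; real RNA interference also involves protein machinery and tolerates some mismatches, neither of which is modeled here.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Sequence that would base-pair with rna, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def find_target_sites(mrna, small_rna):
    """Start positions in mrna where small_rna can pair perfectly."""
    site = reverse_complement(small_rna)
    return [
        start
        for start in range(len(mrna) - len(site) + 1)
        if mrna[start:start + len(site)] == site
    ]


mrna = "AUGGCUACGUUAGC"   # made-up messenger
small_rna = "CGUAGC"      # made-up silencing fragment
print(find_target_sites(mrna, small_rna))  # [3]
```

A messenger with no matching site returns an empty list, which corresponds to a gene that this particular fragment would not silence.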
So we figured we might be able to use this as a way to deliver, ideally, specific pieces of RNA and introduce genetic material, or at least in a very targeted fashion turn off some things in your genome that we might not want to be produced anymore. So for a few years, this was the big hope: we thought we could use this for gene therapy. And they got the Nobel Prize for this in 2006. But I think this is also one example of why nowadays we tend to wait a while before we award Nobel Prizes. You don't hear a whole lot about RNA interference today, because just a few years later there was something else that appeared: CRISPR-Cas. This is the world's most stupidly named system. So CRISPR is just one of the stupid names: clustered regularly interspaced short palindromic repeats. This is the way bacteria defend themselves against viruses. So what bacteria do, grossly oversimplified here: if you have a piece of foreign DNA that comes in, what the bacterium does is that, well, it doesn't recognize it itself, but you have molecules that recognize it, and then the bacterium takes a piece of this DNA and inserts it in these CRISPR regions of its own genome. And now we express our own genome, and the bacterial genome is super small, right?
So you don't have a whole lot of unused space in the genome. And that's why a bacterium doesn't have an immune defense like ours: it can't afford to have thousands of proteins dedicated to immune defense. So we're just inserting tiny regions of 10, 20, 30 base pairs in our own genome. That now means that I have a bunch of these. I don't have the entire genes, and I don't have the entire promoter regions and everything, but I just have small pieces of your horrible foreign DNA that I can express. So what happens later on, when this foreign DNA comes in again, is that I can basically bind to that foreign DNA and intercept it and make sure that at least your virus doesn't start producing protein in my cell, in my bacterium. Again, this is grossly oversimplified. But first, it works. It's a super cool proof of how the immune system in bacteria works. And Jennifer Doudna at Berkeley and Emmanuelle Charpentier, who at the time was in Umeå in the north of Sweden and is now at a Max Planck Institute, discovered this in 2012. And initially this was entirely based on understanding how the immune system of very simple organisms, in particular bacteria, works. But they were also smart enough to patent this, because they realized this is going to be the super cool way of doing gene therapy in the future, because you can now use a very small plasmid to transfect a cell. So you have this Cas9 protein, the CRISPR-Cas9 system, and then you have a very small piece of DNA that we would like to insert and introduce. So this guide RNA that we have will bind to certain parts of your genome that we would like to edit, just the way it would bind in the bacterium, and then, in a targeted fashion, we will open up this part, and then we're basically piggybacking on all your normal enzymes and everything, which are now going to cut this DNA. And if I have my new DNA that I would like to introduce in exactly the right place, I basically have the normal repair mechanisms in your cell, which are now
helping these super small RNA pieces insert my replacement gene here. So this is a super efficient, really cool way of performing gene therapy. So do you see the difference? Turning off an existing gene that you're producing too much of is one thing. It's certainly useful. But here I can introduce an arbitrary new gene in your genome. We're not talking about expressing something in a bacterium in a Petri dish that we can just throw away. I could do this in a patient. So if a patient is now deficient in, say, hemoglobin (that's a bad example, but I'll use it anyway), I could introduce a new hemoglobin gene, and then we use this bacterial system to insert it into the genome in each and every single cell in your body. You can probably imagine the amount of things that could go wrong with this, right? A few years later there was a group led by Zhang at MIT that managed to get a patent for this specifically in humans. This has now led to one of the largest patent battles happening in the scientific world, where everybody's suing everybody. Doudna and Charpentier are arguing that they proved the entire mechanism: it's kind of obvious that if you can do it in the bacterium, it's going to work for a human too. MIT on the other hand, and the Broad Institute and Eric Lander, are claiming that it's not at all obvious, that it's completely different. We're talking about tens or hundreds of billions of dollars of potential revenue here; everything about modern gene therapy is going to be based on that. And I have no idea who will win, but I suspect that both patents are going to be valid and the companies are going to keep suing each other. My prediction is that only Doudna and Charpentier are going to get the Nobel Prize for this, within 10 years, and Zhang will not be included. Because that's the Nobel committee. Remember what I said: the Nobel committee occasionally sends these implicit messages. They don't give a motivation. They can award it to up to three people.
I bet they're only going to award it to Doudna and Charpentier. That's pretty much everything I had, with one minute to spare. Ha, that's pure coincidence. As I said, spend a couple of days studying. I don't have any study questions for this, because we won't meet until next Thursday. Do mail me if there is stuff that many of you have questions about; I might record a small piece and put it online, or just write some things. Otherwise, next Thursday, the entire morning. And this is important: I will show up here at nine. If there's nobody here at nine, I will leave again at ten past nine and happily head home and assume that you didn't have any questions.