All right, since there were three brave souls left here in the room, I said we might as well go through the study questions and record them, so that the people who watch the recording later get them too. Number one: what is the difference between accuracy and precision? With precision, your values are centered in some region, but you are not sure it is the right region; you are just getting very similar results. Yes, and you probably heard this in the recording, but precision is really about how repeatable your values are. I might be a lousy shot, but I'm lousy in the same way every single time: I always aim too far to the lower right. And that actually matters. It might sound strange that it's useful to be consistently bad, but systematic errors like that are usually fairly easy to deal with, because we can adjust for them. The obvious example: if I know I'm always aiming too much to the left, what do I do? I start to aim slightly to the right of the target, and then I'm fine. The other part was accuracy, and accuracy is about whether the average of your results corresponds to the true value. Precision, in contrast, is what relates to the spread, to the standard deviation: if you have a very high standard deviation, the values can be all over the place if you just draw one. The point is that as you draw more and more values, if my accuracy is good, then on average, if I take 10,000 samples, the average of them will be really close to the true value, even though an individual sample can be completely off the chart.
So if you just draw one sample, you can't really say anything about accuracy or precision. But here's an important point: even if you don't know what the right answer is, you can say something about one of them. Accuracy or precision, which one? Precision, I guess. Right, because precision we can get just from how spread out our values are, even if I have no idea what the right answer is. So it turns out that precision is fairly tractable: many of the things we go through here with standard deviations are tools for dealing with precision problems, for knowing when I have enough data to pin a value down. The systematic errors, on the other hand... the problem with systematic errors is that by far the worst ones are the ones we don't know we have, and they are very difficult to correct for. If I know this thermometer is not calibrated, of course I can go and get another thermometer. But maybe we have seen, in cryo-EM for instance, that we get worse results when we record in late summer, and you don't really want to wait six months before you record. So it's not just that the two kinds of error have slightly different properties; they are also harder or easier to correct for. On the other hand, if I have a fairly good idea that I at least have high accuracy, but I know my precision is lousy, how do you fix that? You get a more accurate instrument? No, accuracy is not the problem here. Accuracy means that on average the results will be good, right? But it's a shotgun swarm: the individual results are all over the place. Exactly, and that you can fix by just collecting more and more data, which is also what we did in cryo-EM. Remember, the individual images were super noisy, but by collecting more data I can compensate for that.
But this, of course, is important: whether you have a problem with your precision or your accuracy, you're going to have completely different recourses to fix it. So, what is the expectation value, and how is it related to the average? The expectation value is roughly what things will average to over many tries. That's one way to approximate it, but there's a better definition. Let me give you an example. We typically talk about a stochastic variable or stochastic process, an underlying process that actually does exist. What is the average height of males in Europe? The key thing: you don't know. But there is an average height of males in Europe, whether you know it or not, right? There is an answer; it's just that you don't know it. Same thing if I have 500 samples upstairs and I'm going to measure the concentration of salt — or that's a bad example, because that's only a few samples — but there is some underlying process. What is the decay rate of uranium per second? Embarrassing to admit, but I would have to look it up. The point is: there is a value to the process whether I know it or not. And in life science, the problem is in general that you don't know what the value is. Even for the height of males: the distribution is not strictly going to be Gaussian either, but there is some value in the middle, and we have no idea what it is. That is the value I call the expectation value: the value you expect to get in the limit of an infinite number of tries. You usually write it with square brackets, E[X], almost as if it were a function.
So X is my stochastic, my random variable, and E[X] is the average value I expect to get after an infinite number of tries. But I don't have an infinite amount of time, so I'm not going to draw an infinite number of values. If we would like to estimate, say, the average salt concentration of some salty water, what I can do is collect 10 samples, or 500, or 5,000. And of course, the more samples I collect, you instinctively feel, the better I'm going to be able to estimate this value, right? The actual average I take over 500 actual samples, or 50 billion, I use as a way to estimate my expectation value. In some cases you might have a smarter way to account for systematic errors, a fancier way to estimate your expectation value. Take a thermometer: I know from experience that this thermometer is bad, it always shows two degrees too much. What would you do? You would subtract two degrees from your measurement values, right? Then you're not strictly taking the average of your measurements; you're correcting them, because that way you get a better estimate of your expectation value. So the expectation value is the hidden thing we're after but can't measure directly, and the average is what you actually get from your experiments. And it's not that the average over 500 samples is somehow more of an average — the average of one sample is also an average. But the average of one sample is not going to be a very good estimate of the expectation value, while the average over 500 samples hopefully will be a much better one. So that was the expectation value. What's the standard deviation, then?
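The idea that the sample average estimates the expectation value, and gets better with more samples, is easy to see in a few lines of Python. This is a minimal sketch with made-up numbers (a hypothetical process with true mean 4.2), not anything from the lecture's own data:

```python
# Sketch: the average of n samples estimates the expectation value E[X].
# TRUE_MEAN plays the role of the hidden expectation value we never see.
import random

random.seed(1)
TRUE_MEAN = 4.2  # hypothetical "true" value of the underlying process

def sample_average(n):
    # Draw n samples from a Gaussian around the true mean, return their average.
    draws = [random.gauss(TRUE_MEAN, 1.0) for _ in range(n)]
    return sum(draws) / n

for n in (1, 10, 10_000):
    print(n, round(sample_average(n), 3))
```

With one sample the "average" can be way off; with 10,000 it lands very close to 4.2, exactly the behaviour the lecture describes.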
It's how much it is normal for a sample to deviate from this. Right, and do you see the connection here? We are again talking about the underlying distribution. How spread out are the heights of males in Europe? How likely is it to get something far below or far above the average? There is some standard deviation to this distribution; I have no idea what it is, but it exists. And — this goes a bit beyond the course — we can estimate it by drawing samples and seeing how spread out the samples are. That means there are actually two slightly different standard deviations: one when I actually know what the expectation value is, and another when I have to estimate it from the samples. So you might have seen in some books that occasionally you divide by n and occasionally by n − 1. That has to do with these details, and I'm not going to go into it in the course. But the standard deviation itself is a property of the actual distribution: if I draw one sample, how likely is it to be far below or far above the expectation value? And "draw a sample", by the way, is the usual formulation in mathematical statistics for making one measurement, getting one piece of data. If I draw a sample one million times, I'm repeating the process one million times. But when I'm doing this the one-millionth time, how is that different from the first time I did it? Not at all, right? I can do it 100 billion times; each individual sample still has the same distribution. So the standard deviation is a property of each sample, and we try to estimate it by calculating it from the samples.
And if you just have one sample, you can't say anything; you need at least two. But you can probably imagine that with five samples your estimate of the standard deviation is going to be pretty lousy, while with five million you can make a pretty good estimate of how the data are distributed. And then the final part, the standard error, is what? How confident you are in the value you get. So here's the setup: we had the expectation value, the true average that we don't know. Then let's say my average so far is here; this is x-bar, the sample average. Because we didn't have an infinite number of samples, in general you're never going to land exactly on the true value. And the obvious question you want to ask yourself as a scientist is: how close am I? How close is it reasonable for me to expect to be? If this is a very broad distribution and you only have three samples, you are likely to be quite far away. If you've taken five billion samples, you are hopefully much closer. But how much closer? Just knowing you're "closer" doesn't really help. Because if your advisor, in a few years when you're a PhD student, asks you to measure this with an accuracy of, say, one kilocalorie per mole — should you do five samples, five hundred, or five million? It might influence your summer activities a bit. So the point is not just having a general idea of the shape, but being able to say how far away you are. And that is what we describe with the standard error: the average is my current estimate of the expectation value, and the standard error is how good I expect that estimate to be. So I might say that I think the average here is 4.2 plus or minus 0.2.
And then I'm essentially saying something about the likelihood that the true value is much more than 0.2 away — this too is a probability distribution, right? What is the probability that the true value would be, say, 3 when I said 4.2 plus or minus 0.2? I'm not going to confuse you more there. Basically, the standard error describes how accurate we expect the average to be as an estimator of the expectation value. That brings us to the fifth question: how do sigma and s vary with the number of samples, and why? Sigma doesn't vary, because it describes the underlying distribution of the population. Yes. Sigma describes a property of the distribution, and in particular it's a property of a single sample: if I pull one sample from this distribution, how do I expect it to behave? And by definition, if I pull one million samples, I'm just doing the experiment one million times; sigma is still a property of each sample. s, on the other hand — how does that change? We don't prove it here, but s drops roughly as sigma divided by the square root of n, the number of samples, at least if they're independent. Why does it make sense that s drops as n increases? Because we have more and more data, right? Yes, but what does more data help you with? To get a better estimate. Exactly, because we are estimating the expectation value from the average, and the more data points we have when we calculate that average, the more confident we can be that the average is close. And this way we can actually calculate how good it is. If I have some way to estimate sigma, then we can say how many samples you would need to draw to get to one kilocalorie per mole, and that tells you how much data you need. There is a curse and a blessing here.
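The claim that s drops as sigma over the square root of n can be checked numerically: repeat the "take n samples, compute the average" experiment many times and measure the spread of those averages. This is a quick sketch with arbitrary parameters (sigma = 1, zero mean), not the lecture's own data:

```python
# Check that the spread of sample averages scales as sigma / sqrt(n).
import random
import statistics

random.seed(2)
SIGMA = 1.0

def spread_of_averages(n, repeats=2000):
    # Run the n-sample experiment `repeats` times; return the spread of the averages.
    averages = [statistics.fmean(random.gauss(0.0, SIGMA) for _ in range(n))
                for _ in range(repeats)]
    return statistics.pstdev(averages)

for n in (4, 100):
    print(n, round(spread_of_averages(n), 3), round(SIGMA / n ** 0.5, 3))
```

Going from n = 4 to n = 100 is a factor of 25 more data, and the measured spread shrinks by about a factor of 5 — the square root in action.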
What is the curse? That you can change the number of samples to improve things — which is also the blessing. The curse is that if you need 10 times better precision, you need 100 times more samples. By the time you realize, oh, I actually need a factor of 1,000 better precision here, you need a factor of one million more data. So it's not as simple as it looked when we first discussed accuracy and precision: "oh, that's not a problem, I'll just collect more data." If you're a factor of 1,000 off, you're going to need a factor of one million more samples, and at that point, yes, I would probably start looking for a better machine that provides a better measurement. It's superficially trivial — just add more data — but in practice the square root kills you; that much data is virtually impossible to collect. And sadly, it's true: if you don't see anything at all, the likelihood of seeing it just by collecting more data is very low, because if your first data is pure noise, a factor of 10 is not going to be enough. You maybe need a factor of 100, and then you need 10,000 times more data. The good thing is that, in the cases where we actually do see something, you can learn to use the same square root in your favor: because the standard error varies slowly with n, you can frequently get by with a much smaller n than you might think. Say you did this once and realized that for this type of measurement the signal is usually pretty clear, and you usually collect 100 data points.
But now you're in a bit of a rush, because you're going to defend your thesis in six months and you really need to get this paper out. What could you do? Reduce the number of samples. You can probably reduce it by a factor of 10, right? And if you reduce by a factor of 10, that only costs you a factor of about 3 in your signal-to-noise ratio. You're still going to see the signal, and you just saved a factor of 10 in the amount of data you need. So, that was sigma and s. Superficially it's a tiny difference, but you see that they behave in completely different ways. Which one do you see used in papers all the time? S. Well, it should be s. The sad part is that half the scientists have no idea of the difference between s and sigma, so they will write sigma. You're completely right that that's wrong: you should always quote s. But they just see an average and a plus-minus something and think they can put the standard deviation there instead. And you're laughing, but trust me, the fraction of papers where people do exactly that — it's sad, but it happens. That's why it's important. You will even see it in figure legends: data points reported as value plus or minus SEM. Do you remember what that stands for? Standard error. The standard error of the mean. And I think that's partly a way to show that you actually know what you're doing: that you really have calculated the standard error of the mean, not just a standard deviation. So is the only way to change the standard deviation to improve the equipment we are using? Well, not just the equipment, right? For instance, we collect a lot of data on frog eggs. How reproducible do you think one frog egg is compared to the next?
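The planning argument above — how many samples do I need for a target precision — is a one-line inversion of s ≈ sigma / √n. A small sketch with hypothetical numbers (a sigma of 10 in the same units as the target):

```python
# Invert target = sigma / sqrt(n)  =>  n = (sigma / target)**2.
# Numbers here are hypothetical, just to show the square-root curse.
def samples_needed(sigma, target_sem):
    return (sigma / target_sem) ** 2

print(samples_needed(10.0, 1.0))  # 10x better than one sample -> 100 samples
print(samples_needed(10.0, 0.1))  # 100x better -> 10,000 samples
```

Each extra factor of 10 in precision costs a factor of 100 in data, which is exactly why "just collect more data" stops being practical so quickly, and why relaxing the target by a factor of 3 saves you a factor of 10 in samples.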
Well, it depends: the eggs I got this time might have been harvested from the same frog, but the eggs I get next week will be from a different frog and might have been harvested by another person. So it's not just a matter of buying a new microscope, in my case. We frequently try to do one whole batch on eggs from the same frog, and ideally, if I'm going to make two measurements, maybe both on the same egg. You can't make too many measurements on the same egg, because eventually it goes bad. But try to remove any systematic or unnecessary changes between experiments. So yes, technically it's noon and you could go and have lunch and come back and do the other experiment afterwards. Or maybe it's worth doing that second batch on the same cell, where you're sure nobody else touches the microscope. Because after lunch, who knows — you might be sitting in a south-facing room, the sun comes out, and suddenly it's three degrees warmer in the room. Will that influence your results? It very well might. That can of course affect both the precision and the accuracy, but try to remove anything that changes. Better equipment is always nice, but you can do more than you think just by manually avoiding anything that changes in your experiment. So, what is p-value hacking? You did several measurements of things that you assume are unrelated, but they are in fact related to each other. Yes — though it's even simpler than that. You're familiar with E-values from bioinformatics, right? What does an E-value of 0.01 mean? That we are 99% confident? Roughly — formally, it says the likelihood of getting this result by chance is just 1%.
Which is of course low, and fine in most cases. But the point is that if you go around looking for results, sooner or later you will see outliers, right? You will eventually have an accident happen on Friday the 13th. It doesn't necessarily mean there was something special about Friday the 13th; it happened on that Friday, and you noted it because it was Friday the 13th. (It doesn't mean the opposite either, that it's actually a safer day because people take precautions.) But if you now do 20 experiments, you can calculate this properly. Let's do the exercise from the lecture: say there is only a 5% chance of something happening by pure chance. The likelihood that it does not happen in the first experiment is 0.95, right? One minus 0.05. The likelihood that it doesn't happen in the second experiment is also 0.95, and in the third, also 0.95. Continue that 20 times — I can't use my computer right now, but I can use my phone — and the likelihood of never seeing this result by chance in any of our experiments is 0.95 raised to the power of 20, which is about 0.36. So the likelihood that it does happen in at least one of the 20 experiments, that we see a spurious correlation, is 1 minus that, or about 0.64. Almost a two-thirds chance. You can handle this properly statistically, and you can probably see that the more experiments like this I line up, the smaller that 0.95-to-the-n becomes and the larger the chance of a spurious hit. Sooner or later you will find rare things by chance. So the point is: you can't go around looking for things and, once you've found something, decide that that was what you were looking for all along.
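The multiple-testing arithmetic above fits in a few lines of Python, using exactly the lecture's numbers (5% per experiment, 20 experiments):

```python
# Family-wise probability: chance of at least one "hit" by pure chance
# when each of 20 independent experiments has a 5% false-positive rate.
p_single = 0.05
n_experiments = 20

p_none = (1 - p_single) ** n_experiments  # no spurious hit in any experiment
p_at_least_one = 1 - p_none               # one or more spurious hits
print(round(p_none, 2), round(p_at_least_one, 2))  # -> 0.36 0.64
```

This is why "I ran it until something was significant" is not a 5% risk but closer to a coin flip in the wrong direction.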
So the problem with p-value hacking is really that you're putting the cart before the horse. You have to decide before you test: what am I testing for, what are all the things I'm going to observe, will I make 5, 10, or 500 experiments? You have to decide that before you do it. You can't keep making more experiments until you see the result you're looking for. The "hacking" part is that you pretend you only did one experiment, when what you actually did was keep running experiments until you got the result you were looking for, and then pretended that was the only experiment. And you laugh, but this happens in more places than just the paper I showed you. And it's subtle: you're trying something, right, and it doesn't work out, so you try again. I'm not going to say that's a mortal sin, and sometimes it's reasonable — you realize, oops, I spilled some sulfuric acid on that one; okay, you can probably drop that sample and try again. But you have to be aware of this yourself and be critical, because it matters how many times you do the experiment. It's statistics, and we are all biased. So how do you avoid falling into the p-value-hacking trap? You first have to know what your question is and plan the number of experiments before doing them. Exactly. You have to plan the whole setup and, in particular, the number of experiments you want to do, and then you're not allowed to change it. This happens now and then even in clinical trials: they set up an ambitious clinical trial with lots of participants, and then the results come back and they have to say, sorry, it's inconclusive.
Because the question they ask is, of course: can we, with 95% confidence, show that my new anti-cancer drug is better than the previous drug on the market? And if you didn't have enough patients, the answer may very well be: based on this number of patients, we can't say. Then you need to start over from scratch, an entirely new round — because you can't decide that as long as the result is negative, I will keep increasing the number of patients, but if it's positive, I'm happy and I stop. Then you're instantly biasing yourself toward positive results. And it's hard; this is why there are professorships in mathematical statistics. So if you're ever going to do a very expensive study: there are lots of places that have statisticians as consultants. Use them. Go and ask — I bet KI has them. The reason they exist is that you don't want to come back two years and 20 million euros later, because that's when people usually come to the statisticians with their data asking for help, and the only thing the statisticians can say is: you should have talked to me two years ago. They're not magicians who can get valid data out of bad experiments. They can help you with the planning, but if you ignore the planning, they can't help you after the fact. I don't expect you to become mathematical statisticians, but this part you have to understand: there is a limit to your knowledge. There is a limit to my knowledge, too. If I had to do a very expensive study, I would ask. And then we moved on to DNA. What were the structural components of DNA? The phosphate group, the pentose sugar, and the nitrogenous base. Yes — and there was one important detail: you had nucleosides and nucleotides, one with phosphate and one without.
A nucleoside is without the phosphate and a nucleotide is with the phosphate. And I was not going to just tell you — well, now you've heard it. But if you don't know the difference between nucleoside and nucleotide, this is the time to look it up, because if I simply tell people, they only think they know it. Look it up; you have to know it. It's a boring, trivial thing, but it's so embarrassing if you make this mistake five years later in your career. Chromatin, histones, and nucleosomes. So histones are just proteins that stabilize the DNA? And how do they bind the DNA? Through charge. Right — lots of things that interact with DNA use charges. Why? DNA is very negatively charged, due to all those phosphates. You can even change the structure of DNA by titrating it with positively charged ions. Same thing with histones: the easy way to get something to attach to DNA is to put positive charges on its surface. The DNA loops around the histones, and what do you then get? Nucleosomes — the beads on a string. Then you add layer upon layer of organization and end up with chromatin, and it's chromatin that eventually forms your chromosomes. So, how is RNA different from DNA? It has the 2'-OH that DNA lacks. And what does that mean for RNA? It's much more flexible. The names of the molecules are pretty telling: the D in DNA stands for deoxyribose, which tells you exactly what is missing. RNA is much more flexible, and that's also why you have such diversity in RNA secondary structure. DNA, on the other hand, is pretty much always the regular double helix — there are A, B, and Z forms, but we're not going to go through that minor detail. So what is the stability of RNA? Super low. Yes.
And I think most groups hate working with RNA, because you need to keep it cool, keep it on ice. Then you take your small test tube and you're going to pipette. What are you holding the tube with? Your fingers — and what are they doing? Heating it. So as I'm holding this tube, I'm degrading the RNA. So you learn to grab the tube by the part where there's no RNA, or better, keep it on the ice and open the lid while the RNA stays cold. Now I extract, say, 2 nanoliters with the pipette. What temperature is the pipette tip? Room temperature, right? So I have a large, room-temperature pipette tip and 2 nanoliters of cold RNA. What's going to happen to the temperature of that RNA, very quickly? So you'd better be fast at pipetting too — inject it right away. It might sound so easy, "just keep it cold", but keeping it cold through the entire process is a pain. The other problem with RNA is what? Enzymes — RNases. Yes, RNases, and the name is a very good clue: if there is an enzyme called an RNase, it's doing something to RNA. So where do these filthy RNases come from? Us, in particular. And that's why there are lab benches where people get really annoyed if you just lean on them, because the touch of your hand can contaminate the entire bench, and RNases are difficult to get rid of — you basically need to use alcohol on the bench. Because the experiments take so much time, we usually have benches marked with yellow tape as an RNase-free zone: don't put anything there, and anything you do there, you do wearing gloves. It might seem so innocent, right, that I have my chemical in a canister that I've touched, because I held the canister.
So what is suddenly on the surface of that canister? RNases. And then I put it down on the table, and now there are RNases on the table. And because the amounts I work with are so exceptionally small, those RNases will destroy the samples — particularly if you're going to incubate something overnight. So keep it clinically clean. And the stability of DNA? Also low? Better than RNA, at least. In general, no, I would not say it's low; I would say it's fairly high. But remember the difference between thermodynamic and kinetic stability? For a very long time we even thought that DNA was thermodynamically stable — after all, you can get DNA from Neanderthals, mammoths and so on. DNA will literally survive for tens of thousands of years, but it will be damaged, and it becomes more and more damaged: parts of the sequence start to accumulate errors. There is no way we could revive dinosaurs, for instance; the DNA is too damaged. And that's how we ended up talking about these things. It's actually a good point when you say that DNA is less stable in the environment than it is in your cells. Why? You have repair enzymes, right? And why do you need those repair enzymes — what is damaging your DNA? Replication errors? Well, not so much replication errors — technically yes, but I think it's mostly (a) radiation and (b) free radicals: very reactive chemicals that start to do bad things to your DNA, so that suddenly a base is missing or modified. There can of course be replication errors too, but the replication machinery is so amazing that it very rarely makes errors. Radiation damage is the other obvious thing, and that's why you have the thresholds we talked about. Now it's going to sound like I'm the world's best friend of nuclear power.
One caveat here: when we try to determine what radiation levels are safe, how do you determine that? Basically, you start inducing radiation damage in a cell until you can measure that something is damaged. Or, in the case of animals, you might use Drosophila or something and see how much radiation they can take before things go really bad and you start to see genetic defects. So at some point there's some dose — call it dose D — where we're definitely damaged, very bad. The question is: what dose can we allow? This might sound horrible, but all of this has to be based on probabilities. Why on earth do we use probabilities when we talk about human lives? Well, we do it all the time — otherwise we could not allow traffic, right? We know there are going to be roughly 300 traffic deaths per year in Sweden. But the alternative is basically to stop living. There will be subway accidents, sadly; there are bike accidents; there are even people who die from cigarette smoking. In the very simple model, we assume the damage is linear all the way down to zero dose, and then we say: you know what, this low part is not so bad — there's a one-in-a-thousand chance of harm here, whatever — that we allow, and we call it the safe dose. And then we probably want to be a factor of 10 under that. Most dose limits we have are based on estimates like this. But based on what you know about the DNA repair enzymes, I would rather say that what likely happens is something like this: when you have a little bit of radiation damage, the enzymes can keep up. They repair it, and it's not going to be bad.
But at some point you start having so many errors that your repair enzymes can't keep up, and the slope gets significantly worse. Normally you're down in the low-dose region, and that's good, because it means most of our limit values are actually exaggerations: you're better off than the linear model says, because it didn't account for the fact that your body can repair the damage. Of course, if you're up in the high-dose region it's the opposite: there the radiation damage grows faster than the linear extrapolation. But this is also why it's not dangerous to have an X-ray. Even a chest X-ray is fine: even if the dose is a thousandth of a lethal dose, it doesn't mean you run a one-in-a-thousand chance of dying. At these levels your enzymes can keep up; it's not dangerous. Even taking a flight — at high altitude the radiation level is higher, so you will definitely get more radiation from flying. And if that were dangerous, I would be dead — I flew 200,000 miles last year, and I'd never dare go home again.