Okay. Please come in. So, we'll be restarting. In the next few sessions, I'll be talking a bit more about inference and learning in biological systems. I'll be picking up a bit from what Alexander has been talking about, and I'll come back to information theory, which he's already mentioned. But first, I'd like to make sure we're all on the same page when it comes to inference and learning, by reminding you of, or maybe teaching you, some of the basics of learning, independently of biological systems, and then seeing how we can use it in some concrete biological examples. I'll start by just introducing a very simple concept, which is Bayes' theorem. So, this is going to be the easy part of today's lecture. Let me ask you a simple question. Do you recognize this guy? Yes? Okay, that's Sherlock Holmes. Sherlock Holmes is actually 40 years old, and at his age, you have a fairly high probability of getting prostate cancer. So yes, he's that old, he has that risk. All the numbers I'm going to give are fake, so don't take them seriously; they're just to make the point. Now, Dr. Watson, who as you know is a doctor, comes to Sherlock Holmes and tells him: look, your test for prostate cancer came back positive. That's pretty bad news, right? But here's what we know about the test: when you do have prostate cancer, there's a 90 percent chance that the test will come back positive. On the other hand, Dr. Watson also said that when you don't have prostate cancer, there's a 5 percent chance that it will still come out positive. So essentially, the test has 95 percent accuracy when it comes to negatives and 90 percent accuracy when it comes to positives. Now, the question to you is: should Sherlock Holmes be really, really worried? Does he have prostate cancer? What do you think?
I mean, what's the probability that he has prostate cancer? I know the answer, but I want some wrong answers first, otherwise there's no game, right? So what do you think? You know there's a trap, so you won't fall for it, right? You're too smart for that. Any idea? Okay, nobody wants to take a risk. So, the answer, of course, is not that he has a 90 or 95 percent chance of having prostate cancer. Let's break this up a bit more, and that will be a way to introduce Bayes' theorem. I said 99 percent of people don't have prostate cancer; I'll call that A minus. And 1 percent do; I'll call that A plus. Then there's the test: if the test is positive, I'll call that B plus, and if it's negative, I'll call that B minus. Even if you don't know Bayes' theorem, you can think about it in terms of a decision tree, in the following way. First I ask: do I have cancer or not? With a 99 percent chance the answer is no, and with probability 0.01 the answer is yes. Then I take the test. If I don't have cancer, in 95 percent of the cases I'll get a negative test, and in 5 percent of the cases I'll get a positive test. And I do the same thing on the other branch. So these are all the possible outcomes. Now, Sherlock got a positive test, so he is in one of the two positive-test situations. If I want the total probability that he does have cancer, I write the probability of A plus and B plus together, divided by the total probability of getting B plus. So I plug in the numbers: the numerator is 0.01 times 0.9, the probability of having cancer times the probability of a positive test given cancer. And in the denominator I also need to add the probability that he didn't have cancer, 0.99, times the probability that the test came out positive anyway, 0.05.
So, you do this simple algebra, and if you do that, you find about 15 percent. That's much lower than you would expect intuitively if you had thought the accuracy of the test is 90 or 95 percent. In fact, because so few people have prostate cancer, you end up with only a 15 percent chance of having it, which is still bad. It's much larger than if you hadn't taken the test, or if you had had a negative test, but it's still relatively small. So, what is Bayes' theorem? Bayes' theorem says the following. This is your true status, whether you have cancer or not, and this is what you observed, the test. Bayes' theorem tells you about this probability: here I was interested in the probability of having cancer knowing that I had a positive test, but it's more general than that. What I've done here is take the joint probability, which I implicitly broke up as the probability of having the condition, A plus, times the probability of getting a positive test given that I have the condition. That's this term. And the denominator you can view as a normalization, because I need to sum over all the possible conditions to get the total probability of a given outcome of the test. Yes? Oh, that's because I got it wrong, thank you. Okay, so this thing here is what I thought, before I did the test, was the prevalence of the condition in the population. In other words, it's the probability of having the condition if you don't know anything else, if you haven't taken the test. This is what we call the prior. It's your prior knowledge, your prior belief that you may have had the cancer in the first place. And this thing here is, if you like, your evidence: the information given to you by the test. This is called the likelihood.
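The decision-tree arithmetic can be written out in a few lines. Here is a minimal Python sketch using the lecture's fake numbers (1 percent prevalence, 90 percent true-positive rate, 5 percent false-positive rate):

```python
prior = 0.01            # P(A+): prevalence of the condition
p_pos_given_yes = 0.90  # P(B+ | A+): test positive when you have it
p_pos_given_no = 0.05   # P(B+ | A-): false-positive rate

# Normalization: total probability of seeing a positive test,
# summed over both branches of the decision tree.
p_pos = prior * p_pos_given_yes + (1 - prior) * p_pos_given_no

# Bayes' theorem: posterior probability of cancer given a positive test.
posterior = prior * p_pos_given_yes / p_pos
print(round(posterior, 3))  # 0.154, i.e. about 15 percent
```
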
And you have to understand it this way: it's the likelihood that you got a positive test, or any outcome of the test, given the true condition. That's what the likelihood means. The denominator is usually viewed as a normalization; it doesn't really have a name, because you can always normalize by summing over all possible values of the unknown variable. And this is called the posterior. Why posterior? Because you had your prior belief, what you thought in the beginning; then you add your evidence to that, by a simple multiplication; and the result is your posterior belief. This is what you know now that you've taken the test. So, you can see how you can generalize this. For instance, suppose Sherlock took a second test, assuming the second test is independent, because he wants to make sure he really has it. Then you would have a second test, which you would call C. You could make a decision tree where you add new branches here, and in the Bayesian framework, you would just multiply by another factor. Sorry, I cannot hear. In the first example? Yes. Yeah, I made it asymmetric on purpose, because the probability of being wrong depends on your condition. This would be called a false positive, and this would be called a false negative. In general they can be different, and in order not to confuse you, I made them different, because these really are different probabilities. No, this and this add up to one, but this and that don't have to add up to one. This is the probability of getting it wrong if you do have the condition, the 10 percent, and this is the probability of getting it wrong if you don't have the condition, the 5 percent. They don't have to sum to one. So, let me give you another example of how to use this kind of thing. I should mention that in many, many cases, you don't really have a good prior; you don't really know what to put for the prior.
So, you don't have prior knowledge of what to expect. In that case, it's quite common to take that term to be flat, in other words a constant. You just take it out of the equation, and you write that the probability that something is true, given what you saw, is just proportional to the likelihood. In fact, maybe you don't realize this, but we do this all the time, even in physics, and I'll give you an example. The general situation is: you have some model of the world, and you see some data. What you really want is to take the data you see and reconstruct what the world is. This is a good application of Bayes' theorem: you write that the probability of your model, given the data, is just proportional to the likelihood that the data was generated by the model. Let me give you a very concrete example. Imagine you have a coin, and imagine it's a biased coin. You do ten coin tosses, let's say, and you get three heads. You don't know anything about the bias of the coin. So the question to you is: what do you think the probability of getting heads is, given this outcome? What would be your naive guess? Of course, these are very small numbers, so it's not going to be very significant, but what would be your naive guess? You got three heads out of ten coin tosses; what do you think the bias of the coin is, the chance of getting a head? Sorry, one at a time. Thirty percent? What do you say, ten? One zero, ten percent? Oh no, that's the number of possible outcomes, okay, right. We're getting there. Okay, so let's call p the bias of the coin. What we're going to do is something similar to what you're suggesting, I think.
We're going to write down the probability that we got this outcome given the bias. Here, what I call the model is just p: the model of my coin is just its bias. That's the state of the world, if you like, that I would like to learn about. And this is my outcome, what I see, and I'd like to invert that. So I write the probability of n given p, and that's just simple combinatorics: N choose n, here 10 choose 3, possible combinations, and then p to the power n, here p cubed, times one minus p to the power N minus n. I could just put the variables here; it's easier. I have big N choose small n because of all the possible positions where the heads could occur in an ordered list. So now I'm going to use Bayes' theorem, but I assume I have no prior knowledge of p. I just say that the probability of p given n is proportional to that, and now I want to read it as a function of p. This binomial factor doesn't depend on p, so I'm just left with the rest. This I'll call the likelihood. Now, if I want to ask what is the most likely p that created this outcome, all I have to do is maximize this likelihood with respect to my unknown, p. This procedure is known as maximum likelihood, and it's a very important concept in any sort of machine learning. All you do is look for the p that maximizes L. In practice, it's much more convenient to maximize the logarithm; you'll see why in a second. If I write down the logarithm of that quantity, it's simply n log p plus N minus n times log of one minus p.
And we do it this way because it has some nice scaling properties, which we'll see in a second. So, if I maximize this likelihood, can anyone tell me what the result will be, without doing the calculation? Fifteen? Fifty? Why fifty? I think you say fifty because you have a prior: you think most coins are not biased. But here we're not putting in a prior, because we're doing maximum likelihood. We don't know anything about the coin; for all we know, it could be as biased as it could possibly be. You can actually do the calculation: take the derivative of log L with respect to p, and this is what you get, p star equals n over N, so 30 percent. It's very naive, in a way, because as you know, even if the coin were not biased, you could still have gotten three heads. So one has to be a bit more careful than that, and maybe also consider the possibility that the most likely value is not the true one. When people say they do things in a Bayesian way, they can mean two things. They can mean that they really think the prior is important; here, we ignore the prior. But the other thing is that you want to consider the full probability distribution of this quantity. So you can do this. Here I'm plotting simply this quantity. You see, it has a peak at exactly 0.3. Its average is actually one third, but that's not really important. It also has a large width. So when you said fifty, fifty is not completely crazy; it's not really in the tail of the distribution, it's quite possible as well. So one thing we would like to do, for instance, is to see how much uncertainty I have in my estimate. I said 0.3 is the most likely value, but what are the fluctuations, so to speak? Of course, you could do some sort of...
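As a sanity check on the maximum likelihood answer p* = n/N, here is a small sketch that scans the log likelihood over a grid of biases instead of solving the derivative condition analytically:

```python
import numpy as np

N, n = 10, 3  # ten tosses, three heads

def log_likelihood(p):
    # log of p^n (1-p)^(N-n), dropping the constant binomial factor
    return n * np.log(p) + (N - n) * np.log(1 - p)

# Brute-force maximization over a fine grid of possible biases.
grid = np.linspace(0.001, 0.999, 999)
p_star = grid[np.argmax(log_likelihood(grid))]
print(p_star)  # close to n/N = 0.3
```
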
You can do a fair amount of algebra with this simple form, but I'll show you a way that's a bit more general, which can help you calculate the fluctuations in a more general setting. The way you do it is to rewrite this in the following manner, which is of course trivial: the likelihood is the exponential of the log likelihood. Now look at the log likelihood, which I wrote here; I can rewrite it in this manner. And what you see is that, in terms of scaling, when you have a large number of experiments, where N is the number of experiments you did, the number of coin tosses, these terms here are of order one, so the overall log likelihood scales like N. The log likelihood, if you like, is extensive. What that means is that I can do an expansion of it. This is sometimes called the saddle-point approximation, though here there's no integral, so it's a bit simpler. I say that this is approximately equal to the value at the maximum, minus one half times the second derivative times p minus p star squared. So I'm doing a Taylor expansion of this thing. Do you know why there's no first-order term? I'm checking if you're following. Why is there no first-order term? You need to speak louder, I really cannot hear you from back there. Exactly, right: the first derivative vanishes at the maximum, so there's no first-order term. We're looking at the log of our function and expanding around the maximum with a parabola. There's no minus, yes; somebody else is following. And this second derivative is negative at a maximum, right? So if you truncate this expansion, you end up with just a Gaussian distribution, and therefore you can write things in this way. I've just rewritten things; I didn't really bother about the normalization.
I didn't want to bother about the normalization; I know the normalization of a Gaussian distribution is just this. And then I rewrote things in this manner so you can really see the variance appear. What this suggests is that the fluctuation between the true value and the one you guessed, where the one you guessed is p star and the true one is p, the mean square error or the variance, is given by sigma p squared, and therefore by this expression. So this gives you a sort of recipe: if you have the likelihood, you can maximize it, using the first derivative, to estimate the true state of your world, but you can also use the second derivative to calculate the fluctuations. Here, in fact, we can do the second derivative easily. What is it? Remember, p star is n over N. Here I just took the second derivative, and if I plug that in, I'm just getting this. Again you see the extensive nature of that quantity: it scales with N, so it becomes larger and larger as you do more and more experiments. And that's important, because to get the fluctuations I take the inverse, which means my fluctuations in this case, where I just replace, will be p times one minus p, over N. You sort of know this intuitively already: if you do N experiments, the uncertainty on whatever you measure, and here you're trying to measure the bias of the coin, should go like one over the square root of N. You've heard that before, right? For instance, if you run a political poll for elections, you know that with a sample of a thousand, you should expect fluctuations of the order of one over the square root of a thousand. So here, in a way, we re-derived this. We want to know the true fraction of people who would vote for X or Y.
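The recipe, curvature of the log likelihood at the maximum gives the variance, reduces here to sigma_p squared equals p*(1 - p*)/N, which we can sketch directly:

```python
import math

def bias_uncertainty(n, N):
    # Standard deviation of the bias estimate from the curvature
    # of the log likelihood: sigma_p = sqrt(p*(1 - p*) / N).
    p_star = n / N
    return math.sqrt(p_star * (1 - p_star) / N)

print(bias_uncertainty(3, 10))      # about 0.14: wide, so p = 0.5 is not crazy
print(bias_uncertainty(300, 1000))  # about 0.014: shrinks like 1/sqrt(N)
```
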
But what we have is just a finite sample, and this formula tells you what to expect the error to be. It's not exactly one over root N; it's modulated by this p times one minus p factor, but that's essentially the idea. So, let me give a second example, maybe closer to what we do on a regular basis in physics: linear regression. Is everything okay so far? Some of this you may already know, but I just want to make sure. Okay, so let's imagine we're doing polynomial fitting. Well, actually, before that, sorry, let me first start with linear regression. Let's imagine you have data points like this: you did experiments, you changed your x, and this is what you got for y. Now, you think this probably follows a linear relation. How do you find the coefficients of that linear relationship? You minimize the mean square error, right? That's the standard answer. So, can we think about this in a probabilistic, Bayesian way and see whether we find the same result? Here, you know, the 30 percent we got for the coin was also the intuitive answer, 30 percent of the outcomes were heads, but we could rationalize it with maximum likelihood. What happens here? You have to make a few assumptions; you have to model explicitly how your data was generated. Explicitly, I make the following assumption: I say I have a linear relationship between x and y. But if that were the whole story, the points would all fall exactly on a straight line, and I wouldn't have to ask how to find the best fit. So I'll add some noise: I take my n data points and add some noise epsilon i to each. The noise is essentially what makes the difference between the exact linear relationship and the points you actually observe. And I'll make an assumption: epsilon i is Gaussian.
I assume it has mean zero and some variance sigma squared. Now, what if I turn this into a likelihood? The likelihood is the probability of my data given the model. What is my model here? Really, it's the coefficients of my linear relationship. What is the data? The y i. I assume I also know my x i, so here I should also condition on the x i. And notice that the noise level itself is also part of my model, at least from the probabilistic point of view. Okay, so this is what I call my model. Now, if I write down the probability of y i given x i and the parameters of the model, it's just a Gaussian distribution. So my log likelihood, this is my likelihood, would just be this; I've just written the log of what was here. Is this clear? Maybe I should say it like this: the probability of my epsilon i is given by a Gaussian with these parameters, where epsilon i is this, and then I just replaced epsilon i by that expression. When I take the log, I look at this, and I want to maximize my likelihood with respect to the parameters of my model. Let's start with a and b. Maximizing with respect to a and b is the same as minimizing this quantity, because of the minus sign here. And what is the quantity I want to minimize? The mean square error, because this is the empirical mean square error. And here you also see why I should take the log: because that way you really see a sum. This whole thing here is n times the mean square error. So when we say we should minimize the mean square error, what we're really saying is that we think our noise is Gaussian.
You can easily see that if we had assumed the noise had a different form, a power law, for instance, then when I take the log, this would immediately turn into a different quantity to minimize, one that depends on the statistics of the noise. So mean square error really means Gaussian noise. In many cases that's very reasonable, but it's always good to keep in mind that it's an assumption. Now, just in passing: I said that in principle the noise is also an unknown of your model. So if you don't know it, you can also maximize the likelihood with respect to the noise level itself. If you do that, taking the derivative with respect to sigma squared, it's a two-line calculation, you see that sigma star squared is equal to this. And not so surprisingly, the noise that you infer is just the empirical noise that you measure: the mean square difference, and here I should have stars, between your fit and your data. There's also a nice way to think about this in physical terms. What we're doing here is trying to draw a line through the points. You can think of this log likelihood as some sort of energy, if you like. We maximize the log likelihood, so if you call this your Hamiltonian, sorry, minus your Hamiltonian, then maximizing the likelihood means minimizing the Hamiltonian. And when I write down this kind of Hamiltonian, you see I have many quadratic potentials. It's as if I took all my points and attached a spring from each of them to my line. What this minimization says, sorry, I'm not going to draw the springs, but imagine a spring for each data point, connected to the line, is that I let the springs relax. And the resting position of the springs is actually
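To see that maximizing the Gaussian likelihood is the same as least squares, here is a sketch on synthetic data; the true slope, intercept, and noise level are made-up numbers, and np.polyfit stands in for the minimization:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generate data the way the lecture assumes: a line plus Gaussian noise.
a_true, b_true, sigma_true = 2.0, 1.0, 0.5
x = np.linspace(0, 1, 200)
y = a_true * x + b_true + rng.normal(0, sigma_true, x.size)

# Maximum likelihood over (a, b) with Gaussian noise = least squares.
a_hat, b_hat = np.polyfit(x, y, 1)

# Maximum likelihood over sigma^2 = empirical mean square residual.
sigma2_hat = np.mean((y - (a_hat * x + b_hat)) ** 2)

print(a_hat, b_hat, sigma2_hat)
```
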
you know, zero in that case. And then I just let the springs relax; I'm looking for the ground state, the configuration of minimal energy, which here is the minimal mean square error. So if I set up this physical system, my line, which I let free, will eventually settle to the best fit. This is just to emphasize that many of these maximum likelihood problems can be rephrased in terms of energy minimization. And the energy analogy goes a bit further than that, because it's not just about the minimum, as we saw in the previous case; it's also about the fluctuations, the uncertainty. So this is my likelihood, not of p in this case, but of a and b. Let's assume for a second that I actually know sigma squared, because I know my experimental apparatus and how much noise it has, and I just want to know a and b. If I write this down, the way I've defined it, this is simply the exponential of minus H. And this is like Boltzmann's law, where the probability of whatever configuration is one over Z times the exponential of minus the energy over k B T. Here I can set the temperature to one by convention, but I can also say it depends on sigma squared; in that case I would have to define a different Hamiltonian, which differs from this one by just the sigma squared factor. But the idea is the following: when you have the Boltzmann distribution, you don't assume that all your configurations find themselves in the ground state. They find themselves in states depending on their energies, according to this formula. Here it's the same thing: I don't assume that a and b are necessarily equal to a star and b star, the maximum of my likelihood.
They can fluctuate, exactly in the same way I calculated the fluctuations for p in the previous example. Here I can also calculate the fluctuations for a and b. The analogy is that the uncertainty I have about a and b can be rephrased in terms of fluctuations in this stat mech setting. That's what I'm saying: it goes a bit further than just the maximum; it's also about calculating the fluctuations. In this case, in fact, the fluctuations are fairly easy. I can take this, and what you notice is that log L is quadratic in a and b. That means that in general I'll be able to write it in the following manner. My A here is just the second derivative of my log likelihood, with a minus sign, and it's simply equal in this case to that; and my B, sorry, is equal to n. You can check this by just expanding these terms and regrouping them in a squared, b squared, and the cross terms. The point is that each of these quantities scales with n again, which is important because the log likelihood itself scales with n. And the second point is that, again, when I look at this, it's just a Gaussian distribution. What that means is that I can now calculate the uncertainties, the expected errors, by just using Gaussian integration rules. I have a quadratic form inside the exponential, and I know that my fluctuations will be given by the inverse of the coefficients of that quadratic form. Is that okay with everybody? It's just Gaussian algebra. Again, the point is that these fluctuations scale like one over n, because each of these coefficients scales with n and I'm taking the inverse. So, now back to my spring analogy, yes, is that better? Yeah.
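The quadratic-form bookkeeping can be sketched as follows, assuming a known noise variance; the specific x grid is just an illustration. The covariance of (a, b) comes out as the inverse of minus the matrix of second derivatives of the log likelihood:

```python
import numpy as np

def fit_covariance(x, sigma2):
    # Minus the second derivatives of the Gaussian log likelihood
    # with respect to (a, b); every entry scales with the number of points.
    n = x.size
    H = np.array([[np.sum(x ** 2), np.sum(x)],
                  [np.sum(x),      n        ]]) / sigma2
    # Gaussian integration rule: covariance = inverse of the quadratic form.
    return np.linalg.inv(H)

x = np.linspace(0, 1, 100)
cov = fit_covariance(x, sigma2=0.25)
print(np.sqrt(np.diag(cov)))  # uncertainties on a and b
```
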
So I come back to the spring analogy. In the spring analogy, of course, you reach the ground state because thermal fluctuations are small. But now I also add thermal fluctuations and see how my line fluctuates. And to describe how my line fluctuates, I need to know the covariance matrix between the parameters that describe my line. Here I can get it simply by taking the Hessian, the matrix of second derivatives, at the maximum of my log likelihood, the minimum of the Hamiltonian. It turns out this is much more general than that. Imagine it's not just this Gaussian case or anything like that. Here everything is exact, meaning the distribution is exactly Gaussian; in the previous example, you saw how to do a truncation of the Taylor expansion. In general, if I have a set of parameters, before they were a and b, but now say I have parameters theta 1 up to theta K, so more than two, any number K of parameters, then I can always write my likelihood in this manner. I do, again, a Taylor expansion, and I can do it because the log likelihood scales with n; it's the same saddle-point approximation as before. And the general statement is that the covariance of theta k and theta k prime, which by definition is this, is given, again simply using Gaussian integration rules, by this. Okay, let me write this down. This is not so important; it's just the generalization to non-Gaussian statistics and to the multivariate case. So, this was just linear fitting. Let me continue a little bit on this. You could have thought that these points were not actually distributed on a line; it could be that your points had some sort of curvature.
In that case, you may be tempted to think it's not a line but maybe a quadratic function. How should you decide? Let me give you a cautionary tale; I'm not going to go into the details of the calculation. Say your points are like this: there's quite a lot of noise, and I don't know whether it's a line or not, but let's say you have good reasons to assume it's a polynomial; you just don't know the order of the polynomial. Then the parameters of your model are simply the coefficients of the polynomial; those are your thetas. So you can try to fit a polynomial of, okay, sorry, this is the true polynomial that generated the data: here I took a third-order polynomial and generated data using a rule similar to this one, adding some Gaussian noise. Now, if you do a linear fit, you find this. A quadratic fit, you find this. A third-order polynomial fit, this. And as you add more and more orders, you get closer and closer, let me play that again, closer and closer to your points. The problem, as you know, is that the more orders you add, the more parameters you add, and the closer you'll get no matter what. In fact, you can show mathematically that if you have, say, 15 points here, a polynomial of degree 14, which has 15 coefficients, will fit all the points exactly. Is that the right polynomial? No; here the right polynomial is of order 3. So how do we handle this in practice? How do we know where to stop? Here's another way of seeing it: I calculated the mean square error between my polynomial and my data points, and as I increase the order of the polynomial, it drops and eventually goes to zero. Now, the simplest way to deal with this is to separate your data into two sets.
One will be called the training set, the other the testing set. The idea is the following: you take part of your data, let's say half to simplify, and you use that half to fit your polynomial. Then you take that polynomial and see whether it predicts the second half of the data well, which is independent. The reason this works is that when you get into this situation, you are basically overfitting the noise: you're fitting the noise of your training data. But if you're really fitting the noise, then when you test on an independent realization of the data, the testing set, you get it completely wrong, because you were fitting the particular features of the noise of the training data. So you can test this. Now I add the testing data: in black are the points I used to fit my polynomial, and in green are new points, new experiments with the same true polynomial, which I remind you is of order three. I then see whether the polynomial learned on the black points is a good predictor of the green ones. It's the same animation as before. The point is that now I can look at the mean square error on the green points instead of the black points. For the black points, by definition, it has to go down, because I'm adding more parameters and getting more and more precise. But on the green points, as you can see, it first goes down and then eventually goes back up. Why? Because by the time the order gets to 15, my polynomial is just fitting the noise of the black points and has no predictive power for the green ones, so the error goes back up. You can formalize this a bit better probabilistically, but I'm not going to go into the details.
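The whole training/testing experiment fits in a short sketch; the order-3 coefficients and noise level below are made up, not the lecture's actual values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical order-3 polynomial (highest degree first) plus Gaussian noise.
coeffs_true = [1.0, -2.0, 0.5, 0.3]

def make_data(n_points):
    x = np.sort(rng.uniform(-1, 1, n_points))
    y = np.polyval(coeffs_true, x) + rng.normal(0, 0.2, n_points)
    return x, y

x_train, y_train = make_data(15)  # "black" points: used for fitting
x_test, y_test = make_data(15)    # "green" points: independent realization

train_errs, test_errs = [], []
for order in range(1, 10):
    p = np.polyfit(x_train, y_train, order)
    train_errs.append(np.mean((np.polyval(p, x_train) - y_train) ** 2))
    test_errs.append(np.mean((np.polyval(p, x_test) - y_test) ** 2))

# Training error can only go down as the order grows; testing error
# eventually turns back up once the fit starts chasing the noise.
print(train_errs)
print(test_errs)
```
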
This is usually a good procedure to check that you're doing things right. And here, in fact, I chose my example well: this curve has its minimum at three, which is the true answer, right? So it tells you to stop at three. Maybe you wouldn't do so badly going up to five, but after that it becomes catastrophic. So I'll skip that. All right — we'll take a break in two minutes, but let me just motivate why we went through the pain of this. This business of Bayes' law and maximum likelihood, of trying to learn about the world from data, can be relevant at many different scales, in many different ways. First of all, as scientists, right? We had the example of fitting experimental data: we have data and we'd like to learn about the world. And note that minimizing the mean square error is basically maximum likelihood, assuming Gaussian noise. So it's useful for inferring the parameters of a system from data. Of course, fitting is quite common, but as theorists we usually do it the other way around, right? We start from the model, and then we calculate whatever — correlation functions or other quantities — and then we look at these predictions and ask: do they make sense? Do they describe the phenomenon I wanted to describe? This is what I would call a bottom-up approach: you start from the microscopic equations and you calculate what happens. This is what I did when I generated the data in the previous example, this fake data here: I had my model, and I generated data with it. And in theoretical physics in particular, this is very often what we do: we start from the model, we try to solve it, we try to simulate it.
There's another way, which I would call top-down: you start from the observations. This is what you do when you fit. You start from the observations, and you try to solve for the parameters of the model that give rise to them. In the next lectures — not today — I'll show you examples of how to do this and how it's useful, especially in biology, where usually you have very little idea of what the model should be. Even in physics you often don't have a very good idea, and you use huge simplifications. But in biology you often have almost no idea: the microscopic details you know from biology are kind of useless for describing the sort of noise and variation you have in the data. So we'll see examples of that, and of how we can use maximum likelihood and similar tools to do this. I just want to mention that this is an inverse problem — it's called that because you do the inverse of the usual procedure. And Bayesian thinking has become popular even with the general public. For instance, the statistician Nate Silver, in the States, runs a website that tries to forecast presidential elections, but also sports results, based on previous observations, on polls, on all sorts of data, right? His way of thinking is that you should take into account the whole history of what happened and put all of that into your prior, then treat the polls as evidence — but you also need to include the noise there is in the polls — and then you put everything into a big model, you simulate it, and you try to back out the probability that either one of the two candidates will win. I'm mentioning this because he used very Bayesian thinking.
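The prior-plus-evidence recipe described here can be sketched with a toy conjugate-update example. Everything below is invented for illustration — the Beta prior, the poll numbers — and it deliberately omits the hard part of real forecasting, a model of correlated polling errors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Prior from "history": a Beta(a, b) distribution over the candidate's true
# support, centered on 50% with moderate confidence (made-up numbers).
a, b = 50.0, 50.0

# Evidence from one poll: 540 of 1000 respondents favor the candidate.
heads, n = 540, 1000

# Conjugate Bayesian update: posterior is Beta(a + heads, b + n - heads).
post_a, post_b = a + heads, b + n - heads

# "Probability of winning" = posterior probability that support exceeds 50%,
# estimated here by Monte Carlo sampling from the posterior.
samples = rng.beta(post_a, post_b, size=200_000)
p_win = float(np.mean(samples > 0.5))
```

Because the poll noise here is treated as pure independent sampling error, the posterior comes out very confident; modeling systematic, correlated polling bias widens it, which is exactly the "taking the noise seriously" point made about the 2016 forecasts.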
He even wrote a book to explain why Bayesian thinking is good; it's called The Signal and the Noise, and I recommend it. Using these ideas, he came up with almost a 30% chance for Donald Trump to win. At the time — I don't know if you remember — most other political websites with election forecasts gave a 99% chance of winning to Hillary Clinton, right? And because he took the noise much more seriously — especially the polling errors, and the correlations between polling errors — he could get something a bit closer to the truth. You could still say he predicted she was more likely to win, but when something bad has a 30% chance of happening, sometimes it does, right? So that's what happened here. And then — this is what I'm going to talk about next — it's also important for living organisms. We humans do Bayesian thinking, but we're not the only ones. Think about catching a flying target. You need Bayesian thinking, and the reason is the following. Even a frog, when it sees an insect, sees it with a delay. It's the same for you: say you play tennis and you see a ball coming your way. By the time the signal gets to your brain and you process it, there are a few tenths of a second of delay, right? And yet you don't have the impression that before you realized the ball was coming, it was already in your face; usually you can anticipate, right? So what does anticipating mean? It means you have a model of the world in your brain, a model of the ball moving. You know its speed, so you have some prior knowledge of where it should be next, right? And in fact, that's what you use to know where it is.
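What "a model of the ball moving" buys you can be shown with a tiny sketch. Everything here is invented — the delay, the speed — and the internal model is the crudest possible one, constant velocity:

```python
# Toy sketch, all numbers invented: the brain receives the ball's position
# with a sensory delay, but an internal constant-velocity model lets it
# estimate where the ball is *now* rather than where it was.

def predict_now(seen_pos, seen_vel, delay):
    """Extrapolate a delayed observation to the present time."""
    return seen_pos + seen_vel * delay

DELAY = 0.15                       # seconds of neural delay (assumed value)

# Ball truly moving as x(t) = 20 - 30 t; the raw percept lags by DELAY.
true_now = 20.0 - 30.0 * DELAY     # where the ball actually is
percept = 20.0                     # what you "see" without any model
predicted = predict_now(20.0, -30.0, DELAY)

raw_error = abs(percept - true_now)       # metres behind the real ball
model_error = abs(predicted - true_now)   # zero for this idealized motion
```

Without the model the delayed percept is metres behind the true position; with it, the prediction lands where the ball actually is — and a frog or a tennis player implicitly does something like this, presumably with a much richer internal model.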
And in fact, in terms of sensation, you think you see the ball here, whereas actually you haven't seen it there yet, right? For the frog this is very important, because it will throw its tongue at the insect to catch it, and to do this task precisely it simply needs a model of where the fly is going. So this is what we'll see next: why biology cares about maximum likelihood. But first let's take a short break — 10 minutes, 5 minutes.

Okay, so let's start again. I already gave you an example with the frog; let me tell you about chemotaxis. Does any one of you know what chemotaxis is? Yes? Okay, tell me what it is. Speak loud. Exactly, right? Yes, exactly. So it's not just food. Let me give you an example. This is a movie — I forgot to credit it, by the way, sorry — and I hope it works. Yes. This is from the 50s, and this is blood. What you see here is a big white blood cell called a neutrophil; here you have red blood cells. The job of neutrophils is to eat up bacteria that have been marked for deletion, right? It's basically a big monster that wants to eat anything it's attracted to. And here in black, this small dot, is a bacterium — much smaller than the cell. And of course the immune system wants to get rid of that bacterium in the blood. And you see, the cell will actually move to track the bacterium. It can do that because the bacterium gives off something this big cell is sensitive to, and therefore it can track it and chase it until eventually — yummy, yummy — it eats it up. So how does it do that? That's chemotaxis. That's also an example of sensing: you see, the big cell here is learning something about the world — that there's a bacterium out there. How does it decide where to go?
So you can view it in a Bayesian way if you like, but this is basically the task it's doing. This is of course a big cell: the bacterium here is a few micrometers, and this cell is much, much larger than that, several tens of micrometers. But in fact bacteria themselves also do chemotaxis. So let me show you a more recent experiment, from 2003 — the previous movie was from the 50s — based on microfluidics. What they did here is they had a microfluidic device where they would pump in what they call chemo-effectors. Chemo-effectors are basically what the bacteria like, so they're attracted to them. They let them flow in through here and out through there, and because of that a gradient is created: you have more chemo-effectors in the upper part of the chamber than in the lower part. And right in the middle they pump in bacteria, and they want to know where the bacteria will end up, right? So they push in bacteria, and the bacteria come out through these small channels. The idea, of course, is that if the bacteria are sensitive to the chemo-effectors, you should see more bacteria in this outlet than in that one, okay? Is the experimental setup clear? So these are the results. The curves unfortunately are not very well drawn, but this line here is when they put no attractant at all — no chemo-effector — and you can see it's perfectly centered, right?
So this is the distribution of cells as a function of the channel number — each of these channels, this is 1 and this is 21. But then, if they add the chemo-effector — and they show this for concentrations as low as 3.2 nanomolar, nano being 10 to the minus 9; it's this curve here I'm trying to outline — then it's biased towards the low channels, which means the bacteria are actually sensitive to these very low concentrations of the chemo-effector. And that begs the question of how they do that. But to see how extraordinary this feature of E. coli is — the most studied bacterium, as you know — you need to do a small calculation first. The first one is an order-of-magnitude calculation. Do you know what a nanomolar is? It's pretty abstract. A nanomolar is 10 to the minus 9 moles per liter. Now, I have my bacterium. What is the size of a bacterium? It's a micron — they're called microbes for a reason — so let's say it's about 1 micron. So if I put this E.
coli into a bath of chemo-effectors at one nanomolar, would that correspond to what I drew here, or to many, many molecules, very concentrated? What would it correspond to? Do you have any idea? It's not an easy question, I know, so I'm going to give you the answer. You can calculate this: you have moles per liter, and you can convert that into molecules per cubic meter. How do you do this? You multiply by Avogadro's number, and you multiply by 10 to the 3, because there are a thousand liters in a cubic meter. So that's a big number, but it doesn't really tell me what it is, so I can convert it to micrometers. You see, I started from Avogadro's number, which is very large, but at the relevant scale — say a volume of 1 micron cubed — I will have at most 1 molecule on average. So here I am with a bacterium on the order of 1 micron, and this bacterium is sensitive to concentrations as low as 3 nanomolar, which means something like maybe 2 molecules would be present in its body, if it were transparent. So you see, it's really quite remarkable that it can detect not only the presence of the chemoattractant, but also its gradient — differences in concentration. Because of course, if you had chosen a different setup where the concentration of the chemoattractant was constant in the chamber, you would not expect any bias; it really has to know about the difference. So now we can immediately ask: well, this looks like a difficult task for the bacterium. So you can try and calculate what's the best possible estimate of the concentration c the bug could achieve. We'll see a bit later that what the bacterium actually does is it has receptors that bind these molecules, and that will create
some activity in the cell — we'll see this in more detail in the next lecture. But all the bacterium sees is the state of its receptors, whether or not they are bound to something. That's what it sees; what it wants to know is something about the concentration. So we're back to maximum likelihood and the Bayesian framework: the bacterium wants to know something about the world — here, the world is where the food is, what the concentration of food is at a given moment — and all it has is this measurement device. A question? Which is a very polite way of saying that my writing is terrible — the question was: what's the best possible estimate of the concentration c the bacterium could achieve? It's a calculation that was done by Berg and Purcell in the 70s, and this is what we'll do now. (That's a good point, yes, the board is not high enough.) So Berg and Purcell asked themselves that question. They said: okay, I have my bacterium, so let's idealize it. Let's say it's just some object of dimension a, bathed in a solution of some substance at concentration c, and the question is how well this small object can estimate c. Do you have an intuition for what the answer should depend on? It should depend on the size a; it should also depend on c. Those are the two parameters I wrote down, so congratulations. Okay, let's start with this. The first thing we can do — it's an idealization, but the true thing will not be so different — is to assume that the measurement device of size a is actually transparent. So it's a sphere, or a cube, say, and it lets the ligands go in and out, and all it can do is count how many ligands there are inside the volume. Let's call that N — and I have to be very careful about the way I write — the number of ligands inside the volume a³. And you see, this is a stochastic quantity, because the ligands don't stay where they are, they don't stay in place; they're
in water, in solution, so this number will fluctuate. What is the average value of N, though? Do you know? You need to speak louder. Yes — the average value of N is c times a³. What are the fluctuations of N? Any idea how N is distributed? It's Poisson distributed — is that what you said? Yes, it's Poisson distributed. Right: it's like you put your molecules at random places with density c, so the number that finds itself in a given volume is Poisson distributed. Is that okay with everyone? Otherwise I could re-derive it, but there's no time for that, so I'll just take it for granted. Essentially, we don't really care about the particular form; what we really want to know is the fluctuations. And the fluctuations of a Poisson distribution — yes — the variance is equal to the mean; that's one of its features. Now, if I'm a cell, all I know is N; that's the only piece of information I have about the world, right? So — I put this in parentheses because I won't use it — you have P(N|c), and the cell wants to maximize this over c. What this gives you for the maximum likelihood estimate made by the cell is that its best guess is just the number of molecules inside the volume, divided by the volume: c* = N/a³. If you think about it, it's very similar to the heads-and-tails example: it's just the empirical frequency — here, the empirical concentration. But now, if you want to know the fluctuations of that estimate — I call that δc² — and especially the relative fluctuations, the relative fluctuations of my estimate are just equal to the relative fluctuations of N itself, simply because they're proportional: δc²/c² = δN²/⟨N⟩². And by virtue of the fact that the fluctuations of N, this δN², are of the order of ⟨N⟩, this is 1/⟨N⟩. Again, no surprise — the law of
large numbers: the error I make on my estimate goes like one over the number of observations, and here I observe N molecules inside the volume, so the error I make is proportional to 1/N. Now, this is pretty lousy if you think about it, because if you have two molecules per cell — as I said for the case of 3 nanomolar — then this number is one half: my estimate of the concentration is very, very poor, because my relative fluctuations squared are of order one half. That's pretty bad; it's unlikely the cell could find the food that way. Also, the cell doesn't actually do this: it doesn't count things, it has receptors, as I said, so it's a bit more complicated. And the cell doesn't have receptors everywhere, so this is an idealized measurement device; in practice the cell would be a much poorer one. So what else can the cell do to improve its estimate of the concentration? You need to speak very loud, because otherwise I just hear — well, it could do that, but let's say it doesn't have a prior yet, right. What you can do is wait. The thing is that in this idealized situation these ligand molecules diffuse in and out, so I add two parameters: one is the diffusion coefficient D, and the other is the observation time T. The observation time corresponds to the following situation: at some point the — sorry, not the cell — the idealized measurement device of size a has, say, three ligands in it. Then I wait some time, and these molecules diffuse out; maybe this one stayed, and a new one came in. So the contents change, and I want to make measurements that are independent of each other, otherwise I'm counting the same molecules several times. So let's say I'll do k measurements.
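Before adding time, the single-snapshot result — relative variance (δc/c)² = 1/⟨N⟩ — can be checked by simulation. The concentration and volume below are made-up values, chosen so that ⟨N⟩ = 2, the lecture's two-molecules-per-cell case:

```python
import numpy as np

rng = np.random.default_rng(0)

# Single-snapshot estimate: the count N in the volume a³ is Poisson with
# mean ⟨N⟩ = c·a³, and the maximum-likelihood estimate is c* = N/a³.
c = 2.0      # "true" concentration, molecules per µm³ (invented value)
a3 = 1.0     # detector volume a³ = 1 µm³, so ⟨N⟩ = 2 molecules

n = rng.poisson(c * a3, size=100_000)   # many independent snapshots
c_star = n / a3                         # instantaneous ML estimates

rel_var = np.var(c_star) / c**2         # (δc/c)², should come out near 1/⟨N⟩
```

With ⟨N⟩ = 2 the relative variance comes out near 1/2 — roughly a 70% relative error on c — which is why the cell has to do better than one instantaneous count.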
This is i equals 2, then a third measurement, and so on, up to k measurements, okay? So now my observations are n₁, the number of ligands in the first measurement, n₂, and so on up to n_k, and let's assume these measurements are actually independent, right? What should I do if I want to estimate the concentration from these measurements? Yes, exactly: I take c* = (1/a³)(1/k) Σᵢ nᵢ. It's the same as before — I just take the mean. If I write P(c | n₁,…,n_k) and do my Bayesian stuff, maximum likelihood — sorry, I mixed up k and i — the fact that this probability factorizes is precisely the statement that my measurements are independent of each other. And if I take the derivative with respect to c and set it to zero, I find exactly that. I won't show it, but intuitively that's what you expect. Now if I calculate the fluctuations of that estimate, my δc², it's the same thing as before, except it's now a sum of fluctuations. These nᵢ are still random variables, so to know what error I'm making here, I need the error I'm making in each of them, and I sum those up. Each of them contributes fluctuations of order ⟨N⟩, and I have k of them, with the prefactor 1/(k a³) squared, so δc² = ⟨N⟩/(k a⁶). Then for the relative error, δc²/c² = 1/(k⟨N⟩), where ⟨N⟩ — sorry, there's an average here — is the average number of molecules per measurement, so k⟨N⟩ is the total number of independent molecules I counted. The algebra may seem complicated, but it's really not; I just need to put things together, and the final result is very intuitive: the relative error I make still goes like one over the number of molecules I saw, but now accumulated over time. It's the number of measurements times the average number of molecules I see in each measurement — the total number of molecules I saw that are distinct from each other. Again, law of large numbers; you expect that, right? Okay, so I have my ⟨N⟩ — it
was given here, but I don't have my k, and I need k as a function of my two physical parameters: the time T I waited, and the diffusion coefficient D. So can you give me k? This is the last thing we'll do — don't worry, we'll soon have lunch, and if somebody answers me we can have lunch earlier. So: we had a total measurement time T, and k measurements during that time. All I need to know is the amount of time between two independent measurements; I'll call that τ. Then k is just equal to T/τ, all right? So now all we need is τ. As I said, τ is the time it takes to make a truly independent measurement: in effect, the time needed for the molecules that are here to diffuse out and for new molecules to diffuse in — I need to renew the contents of my measurement device. How long does that take? Not quite — but can you guess by dimensional analysis? I have D, I have a, and I want τ. So how do I do this? For diffusion you should have something like distance squared equals diffusion coefficient times time, a² ~ Dτ, so τ ~ a²/D. That's my guess for τ, and one can actually do a rigorous calculation — this is what Berg and Purcell did for this measuring device, computing the Green's function of the diffusion equation — and at the end of the day you get essentially the same result as what I got. But this is it; now I'm piecing everything together. I said my relative error squared is 1/(⟨N⟩k), which is τ/(a³cT), and replacing τ by a²/D — and I should say this is really only approximate — the result is (δc/c)² ≈ 1/(DacT). And you can see that things scale sensibly: the longer you wait, the more evidence you accumulate, so the better your accuracy will be; the larger the concentration, the more molecules you'll see and be able to count, so it
also improves. The faster the molecules diffuse, the more independent measurements you get in a given time, so that helps too. The a here is a bit less trivial; you can see how it comes out of the calculation. You'd think that with a larger volume you can count more molecules — that's what we had before, the error going like 1/a³ — but with a larger volume it also takes longer for the molecules to be renewed, and that's the a² in τ. It turns out the volume wins, but only partially, so I'm left with a single power of a — like the cube root of the volume, instead of the volume itself. So that's it for today. Tomorrow we'll see how the bacterium actually uses these estimates to move towards the food.
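The order-of-magnitude chain from today — nanomolar to molecules per µm³, and the final (δc/c)² ≈ 1/(DacT) — can be evaluated in a few lines. Only the 3.2 nM concentration and the 1 µm cell size come from the lecture; the diffusion coefficient and the averaging time below are assumed values:

```python
# Putting today's numbers together (working in µm and seconds throughout).
AVOGADRO = 6.022e23   # molecules per mole

def molecules_per_um3(molar):
    """Convert mol/L to molecules per µm³: ×N_A, ×10³ L per m³, ×10⁻¹⁸ m³ per µm³."""
    return molar * AVOGADRO * 1e3 * 1e-18

n_in_body = molecules_per_um3(3.2e-9)   # ≈ 2 molecules in a 1 µm³ bacterium

def berg_purcell_rel_var(D, a, c, T):
    """(δc/c)² ≈ 1/(D·a·c·T), the Berg–Purcell limit derived above."""
    return 1.0 / (D * a * c * T)

# Assumed: D ≈ 500 µm²/s for a small molecule, a = 1 µm, T = 1 s of averaging.
rel_var = berg_purcell_rel_var(D=500.0, a=1.0, c=molecules_per_um3(3.2e-9), T=1.0)
```

With these assumed values, a micron-sized detector averaging for about a second already knows the concentration to within a few percent — which is how time-averaging rescues the hopeless single-snapshot estimate.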