Hello and welcome to probabilistic machine learning, lecture number 25. This is the penultimate lecture of this course, and actually the last content lecture; the next one will only do revision. Over these 24 lectures so far, we've tried to develop together an entire framework for how to learn from data under uncertainty, by distributing truth over a space of hypotheses and then manipulating these hypotheses in light of data, under some assumptions about how the data is generated from the hypotheses. Along the way we've encountered mathematical foundations; a significant number of interesting models that allow us to describe, in a rich mathematical language, how data and latent quantities are related to each other, both in a supervised and an unsupervised way, for various data types — discrete, real-valued, strictly positive, non-negative, and so on; and also a number of algorithms, generic computational tools that allow us to build real working algorithms on a computer that can solve somewhat realistic, practical inference problems. We also saw along the way that these algorithms, while useful, do not entirely absolve us of the necessity to be a little bit creative, but together they provide us with a real toolbox to do inference, to learn from data.

What I'd like to talk about today is what happens once you've done the inference, once you've learned something from your data. Learning never happens in a vacuum. You don't collect information from data and then just throw it away. What do you do once you're done? To tie this back into a concrete example that we did together during the lecture course, here is the data set from lecture number 11: my own body weight over a bunch of years, connected to a few potentially causal lifestyle choices that had effects on my body weight. In that lecture we built a probabilistic model that inferred noise in this data, but also underlying structure. I decided to build a linear model — and by linear I mean a model which assumes that each of these lifestyle choices contributes a constant derivative to my body weight over time. To put it more bluntly, we figured out that going for runs regularly has a larger effect on the decrease in body weight per day than eating less, going to the gym, or eating vegetarian.

Interesting — so now we know something. Of course, you do this kind of analysis because you then want to decide what to do next: to maybe change your lifestyle, to take certain choices, to do certain things. This was a relatively elaborate analysis, so maybe it's a bit difficult to start with, but actually you already know what the answer is going to be, right? Let's say we had to decide between going on a diet and going for runs regularly. Both are reasonably annoying to do, but both have a positive effect. So what would you do if you were faced with a choice between just one of those two? Of course you could also do both — that's even more annoying — but let's say you decided to do only one of them. Which one would you choose? I think in this case it's pretty straightforward: you just pick whichever one has the largest expected change, right? That's probably the goal — unless there are some other kinds of losses involved, like how annoying each individual choice is.
Maybe it's more difficult to go on a diet than to go for runs, and then you have to weigh the expected drop in weight against the increase in discomfort in your life.

Let's move to a less personal and also a little more stripped-down, bare example of such decision problems — because that's exactly what these are — and maybe one that is particularly fitting to the times: one from medicine. This is actually a historically relevant example that is often discussed in this context. Imagine that you have two drugs. A patient comes in, you know what disease this patient has, you're trying to help them, and there are two different drugs to choose from. And because there have been extensive medical tests — every drug, of course, gets tested a lot — we know the effect that these individual drugs have on patients. So let's write a table, and let's say we're only concerned with binary outcomes to make things really easy. That's about the simplest possible decision problem you could have. There's an outcome for the patient, and the outcome is either zero or one: either the patient recovers, that's one, or the patient doesn't recover, that's outcome zero. And you have to choose between two drugs, drug one and drug two. Let's say if we choose drug one, then the probability of recovery is about 0.9 and the probability of non-recovery is about 10 percent. For the other drug, the probability of non-recovery is about 0.2 and the probability of recovery is about 80 percent. Everything else being equal, you know what your decision is going to be as a physician, right? Assuming the drugs cost the same, have the same side effects, and so on — what do you do? Well, of course, you choose the first one. Why? Because that number is larger. I know this is a trivial thing to say, but maybe it's a good starting point for the conversation we're going to have today.

One thing to note is that this table looks suspiciously like a probability distribution, but it isn't, because the numbers in it don't sum to one. However, along each column we do have a probability distribution — and we need to have one, otherwise this doesn't really make sense. So what we essentially have here is a pair of conditional probability distributions: the probability of recovery (I should probably call it recovery rather than outcome, so that you don't get confused between O's and zeros) given drug one, and the probability of recovery given drug two.

So let's formulate this a little better; here is a slide that says this. We can use the language of probability to encode the outcome of individual choices for a model that we might have learned from data, given some control variable — which we might call an action, or a policy, or a decision, it doesn't matter, maybe even an option — by writing down conditional probability distributions: the probability of the outcome given the action. Now, the simplest possible question is: if you have such a probabilistic model, which action do you prefer? And even that is often not entirely trivial. In this case you just have two variables, but it's not even clear why you would prefer one over the other, because if I just write down outcomes zero and one, I haven't actually told you that one of them is the preferred outcome, that one is better than the other.
In a binary setting it's often the case that one outcome is evidently the preferred one and the other is the bad one, and then it's easy: you just take whichever action gives the preferred outcome the higher probability. But let's say you have two or three or four possible outcomes. Maybe there is a second option — let's add that below — where we can't just say there is recovery or non-recovery. Maybe there is non-recovery, meaning the patient dies of this serious disease; or they recover but with side effects — overall they recover, but with serious, long-lasting side effects that reduce their quality of life; and then there's a third outcome, which is just recovery with no side effects at all. Everything's great, you just jump up again.

Now you can imagine situations that are much more complicated. Let's keep those two recovery numbers the same and imagine a few such situations. Say for one drug there is a relatively good chance of recovery without any side effects, but also a significant chance of recovering with significant side effects; while for the other, the chance of side effects is basically zero and — let's see if I can get this right — the chance of clean recovery is 0.79. Pretty good chance of recovery. Now the decision is evidently much more complicated, right? If we just look at one part of the table, one drug looks preferable, because if it works out well, it works out much better for the patient; but there is a significant chance of a very negative outcome. I guess in this kind of setting most patients would still choose that drug, but of course I haven't really told you what the side effects are. Maybe if they are really bad, then you don't want to choose that drug anymore, and you would rather take the significant risk. Who knows?

So what we need to make this decision is, in our case, a conversation with the patient and an informed decision about what to do. In the more general case of such a decision problem, we need to assign a number to the individual outcomes that isn't just a probability, but which assigns a value — a utility, a regret, or a reward; these are all words for this kind of function. Sometimes you might also call it a loss function. And this is actually where the idea of a loss function in decision-theoretic formulations of machine learning comes from — empirical risk minimization and so on. So our answer is: we need to assign a loss or a utility to the individual outcomes, which we would have to put into the same table somewhere. Actually, we can put them in a single extra column if we imagine that the utility isn't affected by the action; if it is, a single column isn't enough, but the framework still works. Then we compute the expected value of this loss under the conditional probability distribution, as a function of the action we take. That gives us a number per action, and then we have a simple optimization problem: we just take whichever action a minimizes this expected loss. This is a really simple kind of result, and it's just the starting point of the conversation we're going to have today.
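To make this concrete, here is a minimal sketch of that computation in Python, with made-up numbers for the three-outcome version of the table (the probabilities and losses below are hypothetical, chosen only so that each column sums to one):

```python
import numpy as np

# Rows: outcomes (does not recover, recovers with side effects, recovers cleanly).
# Columns: actions (drug 1, drug 2). Each column is p(outcome | action).
# Hypothetical numbers in the spirit of the lecture's table.
p_outcome_given_action = np.array([
    [0.10, 0.20],   # patient does not recover
    [0.20, 0.01],   # recovers, but with serious side effects
    [0.70, 0.79],   # recovers without side effects
])

# One loss per outcome, assumed not to depend on the action
# (hence a single extra column of losses, as on the slide).
loss = np.array([100.0, 10.0, 0.0])

# Expected loss of each action: sum over outcomes of L(o) * p(o | a).
expected_loss = loss @ p_outcome_given_action
print(expected_loss)                      # [12.0, 20.1]
print("choose drug", np.argmin(expected_loss) + 1)
```

With a different loss vector — say, if the side effects are considered nearly as bad as non-recovery — the decision can flip, which is exactly the point of the conversation with the patient.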
But let me point out one thing that sometimes gets lost in these formulations. It's quite common to just say: of course, you're minimizing expected loss, or maximizing expected reward or expected utility. But maybe you can think for yourself: why is it the expectation that we optimize? Notice that by writing down this loss as a function of the outcome, we are essentially just defining a new random variable: L is a function of X, and we have a probability distribution over X. So of course we could compute the measure associated with this random variable L by transforming the probability density function, using the transformation rules we've seen in lecture 3. The decision problem, then, really just gives you access to a new probability distribution. What's implicit in this decision-theoretic idea of risk minimization is that you choose the risk — the loss, the utility function — such that it is actually the quantity whose expected value you want to minimize. It's easy to get trapped mentally (this has happened to me as well) by the fact that L, the loss, is itself a random variable. So you might think: but there might be situations in which the variance actually matters, in which I don't want to minimize the expected risk but the variance of the risk. And that's totally fine; of course there are such situations. But then you would just redefine the loss — for example as the square of the loss you were otherwise interested in — and then compute the expected value of that. So implicitly, when I say we use a loss, I mean we use a function whose expected value we want to minimize. That's kind of an implicit definition of what the loss is.

And this — that's actually the entire slide — is decision theory. If you have probability distributions over outcomes, then you can assign losses to these individual outcomes; these losses are the quantities whose expected value you want to minimize; and then you just choose as your action the one that minimizes expected loss — or maximizes expected utility, if you want to be more optimistic and think of this quantity as something positive, something you want to maximize. In this simple formulation, decision theory is basically trivial; it's a one-slide kind of statement. But there are certain complications that make this process much, much harder, often through surprisingly simple, subtle changes. There are in particular two really challenging settings, only one of which I'm going to address today.

The first one, which I'm not going to address, is what happens if you're making choices in a sequence that have effects on each other. To illustrate: imagine that you have to take not just one decision but many. Say in your general practice as a medical doctor, you don't get to see just one patient, you see a hundred. Every day a patient comes in with the same disease — a different patient every time — you still have the same two drugs available, and you have to take the same decision. Which drug do you choose? Well, because the individual events are independent, you obviously take the same decision every time. One simple way to phrase that is that you choose the action that minimizes the accumulated loss, the sum over the individual losses; if they are independent of each other, you can pull the expectation inside the sum and everything's fine. But maybe the situation is a little more complicated: maybe the choice you make on the first day actually affects the state you're in on the second day.
Historically, a common situation in which this arises is control — control engineering settings, rocket science. If you take a decision to steer a dynamical system in a certain direction, then one time step later, or an infinitesimal time step later, the system is in a different state, and maybe that state is actually worse than the one before, even though the decision locally minimized your loss. Imagine a very simple situation: your immediate loss for an individual decision is how much it costs you. Steering a vehicle in one direction or the other has a certain associated cost, so the locally optimal thing to do might be not to steer at all. But that might mean that over time the state you're in — the location of the vehicle — becomes ever more dire, and eventually there's a catastrophic outcome. So you need to steer early on, even though it costs you a little bit, such that over time you stay in a safe space. That's a totally everyday situation we're all faced with: riding your bicycle is maybe tedious, but you're doing it to get to a better place. These control settings are studied, of course, in control theory, and they can be quite complicated, especially if the dynamical system in question has nasty, complicated nonlinear behavior. But this is a course about machine learning, not about control theory, so we're not going to talk about that setting.

There is one other challenge, though, which is actually related to this dynamical control setting, and that's the one in which we don't know the probability distribution. That setting is much more directly connected to a course on probabilistic machine learning, so that's what I want to talk about. Imagine, if you will, that you don't know the entries of this table. You do know the potential outcomes — the state space — but you don't know with which probability they arrive. During a global pandemic, when several companies in the world are racing to develop medical treatments and immunizations — vaccines against this virus — that situation is evidently of extremely practical concern. So let's say we have two choices of vaccine available, developed by two different companies (one of which, by the way, is just on the other side of this window). You don't know yet whether they work, whether the patient is immune afterwards or not. And they might have different associated prices, these immunizations. That, of course, is exactly the situation you're in during a medical trial. And there is a challenge here, which is that, at least in principle, you don't want to separate the process of establishing the probabilities from the process of taking the decision.

Just to be clear, a simple policy would be: at first, we assign random treatments to two groups of patients. We get 100 people from the street — that's actually what's currently happening here in Tübingen for some of the vaccination trials — we have them come in, and we give them, randomly assigned, either the vaccination or a placebo. Then we just see what their immune reactions are, one way or another. That is, to some degree, not an entirely wrong description of the current medical testing regime. But you can imagine that this is ethically a little bit unclean. Because imagine that you have decided, for your testing purposes, that you need to treat about 100 patients.
You start doing that, and after the 10th day you realize that all the patients you treated with the real immunization, and not with the placebo, have now become immune, and all the others are still not immune. How do you continue with your process? It seems ethically wrong to take the remaining 90 people and give some of them a treatment which you know doesn't work. Or do you actually know that it doesn't work after 10 trials? Maybe you don't. You know, of course, from some of our discussions in lecture number 3, that you might just get unlucky: maybe those 10 people you saw were outliers, and the overall efficacy of this treatment is actually much worse.

To make this point with a picture, here's a graph that I've artificially created. Let's say we're playing a game — that's essentially what we're doing here — which has binary outcomes: either the patient is immune afterwards or they are not. I call that the payout of this process; you'll see in a moment why I use this gambling term. And now we run experiments over a sequence of individual patients. Let's say we have three different treatments available, represented by these three differently colored lines, and I'm plotting the empirical average — the empirical expectation, the actual sum of the outcomes (not the expected sum) divided by the number of outcomes. That's the trivial estimator for the expected outcome of each of these three treatments. And they have three different true expected outcomes, which I show as straight horizontal lines: the green one has an expected payout — an expected probability of, let's say, immunization — of 45%, the blue one of 50%, and the red one of 55%. These are deliberately close to each other, because that drives home the point quite well. The jagged curves are just simulated draws from Bernoulli distributions, independent draws for the individual outcomes. You can see that, for example, the green curve — which is actually the worst one — happens to produce positive outcomes in the first two trials: the first two patients that come in are successfully vaccinated. For the red one, the first patient is a success, the second is not, the third is, the fourth is not, and so on, fifty-fifty. And for the blue one, the first happens to be a failure, then three successes, and then it goes down again.

So you can see that these three lines change position relative to each other over time — which one is currently the highest. As we keep learning — that's essentially what we're doing here, computing an empirical, frequentist estimator of the efficacy of these three drugs — their empirical estimates change their ranking. Initially the green one is the best for the first two trials, then it becomes the worst for a while. After 10 trials, the blue one is maybe the best, and the red and green are equally good; then the green one takes over, stays the best for quite a while, and then the blue one becomes the best, and so on.
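If you want to play with this effect yourself, here is a small simulation in the spirit of that plot — the probabilities match the ones on the slide; everything else (seed, trial count) is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# True (unknown) success probabilities of the three treatments.
p_true = np.array([0.45, 0.50, 0.55])

n_trials = 1000
outcomes = rng.random((n_trials, 3)) < p_true    # Bernoulli draws per treatment
running_mean = outcomes.cumsum(axis=0) / np.arange(1, n_trials + 1)[:, None]

# Which treatment *looks* best early on vs. after many trials?
print("after   10 trials:", running_mean[9])
print("after 1000 trials:", running_mean[-1])
```

Run it with a few different seeds and you'll see the ranking of the three empirical means flip around for a surprisingly long time before it settles.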
So let's say we had our 10 patients come in. After 10 patients, we might actually prefer a drug that isn't the optimal one in the end — here it's the middle one, not the best one, and the best one looks about as good as the worst one. Then for a while the best one actually becomes the worst one, and so on; and only after a significant number of trials — this is a logarithmic scale, so this is after a thousand trials — do these things properly separate from each other. When do they separate? That's also shown in this plot, and because this is a course on probabilistic machine learning, we're going to use probabilistic language: the shaded curves are the posterior variances that arise from probabilistic inference on these unknown probabilities of payout. You can see that they initially overlap and only separate after quite a while.

Let's formulate this a little more precisely. Say we have K choices available — in this example, K = 3 — and at every point in time i we can take a choice k_i. So if patient number i comes in — i might be our 10th patient — we take a decision between giving them drug 1, 2, or 3. We assume these patients are independent of each other; every time one comes in, we give them one of the drugs. We have to decide which one — we might do so randomly, but let's see whether that random policy is the right thing to do. We make the choice and then we get a binary outcome: the outcome is either 0 or 1, success or no success, and the probability of a success under choice k is some underlying probability pi_k. That's the thing we don't know.

Now, you've known since lecture number 3 of this course how to build the inference algorithm to learn this unknown variable pi_k in a probabilistic fashion — not yet how to take the decision, but how to do the inference. The right way, for the Bayesian, is to assign a conjugate prior to the unknown pi, and the natural, conjugate one is the beta distribution — thanks to, well, Pierre-Simon Laplace. Its PDF has a normalizer in front and then the form pi^(a−1) · (1−pi)^(b−1), where a and b are parameters of the distribution, which we can choose to be any positive value to make this a proper prior. If we then make Bernoulli observations — I don't even have to write that down anymore, because you've now seen it several times — the posterior distribution is also a beta distribution with updated parameters, where I've introduced the notation m_k for the number of successes: out of the n_k times we've given someone drug k, we've seen m_k successes, so m_k is obviously less than or equal to n_k.
The updated posterior is then — and if you don't know why this is the case, maybe go back to the slides for lecture number three and have a quick look — a beta distribution with parameters a + m_k and b + n_k − m_k, where n_k − m_k is the number of failures. The most extreme prior, the one that is most uncertain in some sense — at least in the logit sense — is the one where a and b go to zero. That's not the uniform prior (that would be a = b = 1); it's one that is very spiky, that basically allows extreme probabilities at either end, near zero or near one. If we choose those values, this is not a proper prior before we make any experiment; but let's say we've already had both positive and negative outcomes, so we have enough data to dominate the prior. Then we can basically drop the prior parameters, and we get a posterior beta distribution which has, among other things, two moments: a mean given by the number of successes over the total number of trials, m_k / n_k, and a variance given by the somewhat more complicated expression m_k (n_k − m_k) / (n_k² (n_k + 1)) — successes times failures, divided by trials squared times trials plus one. That looks complicated, but we can simplify it a little: m_k and n_k − m_k are each typically on the order of n_k, so the numerator is on the order of n_k², which cancels with the n_k² in the denominator, and we're left with something on the order of 1 / n_k.

That's the kind of number I'm plotting here — of course not the variance itself, but the standard deviation, the square root of the variance, which gives a form of error bar: the expected distance of this random number from the empirical mean. You can see this is a good thing to look at, because the spiky curves — the actual behavior of the empirical estimator in individual runs — are well described in expectation by these funnel-like shapes, one for each distribution. And this probabilistic interpretation maybe gives us a hint for how to choose our policy of which drug to give to which patient.

Our very first idea — and this is what often happens in clinical practice — was to fix a number of patients to treat, do that many treatments, and then declare ourselves confident. How would we choose that number? Well, we could look at a plot like this. You have some prior assumption about what the actual probability is — that's what our prior encodes; it could be a number anywhere between zero and one. And as a function of that actual number, we're going to get a variance: you've seen that the variance depends on the value of pi, because hidden inside m_k and n_k − m_k is the actual value of pi, which we don't know yet. So if you want to know the expected value of the variance, that has something to do with what the real probability is.
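Here is that posterior bookkeeping as a small helper — a sketch of the formulas just stated, with the improper a = b = 0 limit from the lecture as the default:

```python
import numpy as np

def beta_posterior_stats(m_k, n_k, a=0.0, b=0.0):
    """Posterior mean and standard deviation of a Bernoulli probability
    under a Beta(a, b) prior, after m_k successes in n_k trials.
    a = b = 0 is the improper limit from the lecture; it needs at least
    one success and one failure to give a proper posterior.
    """
    alpha, beta = a + m_k, b + (n_k - m_k)
    s = alpha + beta
    mean = alpha / s
    var = alpha * beta / (s**2 * (s + 1.0))
    return mean, np.sqrt(var)

# The "funnel": the standard deviation contracts like 1/sqrt(n_k).
for n in (10, 100, 1000):
    print(n, beta_posterior_stats(m_k=n // 2, n_k=n))
```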
And as a function of that probability, we can say how long it takes for these funnels to separate from each other — which in this case happens somewhere around here — and maybe that's the point where we want to stop. But let's be clear: that's still not a guarantee. There are two problems. The first is that these funnels are expected errors; they are not hard guarantees. It might well be that after a thousand runs — you can actually see this over here — two of these lines still cross each other every now and then. That's unlikely, but we might just be unlucky right at our one-thousandth patient, have exactly this kind of situation, and get the wrong policy out. So we don't have hard guarantees, which means we don't really know when to stop the experiment. The other problem is that the differences between these lines might be so stark that the hypotheses actually separate from each other much earlier. This plot is deliberately chosen so that separation happens very late, because the numbers are very close to each other and all very close to the most extreme value, 50% — you can maybe convince yourself that the way to maximize this variance is to put the real probability pi at 50%. Now imagine that one of our drugs has 90% efficacy and the other one has 0.1%. Then of course these plots would separate much, much earlier — I encourage you to try that yourself. They would just deviate from each other, and we would very quickly know, with high confidence, that these two treatments have different outcomes.

This sometimes actually happens in medical trials. Maybe you've heard in the news, as I have, that medical tests sometimes get stopped early for ethical reasons: either it has become clear that the drug under test is so bad — has such massive side effects or such low efficacy — that it would be ethically impossible to continue the test; or it turns out to be so good that it would be unethical to keep giving a significant number of patients the control rather than the actual treatment. In the latter case you might end the trial and start treatment; in the former case you just scrap the trial entirely.

So we would like a policy that prevents both of these problems. On the one hand it should be adaptive, so that if the two probabilities are very different from each other, we quickly commit and focus a lot more on the good treatment. On the other hand, if the two are extremely similar — if we're still uncertain after a significant number of trials — we keep experimenting, playing around with the different treatments. And that means we need a policy (that's what these rules are called) which is adaptive over time and which keeps exploring between the individual treatments, but increasingly less so as the numbers of successes and failures grow and the total number of experiments grows.
One idea for doing that — which I've basically motivated by this observation, and you might already have seen it down here — is to choose, at time i, the option k which maximizes this expression: the current empirical mean of the outcomes, pi-bar_k, plus a constant c times the standard deviation sigma_k, the square root of the variance. What this means is: draw around each of these lines the funnel of variance estimates and watch it contract over time. Why is this a good policy? Well, imagine that early on we randomly happen to get a lot of outlier positive results for action k. Then pi-bar_k will be larger than it should be — it deviates from the true value. By how much? Assuming independent trials, it will typically deviate by something on the order of sigma_k, the square root of sigma_k². That might lift this one estimate over all the others. Now, if it's only a minor deviation, it doesn't matter, because the other options still have the bonus of their variance, which will lead us to use them. But imagine we get really unlucky and pi-bar_k is a very high estimate, outside of its error bar. Then we will choose that option a few times, and even if we keep being unlucky, that will make its variance drop. And the variance drops at just the right rate — because it is the expected squared distance — such that we will eventually stop trying this option and try the others, whose variance of course hasn't dropped, because we haven't chosen them: none of their counts have increased, so their bonus is still large.

If you find this idea tricky, you might want to stop the video — not quite at this point, but in a moment — because there is a constant c in this equation, and of course you can choose it in various ways. For this plot I've chosen the constant to be one, I think: one standard deviation. And that means these lines stop overlapping at this point over there. If I had chosen a larger constant, they would stop overlapping at a later point; a smaller value of c, and they would stop overlapping earlier. How you choose this constant c essentially has to do with how exploratory you want to be. If we choose c to be a large number, then this second term dominates the empirical mean estimate basically all the time: make it arbitrarily large and, because pi-bar is bounded between 0 and 1, sooner or later the mean doesn't matter anymore and the second term completely dominates. What we then see is a policy that always chooses whichever option is currently the most uncertain — a kind of pseudo-random way of picking individual treatments. And if we make c very small, close to 0, then this policy essentially always chooses whichever option currently has the best empirical mean.
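As a sketch, here is that selection rule in code, reusing the beta posterior moments from above (the counts in the example are invented, purely to show the effect of c):

```python
import numpy as np

def choose_arm(m, n, c):
    """Pick the arm maximizing (empirical mean + c * posterior std).
    m, n: per-arm success counts and play counts (all n > 0, so the
    improper-prior moments are well defined).
    """
    mean = m / n
    std = np.sqrt(m * (n - m) / (n**2 * (n + 1.0)))  # beta posterior std
    return int(np.argmax(mean + c * std))

# Arm 1 looks best empirically, but arm 2 is nearly untested:
# a small c exploits arm 1, a large c explores arm 2.
m = np.array([5.0, 8.0, 1.0])
n = np.array([10.0, 10.0, 2.0])
print(choose_arm(m, n, c=0.1), choose_arm(m, n, c=2.0))  # -> 1 2
```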
So maybe stop the video at this point, think for a second about what we just did, and try to figure out for yourself how you would choose c under this policy.

I hope you actually took this opportunity to take a bit of a break — maybe walked away from the screen for a moment, got something to drink — and thought about this problem, which is actually not entirely trivial. Maybe you've convinced yourself that this parameter c we've been pondering is a knob we can twist to trade off the behavior of the agent we're designing here — we could also just call it an algorithm, or a policy — between exploratory and exploitative behavior. That is in fact the term usually associated with this kind of problem: what we have here is a simple form of the exploration-exploitation trade-off. Why? Well, if you choose c to be large, as I just said, then the second term dominates — and that term is associated with uncertainty, with the stuff we don't know, our estimate of uncertainty — and our agent will always choose options about which it is currently uncertain, no matter whether they are good or bad. That's exploring. On the other hand, if you make c very small, essentially close to zero, then the first term always dominates, because it ranges between zero and one and eventually settles at some constant in that range, and the agent always chooses the one option that currently looks best under the current empirical estimate. You could call that exploiting, because instead of exploring to gain knowledge, we just use whatever we believe to know so far and follow it, whether it's a little bit wrong or not.

And maybe you've convinced yourself that the right way to choose c is not to make it constant. The right thing to do — as in our medical example — is to explore initially, trying out the different options, and then, over time, as you become more certain about the expected success rates of the individual treatments, slowly begin to use what you know, while making sure you never entirely exclude any option, because you will never be totally certain that it isn't actually the optimal choice. To that end, this parameter c has to grow slowly over time — where "slowly" leaves a little wiggle room, but it must grow more slowly than the rate at which the uncertainty of a frequently played arm shrinks. Remember that the variance sigma² drops like 1/n, so the standard deviation, the square root of the variance, drops like 1 over the square root of n; therefore we should choose c(n) to grow more slowly than the square root of n, so that for an arm we keep playing, the product c(n) times sigma still shrinks. This is a little bit tricky, so let's trace through it. Initially the algorithm will explore a little. Then it will settle on some options that look good and keep using them. As it uses them, their variance drops; and sooner or later the options that haven't been chosen for a while become interesting again — because c keeps growing while their variance stays put — even if their payoffs looked pretty low. So the agent will decide to choose those options again. Doing that once, or even just a few times, reduces their variance — and reduces it faster than c grows.
So unless their estimate of the actual return increases in turn, their term will drop back down, and we will stop exploring that option again for a while. And over time, because c grows more slowly than the uncertainty of well-explored arms shrinks, we will do this less and less frequently. That's good: eventually, once we've done a lot of experiments and are pretty certain about the expected payoffs, we use the non-optimal choices less and less often. How quickly "less and less frequently" happens depends on the function we assign to c.

Now, this was an example — a simple setup for binary outputs that I could derive in three slides. Maybe you've been wondering: okay, but this seems like a really restricted setting. It's just binary results, and I'm using this Bayesian framework with a beta prior, which seems to impose prior assumptions. Is that acceptable? Is it general enough? Maybe we want a more general framework. Actually, it turns out this kind of argument can be generalized quite a lot, and this is one of the points where statistical learning theory provides us with a very helpful piece of machinery. I'm going to argue this on an abstract level — I'll show you the math, but only briefly; it's a bit too involved to do in detail. Intuitively, though, you can convince yourself that this estimate of the variance often won't depend on exactly which prior we use, and not even necessarily on the kind of data we see or the likelihood we have. Roughly speaking, as you get more and more data — assuming independent observations, independent experiments — the uncertainty estimate, the variance, is more or less unaffected by the prior choices we make and by the data type we get.

The two things we actually do have to commit to are these. First, we assume independent observations, so that we cannot be fooled by one experiment — one particular decision — having some kind of backdoor effect on future observations; the choices we make at some point do not affect the outcomes of later experiments other than through our policy. Second, we have to constrain the returns — the kinds of losses we expect to see — to some finite domain; the natural one is between 0 and 1, because if it's some other bounded domain, we can just rescale it so that it lies between 0 and 1. And then it turns out there is theory in statistics that provides general bounds on the deviation of an empirical estimate of a mean from the actual mean of the distribution. There are several such bounds, but one particularly nice one is the so-called Chernoff-Hoeffding bound, a special case of the so-called Chernoff bound. By the way, the names of these two people suggest origins somewhere in Eastern Europe, but it turns out both spent most of their lives in America: Chernoff was actually an American statistician, and Hoeffding was born in Finland but emigrated to the US at some point in his life. So there you go.
As far as I understand, Chernoff provided a general form and Hoeffding a special case of the bound. It is essentially a very technical version of a statement related to the central limit theorem: if you estimate the mean of i.i.d. random variables, then that estimate is, up to normalization, the sum of the variables you get to see; and because it's a sum of i.i.d. random variables, that sum is approximately Gaussian distributed. You can make this more precise with a statement like the one on the slide — if you want to see the full math, you can stop the video here and think about it if you like. What these statements really say is that the deviation of the empirical estimate from the true mean of the distribution is bounded in probability by a Gaussian-like expression: an e-to-the-minus-squared-distance-from-the-mean term. So as we increase the number of observations, the probability that our estimate of the payoff deviates from the true expected value is essentially bounded by a Gaussian tail. That means the variance of this estimate drops like the variance of a Gaussian mean estimate — like one over the number of experiments — and therefore the error estimate drops like one over the square root of the number of experiments. That's the general statement that generalizes the analysis I did with a concrete Bayesian model — a specific prior and specific assumptions about the observations, namely binary observations with Bernoulli likelihoods and a beta prior — to the general setting of variables that lie in the unit interval and have this kind of conditional independence structure.
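In symbols — this is the standard textbook form of the statement, written here from memory rather than copied from the slide, so treat it as a sketch: for i.i.d. random variables X_1, …, X_n taking values in [0, 1] with mean μ,

$$
\Pr\!\left( \left| \frac{1}{n} \sum_{i=1}^{n} X_i \;-\; \mu \right| \geq \varepsilon \right) \;\leq\; 2\, e^{-2 n \varepsilon^2} .
$$

Setting the right-hand side to a fixed confidence level and solving for ε gives exactly the 1/√n error bars we've been drawing.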
So the fact that such general bounds exist suggests that there should be a general class of algorithms which are, at least asymptotically, relatively independent of the specific prior chosen over the outputs of the individual choices — at least under this somewhat less restrictive, but still restrictive, set of assumptions: that the payoff of the individual choices, the loss or utility, lies within a bounded range, and that it has this conditional independence structure. These algorithms, as some of you might already know, are usually subsumed under the moniker of so-called multi-armed bandit algorithms, and the line of work we'll look at goes back to a now very famous paper from the early 2000s by Peter Auer and others. So now, of course, I should tell you what a multi-armed bandit is; here is the definition, and then we'll see what the algorithm actually is. First of all, where does the name come from? The historical motivation, of course, is experimental design problems like the medical problem I outlined at the beginning of the lecture. It's often reduced — as mathematicians like to do — to a specific mental game: you are in a casino, and there are these one-armed bandits, slot machines where you put in a coin, pull a lever, and a random process starts; some wheels spin, they stop at some point, and depending on where they stop, they give a payout, limited between zero — you get nothing — and some maximum jackpot you could win. Imagine you're in a casino that has many of these one-armed bandits next to each other, as they exist in the US, and you can walk around and decide which of them to play. You have a bunch of coins in your hand, you could walk around basically forever, put a coin into one of them, pull the lever — and you have to decide which one. Let's say they all have different payoffs: one of them is particularly good, gives more frequent wins of higher expected value, and others have lower expected payoffs; but they are independent of each other, and the distribution over payoffs does not change over time.

This is formalized in the following way. We call it a K-armed bandit, for the K different choices: a collection of random variables, associated with the arms of the bandit, whose payoffs we assume to be independent and identically distributed. What we are looking for is an algorithm, often called a policy, which — given what we've done so far and the outcomes of previous experiments — decides which machine to play at time n, based on past plays and rewards. The connection to our example: the drugs we give to patients are the individual arms of the bandit, and the payout is the result for the patient — what kind of recovery they had, what side effects — which we might associate with an individual cost.

And we can define a concept called the regret, which is important in this literature: the regret R_A(n), where A is the algorithm we use and n is the number of times we've played. It's a quantity that describes how good an individual agent — an individual algorithm, a policy — is, and it's given by a difference. (This is usually phrased in terms of maximization, because with bandits you want to get as much payout as possible.) The first term is the best possible outcome you could have. What is that? It's the one you would get if someone just told you which bandit has the highest expected payoff, and you just went to that machine and kept playing it all the time. That would give you, on average, a payoff of the expected outcome of this best machine, times the number of times you've played it.
But instead we are using our algorithm, which has to explore and play around. That algorithm follows a policy that chooses to play machine j some number of times, T_j(n), out of n plays in total; how often it does so depends on the algorithm, so we compute the expected value E[T_j(n)] — a non-random number for this particular algorithm — and multiply it by the expected payoff mu_j of that arm. We sum this over the arms and subtract it from the optimal payoff. The difference between those two is always going to be a non-negative number, because the optimal expected payoff is larger than or equal to any of the mu_j, and it will grow with n. What we want to do is minimize this regret: choose an algorithm such that this number stays as small as possible. Remember that the optimal value would be zero, but of course we won't get that, because we have to learn while we play.

The algorithm we just developed — well, we didn't really; I never wrote it down — is a special case of this class of bandit algorithms. And it turns out one can phrase a general rule, called the upper confidence bound algorithm, which has inspired a lot of subsequent algorithms and is also from this original paper from 2002. It looks like this. Play each machine once at the beginning. (Remember, in our example I assumed a prior with a and b equal to zero — a sort of implicit, improper prior — which only yields meaningful estimates once we've played every individual arm at least once, so that the denominators in these fractions are non-zero; otherwise the rule isn't well defined. So we need to do this first.) Afterwards, at each time, play the arm j which maximizes the sum of two terms quite similar to what we had before: the empirical mean of that arm — m_j divided by n_j, as before — plus the square root of two times the logarithm of the total number of plays n, divided by n_j, the number of times we've played this one arm. Remember that 1 over the square root of n_j is the quantity that drops like the standard deviation of our estimate of the expected payoff, and the square root of 2 log n is our function c(n), which grows — but more slowly than the square root of n, so for an arm we keep playing, the whole exploration bonus still shrinks over time. This is one particular choice for the growth of c, and it is particularly useful because it admits a proper guarantee, which we'll get to right after the following sketch.
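Here is the whole procedure as compact Python — a sketch of UCB1 as just described, not the paper's reference implementation; the three Bernoulli arms at the bottom are the hypothetical treatment probabilities from our running example:

```python
import numpy as np

def ucb1(reward_fns, n_rounds):
    """UCB1-style policy: play each arm once, then maximize
    empirical mean + sqrt(2 * log(total plays) / plays of this arm)."""
    K = len(reward_fns)
    counts = np.zeros(K)        # n_j: how often arm j was played
    sums = np.zeros(K)          # accumulated reward of arm j
    choices = []
    for t in range(n_rounds):
        if t < K:               # play each machine once at the beginning
            j = t
        else:                   # then mean + exploration bonus
            bonus = np.sqrt(2.0 * np.log(t) / counts)
            j = int(np.argmax(sums / counts + bonus))
        counts[j] += 1
        sums[j] += reward_fns[j]()
        choices.append(j)
    return choices

rng = np.random.default_rng(1)
# Bernoulli arms with success probabilities 0.55, 0.45, 0.50.
arms = [lambda p=p: float(rng.random() < p) for p in (0.55, 0.45, 0.50)]
picks = ucb1(arms, n_rounds=2000)
print("share of plays on the best arm:", picks.count(0) / len(picks))
```

The regret of a run is then mu-star times n minus the sum of mu_j over the actual picks; tracking it over n reproduces the plots coming up in a moment.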
That guarantee is an original theorem from the paper — a typical statistical-learning-theory type of statement, a regret bound. It says: if you use this policy with arbitrary reward distributions — essentially arbitrary distributions over rewards with expected values mu_i and support on the unit interval, very similar to the setting we described before — then the expected regret of this algorithm is some complicated quantity in which most numbers are bounded constants; the only thing that actually depends on n is a logarithm, which shows up inside a sum over the K arms. So the entire regret is O(K log n): on the order of the number of arms times the logarithm of the number of plays. One simple way to phrase what this theorem says is that the algorithm chooses non-optimal options at a rate that is logarithmic in the number of plays — a suboptimal arm is played at most logarithmically often. And if you're a computer scientist, logarithmically often is good, because it means basically asymptotically not at all: if you keep playing, you'll eventually play the optimal arm most of the time, with only logarithmically often some exploratory plays interspersed.

Here is a visualization of how this algorithm works, again in our setting. Say we have our three different drugs, and these are the probabilities of a good outcome: 55%, 45%, and 50% recovery rate, as on the plot a few slides ago. I'm plotting the behavior of this UCB, upper confidence bound, algorithm. In the left-hand plot you see the sum of the individual rewards collected — recovered patients, say, from the individual treatments over time, summed up for this particular policy: whenever the algorithm decides to use one treatment, I run a Bernoulli experiment with that probability, get an outcome that's either 0 or 1, and add it to that treatment's count. Up at the top, in what looks like a barcode plot, you see a histogram, essentially — a record of which arm, which choice, gets made when: this row is the 50% arm, that one the 55% arm, that one the 45% arm, and white means the arm is chosen. Clearly, what we want is to pick the green one, the optimal one with 55% payoff. You can see that initially the algorithm explores, trying all sorts of options, which leads to broadly similar behavior for all of them. Then we actually get a little unlucky: for a while the blue option, which isn't optimal, looks like the best one. During that phase the algorithm still explores, because it isn't yet sure it knows what's going on — and because it keeps exploring, the green line eventually emerges as the optimal choice, and the algorithm starts playing that arm more and more frequently; you see a white stripe showing up. If you kept running this experiment for many thousands of rounds, this would become essentially a solid white stripe with logarithmically interspersed black lines where the other options are chosen. On the right-hand side you see that the algorithm basically works. This is the regret: the difference between the payoff collected by the algorithm and what one would get by always playing the optimal choice.
If you did the right thing all the time, the regret would be zero; we want this quantity to stay low. In black is the bound on the regret — the bound from the theorem. In blue you see one particular instance of this algorithm running over time and its regret accumulating. Because the algorithm actually plays, it collects a distribution of rewards, so these are not expected regrets, they are samples of the regret; sometimes the sampled regret is even negative, because we're collecting random numbers and might just get lucky, with more positive outcomes than expected — that's why this line is missing somewhere initially. And in red you see the expected regret: for every arm the algorithm plays, I record the expected payoff rather than the actual payoff, which is why you get a smoother line growing over time. Growing, of course, is bad — we want this to be flat — but we have this bound up here, and we're guaranteed that the curve will flatten out further on. If you ran this experiment into the tens or hundreds of thousands of rounds, this line would become flatter and flatter, logarithmically flatter.

So — and this slide is the main result of the entire lecture — these algorithms are basic forms of decision methods: a contemporary, though by now nearly twenty-year-old, idea for how to solve this problem of experimental design, with discrete choices, independent outcomes, and bounded rewards, in a more elegant way than designing a study with a fixed number of patients which then stops experimenting and switches directly into clinical practice, if you think of the medical application. These algorithms, called bandit algorithms, apply to the kind of setting I just described, and they can be shown to yield policies with bounded regret: if you let n grow arbitrarily large, the regret grows only with the logarithm of n, which essentially doesn't grow at all. They provide a basic way to do experimental design. It even turns out that they can work if you assume someone gets to fiddle with the arms of the bandits in your casino while you're playing — there's a wonderful paper called "Gambling in a rigged casino", which I very much recommend you read, because it's a fun read, where the algorithm is adapted a little for that setting.

But these methods do not, of course, answer every possible experimental design problem. They are restricted to this relatively simplified setting of discrete choices with bounded payoff under i.i.d. observations. So in which senses is that a problem? Here are some ways in which the assumptions of bandit algorithms are often violated in practice. The first is that you often don't just have discrete choices. Even in this medical example — yes, maybe you have one, two, three treatments available — but often in medical applications you then get to choose doses: how much of an individual medicine, or a vaccination, do you give to the patient? And in other settings, as we have them in engineering, for example in prototyping problems, and even in machine learning, you have continuous parameters to choose. What is the optimal learning rate to make your optimizer train your neural network fastest with stochastic gradient descent?
What is the correct setting for some machine such that it works most precisely, and so on? In such settings we have continuous domains over which we have to optimize our choices, and that doesn't work with an algorithm that fundamentally requires us to play every single option at least once: if you have infinitely many options, you cannot do that.

Another problem is the following. I described this algorithm as a remedy for the kind of prototyping setting we might have in medical applications, and there it may be good that bandit algorithms move continuously from exploration to exploitation. But in some prototyping settings in industry you really have a discrete step between a prototyping phase and a market phase: you're building something, you want to understand how your system works, you want to deliberately explore for a while to learn as much as possible about how to act optimally — and then you have to ship a product that you can't easily correct anymore and that you have to rely on being good. Those kinds of shifts between settings are not easily modeled in terms of regret, because such a phase transition introduces very nasty shifts in the regret. There's an even more extreme setting, where you can do different experiments with totally different cost structures; I'll mention that in a moment.

We don't have much time left in this lecture, and I don't want to end this course with a long-winded derivation, so I'm just going to give you a quick rundown of the history of this part of our field, because we are of course not the first to notice that these problems exist. Over the past decade — actually almost two decades — people have been thinking about how to remedy this by expanding the framework of bandits to more and more elaborate settings. Maybe the most important generalization to address first is the one to continuous domains. So here is that kind of setting: you've done three experiments in some domain; this is a variable you can control — say, the logarithm of the learning rate of the optimizer of your deep neural network, or something similar — and you've got three different outputs, measurements (with error bars, of course) of some function, for example the log training or log test error of your deep neural network. So what is the optimal learning rate? That's the choice you have to make here, and it might lie anywhere in this domain. You notice right away, just by looking at this plot, that you cannot get away here without making strict prior assumptions, which you might as well do in a probabilistic sense. Why? Because there is no uninformative prior: an uninformative prior would require you to test every single option once before you could make any kind of statement, and that's infeasible in a continuous domain with infinitely many options. So let's put a Gaussian process prior over this function. Why a Gaussian process?
Because it's the one tractable model for functions over continuous domains that we actually have. We use it to predict where the minimum of this function might be. The corresponding problem — figuring out where the minimum of such a function lies — is then, depending on the community, called either continuum-armed bandits or Bayesian optimization, a term you might have heard before. It's actually one of the more successful parts of probabilistic machine learning of the past few years; it had its heyday around 2015 or so. The idea: we put a Gaussian process prior over our function and condition on these three observations. This gives us a posterior over the true function, which I've represented here by three samples and a posterior Gaussian process marginal. But we are not interested in the value of this function everywhere; we are interested in where this function has its minimum. Now it turns out you can construct a distribution over this minimum — that's this red distribution here below. One way to do so, in a pedestrian fashion, is to put a regular grid over the domain, draw a large number of sample functions — individual ones like these black curves — and find where their minima actually are. For this sample, for example, the minimum is here; for the other sample the minimum is over here; and for a third sample the minimum happens to lie right at the boundary. (You can maybe convince yourself that something special happens at the boundaries: these functions are differentiable, so they can have positive or negative derivatives at the boundary that push mass onto the left and right edges.) If you repeat this sampling process many times and collect a histogram, you get something like this red curve down here. That gives us a probabilistic model for where the minimum of this function lies.
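Here is what that pedestrian recipe can look like in code — a sketch only, with a squared-exponential kernel, made-up observations, and a made-up noise level, all chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def k(a, b, ell=0.25):
    # squared-exponential kernel (an assumption made for this illustration)
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

# three made-up noisy observations of the unknown function
X = np.array([0.2, 0.5, 0.8])
y = np.array([0.3, -0.1, 0.4])
noise = 1e-2

grid = np.linspace(0.0, 1.0, 200)  # regular grid over the domain
K = k(X, X) + noise * np.eye(len(X))
Ks = k(grid, X)
mu = Ks @ np.linalg.solve(K, y)                       # posterior mean on grid
cov = k(grid, grid) - Ks @ np.linalg.solve(K, Ks.T)   # posterior covariance

# draw many posterior sample paths and record where each one is minimal
paths = rng.multivariate_normal(mu, cov + 1e-6 * np.eye(len(grid)), size=5000)
argmins = grid[paths.argmin(axis=1)]

# the histogram of these minimizers approximates p(x_min | data)
p_min, edges = np.histogram(argmins, bins=50, range=(0.0, 1.0), density=True)
```

The histogram `p_min` is the discrete stand-in for the red curve on the slide.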
Now what we want are algorithms that are the continuous equivalent of the bandit algorithms: algorithms that decide over time which experiments to do, such that, for example, the function values we collect get smaller and smaller. Over the past ten years — actually more than ten years — many such algorithms have been developed; they had their heyday in the 2010s, around 2015 or so, and they are still quite popular, because they have been designed to address various specific problem settings. We're towards the end of this final content lecture of the course, so I don't want to bother you with all the technical details. I just want to give you a highlight reel of some of the functionality these quite elaborate Bayesian experimental design agents have by now reached. I'll show you just two algorithms. That doesn't mean these are the ones you're supposed to use if you face such a problem in practice; rather, they mark two corners of a spectrum, ranging from relatively theoretically motivated, quite lightweight, and in many ways easy-to-implement algorithms, all the way to relatively complicated methods that have rich functionality but are also more demanding, both in compute time and in design time. You don't have to understand how all of these algorithms work in detail; I just want to highlight that this Bayesian reasoning over decisions can be used directly to control, for example, experimental design in machine learning.

The first algorithm I have to show you is called GP-UCB, the Gaussian process upper confidence bound — you can guess that it's a generalization of the upper confidence bound algorithm we just discussed. Why do I have to show it to you? Because, just as I'm recording this lecture, it received the ICML 2020 Test of Time Award. It's by Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger, from 2009, and it is, simply speaking, a more or less direct generalization of the UCB algorithm to the Gaussian process setting. Remember that the UCB algorithm chooses the arm of the bandit that maximizes the sum of a mean and an exploratory term related to the variance, or rather the standard deviation, of our estimate. Here we want to minimize — that's why there is a minus sign in here — and the constant, which on the previous slides was called c, is in this paper called the square root of beta; but other than that, it's exactly the same. Why can we do this? Because Gaussian processes have these quantities: they have posterior mean functions and posterior standard deviations. So instead of choosing among a set of discrete options, we now choose, continuously, wherever the sum of these two quantities is minimized; we just have to choose this constant, square root of beta, in the right way.

In the original paper there are actually two different ways of motivating such choices. One of them is a probabilistically motivated argument: assume the true function we want to minimize actually is a draw from the Gaussian process prior, so that we have a truly probabilistic estimate of the true function. Then set beta to some complicated expression that involves — again, this is the important bit — a logarithm: two times the logarithm of quantities that depend on the dimensionality of the problem, the number of evaluations so far, and so on. You then get a regret bound, similar in spirit to (though a bit more complicated to state than) the one on the previous slides about UCB, which says that the cumulative regret is bounded above, with high probability, by a term that grows only sublinearly in time, with logarithmic factors in it. That means that if you sum up your regret over time and divide by the number of choices you've made, you get a number that goes to zero; that is usually called the no-regret setting.

The corresponding acquisition function looks like this red curve here, which is just the weighted sum of posterior mean and posterior standard deviation. This algorithm would say: the next place I want to evaluate is not here, where we currently think the minimum is, because that's not exploratory enough; over here we are much more uncertain about the value, so let's evaluate over there and take that number. There is a corresponding statement in the paper that sounds more theoretical and is more agnostic: don't assume a Gaussian process prior, but instead assume that the true function comes from the RKHS associated with this Gaussian process's kernel, and bound the norm of that function within the RKHS. Since the lecture on the theory of kernel methods, you already know that this is almost the same statement, because norm-bounded functions in the RKHS are bounded away from the mean by a quantity exactly proportional to the posterior standard deviation, the constant being the bound on the norm. So you can imagine that this enters in beta, and you get a very similar statement, essentially the same as before.

So these are the theoretically motivated algorithms, and they give a very simple decision rule: if you want to find the minimum of a function on which you do Gaussian process regression, just compute the weighted sum of the posterior mean and the posterior standard deviation — a form that is easy to evaluate — and find its minimum (for that you need a numerical optimizer). That tells you which experiment to do next.
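In code, this decision rule is essentially one line. The sketch below assumes the gridded posterior mean `mu` and covariance `cov` from the earlier snippet, and fixes beta to a constant purely for illustration (in the paper, beta_t grows slowly with the number of evaluations):

```python
import numpy as np

def gp_lcb_next(mu, cov, grid, beta=4.0):
    # Lower-confidence-bound rule for *minimization*: evaluate next where
    # posterior mean minus sqrt(beta) * posterior std is smallest.
    sigma = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return grid[np.argmin(mu - np.sqrt(beta) * sigma)]

# e.g.: x_next = gp_lcb_next(mu, cov, grid)
```

On a continuous domain one would replace the grid argmin by a numerical optimizer, as mentioned above.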
This algorithm is still very popular, precisely because of this theoretical guarantee that it has no regret. However, these kinds of algorithms also have their limits — in particular, that they always try to minimize regret. It's not always true that you actually want to collect minimal function values over time; maybe you just want to learn efficiently where the minimum is. That means you want to explore in a guided fashion: not to learn the entire function, but to learn where the minimum lies. For these kinds of settings — the prototyping setting, where you initially have a budget of evaluations and want to spread them well — there is a class of algorithms, which I drop in shamelessly because I was involved in their development, sometimes called information-based search algorithms or, because it is simply a cunning, easy-to-remember name, entropy search (I have to drop that name because it was the title of our paper); they have actually been reinvented several times. They are based on the idea that you want to construct this distribution over the minimum that I showed you before — the red curve — and then collect observations such that your uncertainty about the minimum, captured by the entropy of this red distribution, drops as quickly as possible. You want informative observations about the minimum of this function, and that means you want to reason about what would happen to this distribution if you evaluated at some point. Here are a bunch of hypotheses describing things that might happen if you evaluate at a particular point; all of them would change the distribution over the minimum. You could do this kind of reasoning at every point in the input domain and then ask which of these expected changes to the distribution over the minimum you like most — which of them have the best effect on your knowledge about the minimum. You can probably guess that doing so is numerically quite challenging and requires somewhat elaborate implementations.
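To make "numerically challenging" concrete, here is a deliberately crude Monte Carlo sketch of the core computation — the expected entropy of the distribution over the minimizer after a fantasized observation at one candidate point — again assuming the gridded posterior `mu`, `cov`, and `noise` from the earlier snippets. Real entropy-search implementations replace this nested sampling with far more careful approximations:

```python
import numpy as np

rng = np.random.default_rng(0)

def pmin_entropy(mu, cov, n_paths=2000):
    # entropy of the discretized distribution over the minimizer
    paths = rng.multivariate_normal(mu, cov + 1e-6 * np.eye(len(mu)), n_paths)
    p = np.bincount(paths.argmin(axis=1), minlength=len(mu)) / n_paths
    return -(p[p > 0] * np.log(p[p > 0])).sum()

def expected_entropy_after(i, mu, cov, noise, n_outcomes=20):
    # rank-one GP update for one fantasized observation at grid index i
    v = cov[:, i]
    gain = v / (cov[i, i] + noise)
    cov_new = cov - np.outer(gain, v)  # covariance update is independent of y
    h = 0.0
    for _ in range(n_outcomes):
        y = rng.normal(mu[i], np.sqrt(cov[i, i] + noise))  # fantasized outcome
        h += pmin_entropy(mu + gain * (y - mu[i]), cov_new)
    return h / n_outcomes  # low value = informative candidate experiment
```

Evaluating `expected_entropy_after` for every candidate index and picking the lowest value is exactly the kind of reasoning described above, and you can see why it costs much more than the one-line UCB rule.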
But this gives rise to algorithms with quite rich behavior. In particular, they can address settings that are very difficult to capture in a regret formulation — for example the prototyping setting, where you have an initial phase in which you can do experiments, even slightly dangerous ones (maybe you have a prototype you can crash; it doesn't matter, you can throw it away afterwards), and you really want to figure out where the optimal choices lie, because later on you need a product that works reliably. They also work in settings where you can choose between different experimental modalities. For example, in robotics, and actually in product design generally, you often have a choice between a simulation experiment, which is extremely fast but is not the real world and is subject to modeling errors, and a physical experiment, which is very informative but very expensive, and you want to trade off between these two prototyping channels. That is already a relatively rich, complicated setting, and you obviously cannot phrase it in terms of minimizing regret, because the regret of a simulation experiment — the number you get out of a simulation — is not comparable to the number you get out of a physical experiment.

So I want to leave you at the end of this lecture with a video. It wasn't made by me; it was produced by Alonso Marco Valle, a PhD student here in Tübingen, and it uses one of these very elaborate algorithms for Bayesian optimization — information-based Bayesian optimization — to choose automatically between simulation experiments and physical experiments. On the left you see the physical experiment. Physical experiments are expensive to do, but they are obviously very informative, because you actually observe how your physical system behaves. Simulation experiments are extremely cheap, but they are also just a simulation, so they are not guaranteed to be informative. However, you can model the information and ascribe a cost to the simulation and to the real experiment, and then use these information-based formulations, these Gaussian process models, to ask which experiment to do — in simulation or in reality — weighted with the costs of the individual experiments, so as to maximize information gain per cost. The algorithm, whose task here is to learn a control policy for an upright (inverted) pendulum on a little robot setup like this, decides to initially run a bunch of simulations, to get a rough idea of what the loss landscape looks like; this can be done with cheap experiments. Sooner or later the algorithm realizes that the information content of an individual simulation has become so low — because we've done so many — that it's now actually worth doing physical experiments, before returning to simulations again. As you can guess, this is a really rich framework, and also numerically challenging to implement, but it basically allows you to automate the design process of physical prototyping in industrial settings — not just in medicine, but also in engineering and robotics.
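Conceptually, the selection rule in such a multi-fidelity setup is simple, even though computing its ingredients is not; here is a sketch in which both the information-gain numbers and the channel names are hypothetical placeholders:

```python
def next_experiment(channels):
    # channels: name -> (expected information gain about the minimum, cost);
    # in a real system both numbers would be computed from the GP model,
    # e.g. via the entropy reduction sketched above
    return max(channels, key=lambda name: channels[name][0] / channels[name][1])

# e.g.: next_experiment({"simulation": (0.03, 1.0), "physical": (0.8, 50.0)})
```

Early on, cheap simulations win this ratio; once their information gain has been exhausted, the expensive physical experiment becomes the better deal — which is exactly the behavior in the video.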
Bayesian optimization has become a relatively standard tool by now, and there are many software toolkits available for it. Here are some that have been developed over the past ten years or so; one of the first was Spearmint, from what was back then Ryan Adams' research group at Harvard, and there are various others by now. There are actually also companies working on Bayesian optimization algorithms; the ones highlighted here are open-source tools that you can just use, which is really helpful, because implementing these algorithms can be quite tedious and subject to all sorts of annoying numerical constraints.

With this rapid fly-by over a large class of complicated algorithms, we are at the end of today's lecture, in which you learned how to take a decision. You saw that, in the simplest form, these decisions are actually easy, because they just amount to minimizing some expected regret or loss, or maximizing some expected reward. In the setting of dynamical systems this can of course still be arbitrarily hard — that's called optimal control, and it goes beyond the scope of this lecture. More interesting for us was taking a sequence of decisions that interact with the learning process of a probabilistic machine learning algorithm. We saw that, in the relatively simplistic setting of discrete choices on a bounded domain with bounded payoffs, there is a class of algorithms called bandit algorithms, which can be motivated — as we did in this lecture — in a probabilistic fashion, but which do not require particularly strong probabilistic assumptions: they can even be phrased in an almost non-probabilistic framework, with improper priors, and still achieve a good result, captured in the notion of no regret, where the expected sum of rewards divided by the number of steps goes towards the optimum over time. But when we go to the continuous domain, which is maybe more realistic, you really don't get around making prior assumptions, and then the framework of Bayesian optimization provides a rich language for the interaction between a learning agent and the environment in which it's trying to learn — under the assumption that the environment does not react to the behavior of the agent; that's the i.i.d. assumption. If the environment actually reacts to the agent, that's beyond the scope of this lecture; there is going to be an entire lecture course on reinforcement learning here in Tübingen, and I recommend you take it if you really care about this very challenging setting, which then truly enters the realm of intelligent agents that act in and interact with the world.

The goal of this lecture was just to reason about collecting information in an efficient fashion, so that we can learn efficiently. There is a rich framework for doing so, which contains algorithms that are relatively simple to implement and cheap to run, with certain theoretical guarantees, and extends all the way to quite elaborate methods that provide a rich language for describing the kind of prototyping setting you might face in scientific or industrial development processes: where you can spend budget on exploration to learn about the minimum, where you have access to several different experimental channels — modes of experimentation that give different forms of information at different costs — and where you have structured beliefs about the function you are trying to optimize, and so on. These methods are arguably the current state of the art for the optimization of complicated structured functions: for the kind of prototyping and experimental design that matters in settings like medicine and healthcare, but also in engineering, robotics, and essentially all kinds of industrial prototyping. This step out of the academic realm — away from just taking data and learning from it, into an interactive mode of learning in which we choose our experiments — is maybe a fitting end to our conversation about probabilistic learning algorithms. In the next lecture we're going to revise: look back a little and summarize all the things we've seen over the past 25 lectures. I'm looking forward to seeing you there again. That's it for today.