Hello and welcome back to Probabilistic Machine Learning, lecture number 11. The past few lectures were perhaps food for thought for those of you who are theoretically inclined and want to understand exactly how learning machines work. We took a long, hard look at one of the most elementary forms of probabilistic machine learning, the Gaussian framework. We saw how to learn linear and nonlinear functions, even with infinitely many degrees of freedom, from real-valued observations, and we looked deeply into the theory of these models. We learned about concepts like reproducing kernel Hilbert spaces and their associated kernels, feature spaces, eigenfunctions, and various other ideas, and we understood both the modeling and the computational challenges involved.

Maybe over the course of these lectures you've begun to think: this is getting a little theoretical, a bit of an academic exercise. What does it all actually mean in practice? Today's lecture will be different, a bit of an antidote to the load of theory we've encountered in the past few days. We'll do a very hands-on session in which we take a deep look at one specific data set and think about how to build a regression model for a concrete application.

The data set we'll use is close and dear to my heart, because it was genuinely produced by my own body, or rather, it reflects data about my own body. What you see here as scatter plots are recordings of my own body weight over the years, from 2009 to the middle of 2013, when the recording ends. I've actually kept records since then, but for this exercise this is enough. I have cunningly removed the numbers on the y-axis so that you don't know what my absolute weight was at any point in time; in some sense, I've anonymized the data. I can tell you this much, though: every single gray line here represents one kilogram.

What you see recorded here are the usual ups and downs of life as a young academic. The data set starts during my PhD and covers the first few years of my postdoc life. The task we're going to address today, or one of them, is to predict how that data might continue to evolve into the future. And at this point, before we look any further at the data or into modeling, I want you to think for yourself.
Maybe stop the video if you like, and think about how you would predict this data into the future. Perhaps I've already spoiled you by telling you about features and Gaussian processes and regression models and deep neural networks, but allow yourself to forget about those words for a moment and just think about what you would do, with your mind, if you tried to extrapolate this data into the future. What is your best prediction for where that number is in, I don't know, 2020?

Once you've done that, you may have come to the conclusion that you can't actually do much more than predict a relatively horizontal line into the future, with a lot of uncertainty around it. The reason is that I've really only provided you with very little information. Well, actually, I've provided you with quite some information: I've told you that these are weight measurements, that they are measured in kilograms, I've even given you a scale for the data, and I've explained what the input is. That's exactly the kind of information you often get in a practical setting when someone hands you a data set. Given what I've just done, I could claim that I've told you what this data is: everything is properly labeled with metadata, I've told you what the input is, I've told you the scale of the output. Okay, maybe there's a constant missing here, but who cares, right? You could just assign an arbitrary constant.

So what is difficult about this? The difficulty is that I haven't actually provided you with the generative information for this data, the causal structure behind it. Just by looking at this data you can clearly see that there is structure in here, but without understanding where that structure comes from, it's very difficult to extrapolate into the future, to say how it's going to continue.

Now of course this is a bit of a constructed, personal example, but there are real-world situations like this as well. If you're working as a machine learning engineer in a company, you'll typically get data sets like this too. Your first task then is to go back to whoever created, collected, or owns the data and ask them much more about it. You want to have an interview with the person who created this data, to understand as much as possible about where it comes from and where its structure comes from, so that you can include that structure explicitly in your model and use it to make much more informed predictions about the future. And that's what we're going to do today.

So imagine you just sat down with me and we're doing an interview about this data set. What I can tell you is going to reveal structure in the data, and that structure is highlighted in this plot. So here goes the story. It starts in about 2009.
Back then I was a PhD student, living in the UK in a small town called Cambridge, which has a very old university and is a bit of a peculiar place. People there live in old monasteries, even though they might be doing a PhD in computer science or physics, as I was. One aspect of this weird college life is that you have access to food more or less whenever you want. In the morning, when I got up, I could, if I wanted to, just walk through a beautiful landscaped garden into my college cafeteria and get a full English fried breakfast. I didn't always do that, but I did it way too often, and at some point I became very unhappy with my body weight and started recording it, which is how this data set was created. I actually went on a bit of a diet for a while; you can see that here in green. However, I was unhappy with the results of that. It was difficult to diet in a place that had so much food.

Then something quite positive happened: I had a productive conversation with one of my colleagues in the lab, Carl Scheffler, a wiry young South African and a real ascetic. He took me along on runs, and I started running, actually very seriously. I went on runs three to four times a week: at the beginning just 5k, but soon 10k, 12k, 18k, half marathons. And I lost weight like crazy. Basically the story is: if you're running enough, it doesn't actually matter what you're eating, you'll always lose weight. So that's this time here, and it ends around here.

I distinctly remember, at the end of 2009, this was the NeurIPS conference in early December, I was in Vancouver in a nice hotel, running on a treadmill, feeling really great and good about the outlook on life. Then I went home from the conference, and I actually went home to my mother's over Christmas and New Year's, to visit some friends and stay with family, and I totally lost control of myself. I realized that the end of my PhD was in sight and that I needed to start working. A new phase started in which I stopped running. One of the problems was that my running mate left at that point, not because he finished his PhD, but because he wanted to work in Afghanistan, during the war there, actually. So I didn't have a running partner anymore; I stopped running and started eating again, because I had to focus on my PhD. While I was finishing my PhD, I basically gained back all of my weight, as you can see over here.

Then, in October 2010, I submitted my PhD and had a bit of a phase of reckoning; I could start thinking about the world again, and things plateaued over here. Then I actually moved to Tübingen at the start of 2011, to start a postdoc, and I tried to get my life under control again. I was in a new phase of life: I had moved, I had a new position. I didn't go on runs that much anymore, because I didn't have a running partner, and also Tübingen isn't as much fun to run in as Cambridge, because it's nowhere near as flat. Instead, something that's much easier to do in Tübingen is to go on a diet, because the food here is much easier to avoid, and in many ways it's of higher quality than canteen food. So I started losing weight again, until about here. Then comes maybe a bit of a psychological effect of life as a young academic: I was in a postdoc position.
I got stressed about my academic career. I realized that I had to start writing papers like crazy to catch up, so I just hunkered down and started working really hard, and I forgot about doing anything like dieting or sports for a while and again started losing control of my weight. So it went back up. Then I had a bit of a career move and found time again to go to the gym. This is a third thing I tried: lifting weights and doing workouts in the gym. You can imagine that that's not quite as calorically efficient as going for runs, but it had an effect on my body weight as well, as you can maybe see here in this blue phase. And there was even a phase in between where I tried out, for a while, to eat purely vegetarian. I've never eaten that much meat, but for a while I tried to eat an exclusively vegetarian diet.

Okay, so that's the story behind this data set. What we're now going to do is try to use the structure I just provided to you, in this very, very personal one-on-one interview, to predict how my weight might evolve into the future if I took up any of these individual activities again: if I lost control and started eating again, if I continued to go to the gym, if I started running seriously again, and so on.

Before we get into the actual modeling details, let's recap a little what just happened here. I've provided you with generative information about this data set, a causal structure for the sort of observations you make in it. What I've essentially done in doing so is extend the input domain of this function from a one-dimensional space, time, to a multivariate space where we have additional features. You could think of these as binary features: there are one, two, three, four, five different activities (running, eating too much, dieting, going to the gym, and eating vegetarian), five different dimensions along which we've moved. And all the information I've given you is essentially binary: in each of these additional dimensions of the data set, we are either at one or at zero, because I was either running or I wasn't. Of course, ideally you'd like much more information.
You would like to know exactly how many kilometers I ran on which day, how much food I actually ate; ideally, you'd like to know the calorie content of my food. Believe me, I'd like to know as well, but it's just too much work to write it all down. This is a very typical situation with data sets: you're always missing information, because no one ever writes down everything. But the more information you can collect, the better your model will be.

So what we will do now, in a moment, is try to get the information we do have into a model, into a Gaussian process regression framework, and we will try to treat this problem, as much as we can, as a natural-scientific inference problem. We're trying to learn a very simple law of nature, which is the behavior of my body under certain activities. This perhaps seems like a silly example, but this really, in my opinion, is what machine learning is: the mechanization, the trivialization, of the process of scientific inference. Applying the tools of scientific inference to everyday data sets, to massively expand the reach of scientific reasoning beyond the sort of very deep questions that generations of physicists and natural scientists have thought about, to everything in our world, to be able to predict with much more confidence and understand the complicated nature of our world much better.

So to do that, please follow me to my desk, where we can open up a Jupyter notebook.

Okay, so as always we start with a bunch of boring stuff: we load up a bunch of Python libraries. Maybe the only thing to point out, apart from the usual plotting tools, is that I'm going to be using a library to deal with time-structured data, data that has dates and times. And, just to recap this issue: we're going to be using very low-level libraries here, linear algebra libraries from NumPy and SciPy. From these I'm going to use various variants of Cholesky decompositions and Cholesky solvers, linear algebra solvers, random numbers from standard Gaussians to draw samples, and of course NumPy itself. The important thing is that we're not calling advanced machine learning libraries here. We're not calling PyTorch or TensorFlow or JAX or anything else, and the reason for that is that Gaussian inference is just that easy; we do not need more than that. In the end we're going to be computing some gradients, and one could wonder whether it might be easier to use an automatic differentiation library for this; I'll leave it to you to try that out if you want.

Okay, so we've loaded these libraries, and now of course we need to start with a little bit of I/O. I've already prepared, so that I can share the data with you later, a data set that has been, in some sense, anonymized. What I've done, and you can actually see the old code here, is take the original data and subtract the value at the beginning from it.
So that's my little secret: you don't get to know the absolute value of the weight. This isn't standardization in the usual sense, because standardization would amount to subtracting the mean rather than the initial value; you'll see what the effect of this is later on.

Then there's a little fun story: back then I actually stored my data in MATLAB files, that was a phase, I don't actually know why I did that, and the dates are stored as MATLAB dates. There's an interesting quirk in the relationship between Python and MATLAB, which is that MATLAB is one-based and Python is zero-based. The effect of this is that, to read the data in the right way, we need to take the recorded dates and subtract one year and one day from them. These are the kinds of issues you often have when working with data from someone else: they might have been using a different framework, and the only way to fix these kinds of bugs, which can be very subtle, is to actually look at your data and understand what the numbers in it mean.

Okay, I've already fixed that for us. So now we have pairs of x's and y's; they're actually both univariate. X is the time coordinate, Y is the weight measured on that day, and we'll store the number of these observations.

And then there is an interesting point, which is that of course all measurements have errors; any physical measurement is only accurate up to a certain precision. This error will show up as a concrete quantity in our Gaussian inference framework: the noise variance, or rather the noise standard deviation of the measurements, which I will call sigma. This measurement error, of course, has units, the same units as the observations y. The observations y are measured in kilograms, so the measurement error is also measured in kilograms, and here I will set it to a hundred grams, 0.1 kilograms. Now, I don't just do this randomly, without having thought about it. I actually know this value, because I use a scale to weigh myself, and that scale actually says on the back that it is accurate up to a hundred grams. Again, we're doing physical modeling here, and physical quantities matter. So if there is a quantity in your model that you actually can know, like this one, it's a good idea not to try to infer it. We shouldn't be uncertain about things we can know, because knowing more of the variables drastically simplifies the inference: what we know, we don't have to track as latent variables.
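For concreteness, here is a minimal sketch of the kind of loading code this amounts to. The variable names are illustrative assumptions, not the notebook's actual code; the 366-day shift is the one-year-and-one-day offset just described:

```python
import numpy as np
from datetime import datetime, timedelta

def matlab2datetime(matlab_datenum):
    # MATLAB datenums count days from year 0; Python ordinals from year 1.
    # Subtracting 366 days (one year and one day) converts between the two.
    days = int(matlab_datenum)
    frac = matlab_datenum - days  # fractional part: time of day
    return datetime.fromordinal(days - 366) + timedelta(days=frac)

sigma = 0.1  # measurement noise std. dev. in kg, from the scale's spec sheet
```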
Okay, now we're going to do a little bit of plotting as well. To prepare for that, I'm first going to transform all the input variables X, which are stored as floating point numbers representing dates, into Python datetime structures, and then create a plotting scale that ranges from the start of the data to the end of the data plus one year, plus 365 days, and put a grid with 500 points on it. This is maybe useful for those of you who are confused by how Gaussian process inference works: even though the theory is about infinite-dimensional objects, objects that are processes, in practice, of course, we always compute with finitely many representatives. Here we've decided to test, if you like, or predict, on 500 points across the data range, plus one additional year on top of that. And we create a corresponding plotting grid as well, by transforming these floating point numbers into datetime objects, so that plotting is easier. Okay, let's do that.

Now we know what the range of the data is, just to double check; it's a little bit of sanity checking. It goes from the 4th of March 2009 until the 26th of July 2013.

Okay, and now the first thing you do, and this is not just for fun, it's actually a teachable moment: the very first thing you do with your data is to plot it. Sometimes there's a little bug; let's fix that. So here we get a plot. This is the plot we just saw on the slide, which is reassuring: we now know that we're actually looking at the right data. This is what this data set looks like. It's a little bit small, but you can maybe see the dots; let me see if I can fix that. It's too small... here we go. Now you can hopefully see it on your screen as well, maybe even larger, like this.

This is important, and you wouldn't believe how many people don't do it. It sounds like a totally trivial thing; of course you look at your data, and everyone in a position like mine keeps telling you that you have to look at your data. And yet people in practice often don't. It's important, because now you know that you've actually loaded the right data. You can also check whether something has gone wrong with your data. You can check that the x-axis is right; the scale is a bit awkward here, but I can tell you that it looks good. You can check that the y-axis is right. You'll actually notice, and I can add a grid so that you can see this, that this here is zero: the data starts at zero, because that's how I've created the data set. So it doesn't have mean zero, but it has a range that goes from zero to about minus eight in total.
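To recap the grid construction at the top of this cell in code, roughly (again with illustrative names, reusing the hypothetical converter from above):

```python
# 500 prediction points spanning the data range plus one extra year (in days)
x_pred = np.linspace(X.min(), X.max() + 365, 500)
# matching datetime objects, so matplotlib puts readable dates on the x-axis
t_pred = [matlab2datetime(xi) for xi in x_pred]
```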
Okay, and now what I'd like to do first is to In a quantitative way redo this thought experiment that we just did in an abstract fashion Previously to just to to demonstrate and illustrate again that if You don't know anything further about your data Your ability to extrapolate is very limited So imagine again, I hadn't just given you this little start like this little list of personal Embarrassing details and instead just giving you this data set and give it has told you as an exercise to just extrapolate into the future Then it's very difficult to come up with a model that explains what's going on here particularly well So maybe the naive thing you might be doing actually again This is something many many people in practice actually do and we'll see that it's not a good idea is To just do Gaussian process regression with a generic kernel on this kind of data set so in previous lectures I've introduced Gaussian process regression and So you've all actually seen code like this before you know how Gaussian process regression works. We first define what a kernel is and So that's the abstract definition of a kernel function then define a concrete actual kernel this time It's the Gaussian or square exponential or radial basis function kernel that we've seen many times in previous lectures And by now you know that it's actually in many ways a bad kernel, but why not just use that? I mean any other kernel is going to give relatively similar kind of extrapolate extrapolatory behavior Let's just say Well, at least any other stationary kernel we use a kernel that is the exponential of minus the distance between input points squared divided by two times a length scale and we'll set that length scale to something Let's think about that in a moment now We will define a Gaussian process prior which it has a mean function will set the mean function to zero and We could also set it by the way to the mean of the data but let's just set it to zero to see what happens because that's what everyone always does so I'm trying to create a bit of a strawman here of the the silly things people might do in practice and Then define our kernel and for that we have to set this hyper parameter of the kernel here is the first interesting insight It's important to understand that these hyper parameters mean them something you cannot just set it to one and hope that it works Similarly to how you can't train a deep neural network without first standardizing your data So this length scale here Is the stuff that divides the inputs so the inputs are and both squared right so the inputs are points in time Right and they're measured in days so the units of this time measurement in floating point numbers is the days So if we set the length scale then by doing so we're essentially defining What's the time scale is on which we are expecting this data to vary well? We can go back up and look at this so every vertical bar is one year So clearly the scale here shouldn't be a single day because otherwise this number this line It's just gonna wiggle up and down like crazy, and it's not really going to predict anything interesting. It's just going to be more less noise so instead let's Set it to 30 days, which is roughly one month. 
Okay, I'll do that. Now we can do standard Gaussian process inference. I'm not going to walk through this code again: we need three different matrices, the covariance matrix of the training data with itself, the covariance of the prediction locations with themselves, and the covariance between the test and the training points. Then we do a little bit of linear algebra, which by now you've seen many times, and just run it; that's going to take a while, because it's actually a reasonably large data set. Then we can plot, and you'll see that you get this kind of output.

There are a few interesting things to note here. First of all, of course, we're getting exactly the kind of behavior we already feared we would get: this model is very bad at predicting into the future, because as we move away from the data, it simply returns to the prior mean. It creates a nice, beautiful plot with a bunch of samples, and everything is nice and smooth, but this isn't particularly useful for anything; it basically means the data is forgotten after a while.

But there's more interesting structure beyond that. For example, we can also see that there is more variability in the data than is predicted by this model. Remember that the measurement error is 0.1 kilograms, so it's quite small on this scale; this is 0, this is 2, 4, 6. So this deviation from the mean is not explained by measurement error. There's something else happening on top here, which is not currently explained by this model, a model that is so confident about its output in between the data points.

Another thing to talk about is the width of this sausage of uncertainty here at the end of the data. It has a width of two above and below the mean. That's because I'm plotting two standard deviations, and I've implicitly set the output scale of this kernel to one. I haven't actually told you this; I just glossed over it and said nothing about it, and that was actually a little deliberate. This is a case where there is a parameter in your model which you might not have thought about if I hadn't pointed it out; we would just have run past it without thinking about what it actually means. In fact, it shows up very specifically here. If I decide to scale the kernel, I can introduce an output scale, let's call it theta, and set it to three. We scale the kernel with it, like this, run the code, and, I'll scroll down so you can already see it, we get a plot that is simply wider. The data is basically scaled down, and we're getting a much wider posterior.
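As a minimal sketch, the inference just performed looks roughly like the following. The names X, Y, x_pred and the helper k_se are my illustrative assumptions, not the notebook's exact code; sigma is the 0.1 kg noise level from above:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def k_se(a, b, lam=30.0, theta=1.0):
    # squared-exponential kernel on scalar inputs, with length scale lam
    # (in days) and output scale theta (in kg)
    return theta ** 2 * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * lam ** 2))

kXX = k_se(X, X)            # train/train covariance
kxX = k_se(x_pred, X)       # test/train covariance
kxx = k_se(x_pred, x_pred)  # test/test covariance

G = cho_factor(kXX + sigma ** 2 * np.eye(len(X)))  # noisy Gram matrix
m_post = kxX @ cho_solve(G, Y)                     # posterior mean
V_post = kxx - kxX @ cho_solve(G, kxX.T)           # posterior covariance
two_std = 2 * np.sqrt(np.diag(V_post))             # the plotted error sausage
```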
I Could also make that number smaller not set it to three but to about to 0.1 it's just maybe a silly choice, but let's do it and Then the corresponding plot would be much much more narrow and this is of course also stupid so This is important to understand even if you believe that there are no additional parameters to your model They might be implicitly there and they might just be set to one or two zero or Depending on where they show up in your model to some kind of standard value that you haven't thought about So just by ignoring them You're not necessarily getting rid of them because they might explicitly show up in a plot like this So what you have to do to address this issue in practice is to actually look at the plot like this and Think about whether it actually represents what you wanted to represent and if it doesn't then You need to do something about that and fix that in your model Okay, so there is no gray slide here But I could go back to this one right and tell you this was the first part of of this exercise We've just seen that using blindly using a standard toolbox like in this case standard Gaussian process regression with a standard kernel Is not a good idea on real-world data about which you know something concretely because it's going to predict badly because it's going to be badly scaled and Because it's going to hide hyper parameters that you should actually choose yourself and This sounds like a totally trivial statement but you wouldn't believe how often people in practice in industrial and Scientific applications use these toolboxes in this way and then are surprised that they don't work Well and just think well, that's just how machine learning works. That's the best I can expect so if you want to fix this issue you actually have to know what your model does and that Patently means you have to understand the math and you have to understand the computations and you have to do them Right, that's what's going to make you an expert by following lectures like this one and other ones in your master class Okay, so to fix this this model basically to make it actually powerful and useful for something what We need to do is actually we can do this while looking at this screen again is To X to introduce this causal structure that I've just provided to you in this mock interview Into the model and use it to make it more powerful and that involves various tasks. So first of all We're going to need to find Ways of representing in our model these causal structures these specific let's call them lifestyle choices at various points in time in this model and Those are going to correspond to well There's two different ways to think about them one is that we could think of these as individual extra input dimensions as I mentioned before So at a particular point in time you can imagine that in this phase and in this phase There is another input variable. Let's call it gorging that is at a value of plus one here and there and Zero everywhere else and then there is a variable for running which is at plus one here and then zero everywhere else and so on Notice that this is of course imperfect, right? 
There were other days in here where I went for runs, but we don't know about them. That's a fundamental issue with modeling: you have to work with what you have. You could always wish for better data, but you're only ever going to get data of finite quality.

Once we have these features, there is an additional issue to address, one we actually just saw when we looked at this basic, pedestrian, standard Gaussian process regression model: that single smooth model didn't capture the dynamics of the data well. It was too confident in its predictions, and not only in the regions where no features apply: in basically all regions, the posterior variance was too narrow around the observations. The deviation of the data from the posterior mean was not explained by the sum of the measurement noise and the posterior variance. So, on top of the features, we also need an explanation for the stuff that goes on in the background. There is clearly structure in here that is not explained by the features. Where that structure comes from, you could have a long-winded theoretical debate about; you could come up with all sorts of explanations for these deviations, this internal structure, and a few outliers. The fact of the matter is that you can only build a finitely good model of them, because we don't have access to the true causal structure that caused all of these minor variations.

To put it into concrete terms, these are things like: I went to weddings at some points in between, and birthday parties, and barbecue parties in backyards, which caused spikes upwards; and I had the flu at some point, actually I think I had swine flu somewhere in this data set, and various other illnesses that caused little dips downwards, because I felt bad and didn't eat for a few days; things like this. These are all hidden in here, and they are not explained by the features at all. We need a model that captures them. You could think of this as measurement noise, but it's not the measurement noise of the scale; it's not a physical measurement error that occurs when I step on the scale. It's a background process that also causes deviations, and we need a way to encode this sort of missing structure into the kernel of a background Gaussian process.

So the assumption we're going to make is that the true function we're interested in is a sum of various different functions: error or noise functions in the background, which explain this kind of deviation, and the concrete causal processes, which we model with the individual features. I'm going to do this by introducing parametric features for the causal structure, and nonparametric kernels for the background processes. As I just said, you can think of these causal structures as additional input variables, or you can think of them as additional features; I've actually already started to mix up these words in what I just said. The reason for that is that, if you remember how Gaussian process regression works, the input x is only ever evaluated inside features or inside kernels. That was the original trick we used to introduce kernels in the first place. X never shows up on its own, so to speak, in our regression model; it is always first shoved into either a kernel or a feature function.
So it doesn't actually matter what the input space X is, as long as you capture it properly with features. What I'm going to do in the code, then, is not introduce additional input variables, but directly encode the features. Whether this makes for beautiful code or not, you can decide for yourself, and I very much invite you, in your homework, which is going to be related to this issue, to do this in a better way and show it off in our tutorials, maybe in the flipped classroom, and then we can talk about it.

So let's go back to our Python code and talk about what we're going to do. Here's a little bit of math to restate what I just said on the slides. We're going to assume that the function we care about, f(t), the function that actually explains the data, the generative process for the data, is a sum of a bunch of individual functions, which we treat as independent of each other, because maybe they actually are: they are causal structures in my life that are separate from each other. There will be two noise processes, which I'll call f_SE and f_W (W for Wiener; I'll tell you in a moment what they are), and then a bunch of causal functions, given by individual functions for the individual lifestyle choices. There will be five of these in total: one for running, one for eating too much, one for trying to lose weight by dieting, one for going to the gym, and one for eating vegetarian. I'll assume that they are all individual parametric functions, so each can be written as an inner product between an individual weight and a feature (written out below). In fact, the transpose here is sort of superfluous, because there will be a scalar parameter for each of these. In general, of course, there could be several parameters, but I'm going to choose a set of features such that there's only a scalar feature for every one of these functions, and we'll talk about why when I define the features.

So let's first look at the noise processes. This is the part that will have to explain what goes on in this data beyond what's explained by the features, and I will assume that there are two different things at play here. The first is a source of disturbances that is self-reverting, mean-reverting if you like, which I use to capture the kind of processes that just go up and down over daily life. For example, I tend to step on the scale every day in the morning, but I don't always do that, and of course, even in the morning, you're not always in the same state relative to your sort of running average; maybe this depends on how much water I drank the day before, and so on. Various internal states of my body keep going up and down, and that process doesn't, over time, just drift away upward or downward; it doesn't keep growing or falling.
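Written out, the additive model just described is

$$f(t) \;=\; f_{\mathrm{SE}}(t) \;+\; f_{\mathrm{W}}(t) \;+\; \sum_{i=1}^{5} w_i\,\phi_i(t), \qquad f_{\mathrm{SE}} \sim \mathcal{GP}(0, k_{\mathrm{SE}}), \quad f_{\mathrm{W}} \sim \mathcal{GP}(0, k_{\mathrm{W}}), \quad w_i \sim \mathcal{N}(0, \sigma_i^2),$$

so that, because sums of independent Gaussian processes are Gaussian processes, $f \sim \mathcal{GP}\bigl(0,\; k_{\mathrm{SE}} + k_{\mathrm{W}} + \textstyle\sum_i \sigma_i^2\,\phi_i(a)\,\phi_i(b)\bigr)$.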
It's just an internal, self-reverting part of the process. For that I'm going to use a squared exponential kernel, which we've used before; basically this kernel again, but with different parameters. This kind of smooth, interpolating kernel that just reverts back to the mean will capture the ups and downs of daily life.

Then there is an additional process, which describes situations like the ones I just described: I fall ill, I go on vacation, I go to barbecues and weddings and other things to celebrate, where I gain weight. These effects are not self-reverting: if you go to a barbecue party and eat way too much, it's not as if, two days from now, that weight is simply gone; it's actually added on. You can think of this process as a bit of a random walk through life, like Brownian motion, and that's exactly what I'm going to model. I'm going to use the Wiener-process kernel here, the kernel given by the minimum of the two inputs. We encountered it in Gaussian process lecture number nine, and we'll use it to model random drifts up or down that do not naturally decay away over time and return to the mean.

I'll add both of these, and of course each of these kernels has parameters. As you just saw up here, the squared exponential kernel has a length scale and an output scale. The Wiener kernel, because it's the minimum kernel, has only one parameter, which is the scale of the drift. In physical terms this defines the expected distance, this is called the diffusion constant in Einstein's theory, that a particle, or in this case the weight, drifts over a unit time scale. A sketch of these two background kernels follows below.
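Here is a hedged sketch of the two background kernels just described (function and parameter names are my own assumptions; t0, the Wiener process's starting time, is discussed further down):

```python
def k_smooth(a, b, lam, theta):
    # mean-reverting squared-exponential component: the ups and downs of
    # daily life, with length scale lam in days and output scale theta
    return theta ** 2 * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * lam ** 2))

def k_wiener(a, b, t0, theta_w):
    # Wiener / Brownian-motion component: k(a, b) = min(a, b) - t0, a random
    # walk that drifts and does not revert; theta_w is the diffusion scale
    return theta_w ** 2 * (np.minimum(a[:, None], b[None, :]) - t0)

def k_background(a, b, lam, theta, t0, theta_w):
    # sums of kernels are kernels: the background process is the sum of both
    return k_smooth(a, b, lam, theta) + k_wiener(a, b, t0, theta_w)
```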
Then we have to define what the features are. For that, I can already run the next piece of code. What we're going to do now is build these parametric features, and to do that we first need to know what they actually are. This cell here essentially encodes the result of the mock interview you just had with me on screen. I told you about these individual choices, and I told you when I started and ended them, by showing you the picture with the colorful labels; I've simply transferred those labels into actual numbers. So I'm telling you that I started running on the 1st of July 2009 and stopped running on the 5th of December 2009, which I think was the end of the NeurIPS conference that year, and so on. Then there are the dates on which I started and ended dieting, which I actually did twice. The gym phase lasted until the very end of this data set, because the recording ends at a point where I was still going. I ate vegetarian for roughly a month, basically just over a month. And there are two phases in which I ate too much.

So this is additional information that could have been part of your x input, but we didn't have it in the original data set; I only provided it to you after you came back to me and asked more questions. This is what you would do in an industrial or scientific setting as well: you go to your data owner, you talk with them about what else they know, and in doing so they essentially provide you with features. Then you make use of those features. Really, to be precise, they only provide you with additional information like this one; your job is to turn that additional information into features.

So how do you do that? Well, for that you have to make a concrete decision about what you think the causal effect of these choices on the data is. What I'm going to do is claim that each of these choices adds a linear function to the data. What I mean by that is: whenever I'm in a running phase like this, I lose a constant amount of weight per day, and whenever I eat too much, I gain a constant amount of weight per day. So there's a constant derivative; that's a linear feature. This is of course a questionable choice. You could argue that it's maybe not true; maybe you think that if you go on runs, then over time you at some point reach saturation, where your body doesn't actually lose more weight because it's reaching some kind of new steady state. I don't think that actually happened in this phase; maybe it happened in the gym phase. So by making a linear assumption, I'm making a small mistake.

Now, the fact of the matter is that all scientific models, even the most advanced ones, not the ones for trivial data sets like this, but for really complicated phenomena, make these kinds of simplifying assumptions, because you have to work with something, and every other choice you could make is also going to encode some kind of prior assumptions. For example, you could instead say that if you go on runs, there is some kind of decay of the decrease. But what do you actually mean by that? You have to decide on a concrete shape. Maybe you think it's an exponential decay toward some intercept; then you need the rate of the decay and the intercept, two numbers you have to set somehow; those are parameters of your features. Maybe you think there is, I don't know, some weird oscillation; then you need a period for the oscillation, which you also have to set somehow. All of this you have to choose yourself, because you're the machine learning engineer. Of course, you can take the data and use it to help you set the unknown parameters of these features, but even just by choosing which features to use, up to parameterization, you're making prior assumptions.

So what I'm going to do is assume that these are all linear functions, each with a constant slope, and one advantage of this is that I can write each of them as a linear function, of course.
I can write each one as a feature times a linear weight. We'll notice in a moment that this does not absolve us from having to set hyperparameters, but at least it's easy to do. I do this in the next cell of the code, where I define the feature functions, and again I'll leave it to you to think about whether I'm doing this particularly well in code. So I'm defining individual features. Let's say here is the feature for running; that one is relatively easy. What I need to do is define a function that maps from the input domain, which is time, to the real line, and it's going to be a linear function, but because the activity has start and end times, the function has to take those into account. Then we have to be careful to assign correct physical units of measure to these quantities; if you're using units of measure, you have to decide what the units are. I've decided to use grams per day for these features, so the weights will be interpretable as meaning: this is how many grams you lose, or gain, per day if you're doing this activity.

So here is the feature function for running. It takes an input time t; if t is before the beginning of my running phase, the feature is just zero. If t is within the phase, between when I started and ended running, it's a linear function that starts at zero and increases linearly across time from the starting date to the end date. And then, at the end of the running phase, it just becomes a constant function; it stays at its final value. This is important; you might think for yourself about other ways you could have encoded this kind of linear feature. Note that this feature does not go all the way up and then drop back down to zero at the end of the phase; it stays constant. You can think about what dropping back to zero would do to our regression model. It also starts at zero at the beginning of the running phase.
It doesn't start anywhere else; again, you can think about what effect another choice would have. Now, that one was maybe easy. Here's another feature, for dieting. Dieting has the interesting aspect that I did it twice, so there are two start dates and two end dates. What this code encodes, it's a bit nasty to look at, you can go through it slowly if you want to stop the video, is this: the feature is zero before the start of the first dieting phase, then it grows linearly until that phase's end, then it stays constant, then it starts growing again until the second phase's end, where it becomes constant again.

One other way to think about this is again in terms of an input domain. You could think of dieting as a binary choice, so that each data point, each datum, is either at zero or at one: zero during a phase in which I didn't diet, and one in a phase when I did. What these features encode is then an integral over these days, over the values of this underlying binary input dimension, day by day. At the end I divide by a thousand: the output variables are measured in kilograms, and I want these features to be interpretable in grams, so I divide by a thousand. And because this is measuring time, counting days, the total unit of measure of this feature value is grams per day. I do the same for the other choices: for overeating, for the vegetarian diet, and for going to the gym. Fine. (A sketch of such ramp features follows below.)

And now we can make a plot, so that you have an understanding of what's going on... ah, something has gone wrong, I forgot to run this cell; let's do it again and make this plot. What you see here is the data set plotted underneath, just so that you can see it, and behind it, in random colors (the colors don't mean much), the features I've defined. Here, for example, is the running feature: it starts at zero, goes up to a constant value, and then stays constant for the rest, because, according to this exercise, I never started running again. Here is the dieting feature, which has a first phase and a second phase; here is the gorging feature, which has two phases as well; and here are the gym feature and the vegetarian feature. Now notice that they all go up to some value, and of course it looks like these are not good models for the data, because they all go up. But when we now do inference, do machine learning, we're going to learn the weights for these features, which will turn them up or down, by learning positive or negative weights for them.
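As a sketch, the ramp features just described can be written like this (all names and date variables are illustrative assumptions; dates are floats in days):

```python
def ramp(t, start, end):
    # integral of a binary on/off indicator: zero before `start`, growing by
    # one per day during the activity, constant at (end - start) afterwards
    return np.clip(t - start, 0.0, end - start)

def phi_running(t):
    # single running phase; dividing by 1000 makes the learned weight
    # interpretable in grams per day while the outputs stay in kilograms
    return ramp(t, s_run, e_run) / 1000.0

def phi_diet(t):
    # dieting happened twice, so the ramps of the two phases add up
    return (ramp(t, s_diet1, e_diet1) + ramp(t, s_diet2, e_diet2)) / 1000.0
```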
So, this being our generative model if you like, the feature functions, we now have to assign a probabilistic model to this structural description. To do so, I will define a joint kernel that captures the sum of all of these features and the two noise processes. Remember, sums of kernels are kernels: the covariance of a sum of independent Gaussian process functions is itself a Gaussian process, with a kernel that is the sum of the individual kernels. So let's do that.

Here is actually what I would have liked to write: a kernel that takes inputs and hyperparameters and composes everything in a nice functional way. I had to learn that that's not an efficient way to do it in Python. You can look at it in the notebook, which I'll put on ILIAS afterwards, if you like, but it is not a computationally fast way to do this, because function abstractions don't work well in Python. So instead I've implemented another version, which is a bit more pedestrian; it doesn't expose an anonymous function properly, but it is much, much faster to evaluate.

We've already run that. What this function does, and this is really the core of our inference machine, is define the joint kernel as the sum over the covariances of the individual processes. So that's the structure of the kernel: it takes two inputs, a and b, and a bunch of hyperparameters, which define the model and which we will learn in the end. These hyperparameters have physical meaning as well; they are all the parameters that the kernels and the features have.

What are they? Let me unpack them. I will actually store them as the logarithms of their actual values, because that makes optimization easier; let's forget about that for the moment, we'll talk about it more at the end of the lecture. In this total model there will be one, two, three, ... eight different hyperparameters. Hyperparameters are quantities on which you cannot do full inference, because it's computationally too expensive; instead I'm going to optimize them by maximum likelihood, and I'll show you how.

So what are these parameters? There is a length scale for the Gaussian kernel used for the, not periodic, but mean-reverting process; it defines the scale, measured in days, on which my weight just moves up and down over time, caused by random processes that self-revert. Then there is an output scale for this kernel, the squared exponential kernel: the amount of the data that has to be explained by this mean-reverting process, relative to everything else. Then there is an output scale for the Wiener process, for the Brownian motion, the random-walk behavior. You can think of it as an output scale or as a drift constant, a diffusion constant. It's measured in, actually, sorry, kilograms, or here in grams: it's the amount of weight diffusion per unit time, in this case per day. And then there are scales, intensities, for the individual features. Why do we need those?
These scales will be multiplicative, in front of the individual features. We need them because each of these processes has a different intensity: going on runs has a different effect on my body weight than eating vegetarian, for sure; we can almost see this from the data. And because these appear in a sum, their relative scales actually matter. What they correspond to in our prior are the diagonal entries of our prior feature covariance matrix, capital Sigma. This is not going to be a scalar matrix; it has individual components, and these capture the relative strengths of the individual effects. Notice that, annoyingly, we cannot learn these in a closed-form Gaussian fashion, because they are part of the prior covariance matrix. We just have to live with the fact that we have to set them.

So, say you've set these; what does this function actually do? Here I'll be relatively superficial; it's probably best if you look at this code yourself on ILIAS. I first define the squared exponential kernel, which we've already done above, so it's just the same thing again; now I'm just a bit more careful about the units. I first define the kernel without the output scale, again as the exponential of minus the squared distance divided by two times the length scale squared, the time scale on which things move up and down. Then I divide by a thousand, so that we know what the output dimension is: grams per day. That's going to be our kernel, and at the end of this code we scale it with its relative output scale.

Then there is the Wiener process, the kernel given by the minimum of the two inputs. We used that in lecture nine; if you want to know more about it, go back there. It needs an offset, because it describes a stochastic process that has a distinct starting time, so I have to put that starting time somewhere. I'll put it at the beginning of the data set, minus one day, to make sure that even on the first day there is already a bit of flexibility in this model to start the diffusion; we're pretending that the random walk starts one day before the data does. That's fine. Why? Because I've scaled the data such that the very first datum is at zero. If I hadn't done that, this would not be the right place to start it; maybe think about why for yourself. Then I define the Wiener process as a kernel that takes its inputs a and b and returns the minimum of the two numbers, shifted by the offset, and again divides by a thousand, squared this time, because it's a variance, which gives units of measure of grams per day. That's our Wiener kernel; here we go.

This function is written so that it already takes the inputs and evaluates these kernels into matrices of the corresponding, in general rectangular, size. Then we evaluate the features; for those, we just take the inner products between the individual feature vectors, because that's the covariance contributed by each of these individual functions. Then we sum all of these up to get one kernel, and at this point we multiply each term by its individual covariance scale. (A sketch of this assembly follows below.)
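Roughly, and with all names my own assumptions rather than the notebook's, the assembly looks like this, reusing the background kernels and features sketched earlier (`features` would be a list of the five feature functions, `t0` the Wiener offset):

```python
def k_joint(a, b, log_params):
    # log_params: log of [lam, theta, theta_w, s_1, ..., s_5]; exponentiating
    # lets the optimizer work on an unconstrained scale
    lam, theta, theta_w, *feature_scales = np.exp(log_params)
    K = k_background(a, b, lam, theta, t0, theta_w)
    for phi, s in zip(features, feature_scales):
        # each feature adds a rank-one term s^2 * phi(a) phi(b)^T, with s^2 a
        # diagonal entry of the prior feature covariance matrix Sigma
        K += s ** 2 * np.outer(phi(a), phi(b))
    return K
```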
These are the scales for the individual outputs That come these are all individual hyper parameters and they define the relative size of these individual stochastic processes relative to each other and Then I'm actually also computing the rivatives of all of these. Let's call them kernels relative to each other You can try for yourself that that's actually the correct way to define these derivatives So what are these these are derivatives of the individual kernels with respect to their hyper parameters and I'm not going to do the derivation here. This is actually usually quite straightforward. So for the for the output variances is quite simple it's just you just take the Derivative of like that side this this kernel function is just that individual parameter times the kernel So you just take the derivative. That's two times the scale times the kernel now I'm going to store these individual hyper parameters as their logarithms So if you take the risk their derivative of these expressions with respect to the logarithm of this parameter We have to multiply with the derivative with respect to the parameter with the derivative of the parameter with respect to its Logarithm which amounts to just multiplying with that parameter again because you can think of the log of You can think of the parameter as e to the logarithm of the parameter and the derivative of that thing with respect to its Exponential is just the exponential function again. So there's a square here again. The only other term is the more interesting one for the Square exponential kernel there is a derivative with respect to the length scale Which is a bit more tricky and you can do that for yourself on a piece of paper if you want to so this is the point where you could Wait where you could also implement this individual expression for an automatic differentiation for in work But I'm not going to do that So what this allows us to do is now I'm going to define some first guesses for what these parameters are and We'll assume that the length scale for the mean reverting process is one day that the length scale for the the stationary The output scale for the stationary variation is something like 200 grams I did that by looking at the data a little bit and 50 grams for the diffusion process and 10 grams for each of the features because I don't know you take 10 grams per day for each of the features Because I don't really know yet how good they are if we do that we can then do essentially Gaussian process regression right or actually first Check whether this model is any good by and this is again an important point something you do at first By drawing from the prior So this is something you can do in a probabilistic model that is Basically impossible in a statistical machine learning model if you have a prior You can draw samples from the prior and use that to investigate whether your model is any good I do that by building the kernel ground matrix over all the data Adding a little bit of a nugget to make it definitely Symmetric for the symmetric positive definite multiplying with a random number adding a prior mean which I've set to zero So I'm not adding it and then just making a plot and this is what this plot looks like so this is the kind of plot you should definitely do if you can and Use it to see whether your model is good. What what do we actually do here? So what I've plotted here is I've plotted the true data set and On top three samples. 
So here, you can see this up here, I've plotted the true data set and then three random samples from the model, in the same color. One thing you should do now is look at this plot and decide for yourself whether you can still pick out the true data. Maybe you can, and the reason is that we haven't set the hyperparameters of the model quite right yet. I mean, you can clearly see which one is the data and which are the samples; the samples are these three more compressed lines. What's clearly happening here is that, first, the output scales are wrong: these stochastic processes probably have to be a little bit wider. And second, the scales of the individual features are probably wrong: the samples don't move enough in this region, or in this region, and move more than enough in this one. So the relative scales of these features are off, and we need to fix them. We'll do that in a moment.

But before we do, let's take a quick break here and understand what we just did. We defined a physical model that captures our assumptions about what's actually happening in the data; then we implemented it; and then we tested it, before even letting the model look at the data properly, by sampling from the prior and comparing the samples to the data, to get a feeling for whether we've described something meaningful. And looking at this picture, it does look like we've constructed something interesting that captures a lot of the structure, but the scales aren't quite right yet; the hyperparameters of the model aren't right yet. So what we're going to do now is hyperparameter inference, hierarchical Bayesian inference, to fix exactly this issue.

At this point you might be looking at the clock and thinking: well, we've already talked for one hour; what is he going to do in the next half hour? How is he going to finish all of this? It seems like we're only beginning to do actual work on this data set. Well, in fact, we're almost done now, because we've done all the hard work. We've defined the features; we've figured out the right physical quantities to describe our model; we've collected the data. All that's left to do is the mechanical part, the computational part: machine learning. So let's go to our code and see what we do next.

We've created our samples; all that's left to do is compute the marginal likelihood, the evidence of the model, optimize the parameters of the model to maximize this marginal likelihood, and then evaluate the posterior. Let's do these one after the other. I'm not going to go through the implementation in too much detail; it's probably best for you to just look at this code yourself and compare it with the more abstract derivations we saw in lecture nine. We're going to evaluate the log likelihood, actually minus two times the log likelihood, because that's a function we can minimize: the minus flips maximization into minimization, and the factor of two doesn't matter, it just simplifies the expressions, since we're only looking for the minimum of this function. We'll also compute gradients of it, to hand to a numerical optimizer. So let's do that.
Here I'm defining the log likelihood; I'm essentially implementing this function up here. For that we need the data and the values of the hyperparameters. Every time the optimizer evaluates this log likelihood function, we compute the corresponding kernel matrices and the derivatives of these kernel matrices with respect to their hyperparameters; then the covariance matrix of the data, which is the kernel matrix plus the noise covariance; then the log determinant (this is the log determinant of that symmetric positive definite matrix); then this expression here in the middle, the quadratic form, where we take an inner product: for that we solve once with G from the left-hand side and then multiply with y from the other side. And we compute the gradients, using the results from the corresponding slides on differentiation in lecture nine. So this is all quite easily implemented.

At this point I have implemented a gradient, and I've told you in previous flipped classrooms that whenever you implement a gradient, you have to make sure it's actually the right gradient. To do that, I've included in this notebook (even though I'm not going to run it now, because it takes too long; you can look at it later) a piece of code that compares the analytic derivative implemented in these three lines with a numerical value for the derivative, which you can approximate by a finite difference. We take the values of the hyperparameters, and for each hyperparameter we evaluate the log likelihood once to the left and once to the right, take this balanced finite difference, and divide by the distance; then we compare the result with the analytic gradient. You can look at the output: this is essentially a test of whether the analytic computation and the numerical computation agree. I'm also printing the relative distances, and you see lots of small numbers here, everything much, much less than one. So this computation is reliable; it actually computes the correct gradient. Good. That's just a unit test; of course you should always do unit tests, but we shouldn't spend too much time on it now, and it's easier for you to look at it yourself.

So let's go to this cell, where I'm loading an optimizer. This is one of these black-box methods, and for once we're not even going to talk about exactly what this optimizer does; let's just rely on a little black box. What this thing is supposed to do is minimize the negative marginal log likelihood of our model. So it's going to try to make our model predict the data as well as possible, to make the prior as close as possible to the actual data.
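A rough, self-contained sketch of such an objective, together with the finite-difference unit test, could look as follows. The function and variable names here are mine, not the notebook's, and `build_K_and_dK` is an assumed helper that returns the kernel matrix and its derivatives with respect to the log hyperparameters:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def neg2_log_likelihood(log_theta, t, y, build_K_and_dK, sigma2=1e-4):
    """-2 log p(y | log_theta) up to constants, plus its gradient.

    build_K_and_dK(t, log_theta) is assumed to return the kernel matrix K
    and a list of derivative matrices dK / dlog_theta_j.
    """
    K, dKs = build_K_and_dK(t, log_theta)
    G = K + sigma2 * np.eye(len(t))                   # covariance of the data
    chol = cho_factor(G)
    alpha = cho_solve(chol, y)                        # G^{-1} y
    logdet = 2.0 * np.sum(np.log(np.diag(chol[0])))   # log|G| via Cholesky
    value = logdet + y @ alpha                        # log det + quadratic form
    # gradient per parameter: tr(G^{-1} dG) - alpha^T dG alpha (lecture 9)
    grad = np.array([np.trace(cho_solve(chol, dK)) - alpha @ dK @ alpha
                     for dK in dKs])
    return value, grad

def relative_gradient_error(f, x, eps=1e-5):
    """Balanced finite-difference check of the analytic gradient of f."""
    _, g = f(x)
    g_num = np.empty_like(x)
    for j in range(len(x)):
        e = np.zeros_like(x)
        e[j] = eps
        g_num[j] = (f(x + e)[0] - f(x - e)[0]) / (2.0 * eps)
    return np.abs(g - g_num) / (np.abs(g) + 1e-12)    # should be << 1
```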
I'm handing this function a handle to our log likelihood function, which returns function values and gradients; an initial value for the hyperparameters, the ones we set above and drew samples from; I'm telling it that there is actually a gradient, a Jacobian, to use, so please use it; and I'm telling it to do at most 15 iterations and to display its output. I'm going to watch this run with you for a bit. It takes a little time, and at every iteration you get an output, because I've written the corresponding printing code into the piece of code above that computes the log likelihood: we see where the optimizer evaluates the hyperparameters, and it prints out the negative log likelihood. This is a number we're trying to make small rather than large.

The first evaluation the optimizer makes is at a value of 5371, so the probability of the data is on the order of e to the minus 5371 (up to our factor of two). That's a very small number, and it's exactly why we work with log likelihoods rather than likelihoods; otherwise we would get into trouble with the floating-point range. The next time the optimizer tries something, we're at a much, much better value; remember, we're trying to make this small. The optimizer does something smart: it plays around, and actually has to try a few larger values while it's doing a line search to decide how far to step, and then finds an even better value. As this thing iterates, we quickly see it find relatively good values. These are about a factor of 10, actually more like a factor of 20, smaller in log space, which makes the data vastly more likely.

While this runs, we can basically lean back and watch our machine learn. The process you're observing here is literally the machine learning: it's trying out different values for the hyperparameters of the model, looking for good ones. You can see this takes a little time, but we can do it while we're watching. Compare that with deep learning, where you might have to wait quite a bit longer, might have to play around with different optimizers, might have to sub-sample and use stochastic gradient descent, and might have to rerun the code over and over again. Here we're done after running it once, and we can rely on the optimizer doing something meaningful, because this isn't a stochastic problem; it's a deterministic one. Our optimizer tells us that it has done 15 iterations, which required 19 evaluations of the function, and that it has reached a value of 258, which is a reduction in log space of, what, a factor of 20.
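In code, this call might look roughly as follows, assuming the objective sketched above and scipy's general-purpose minimizer (the notebook's exact routine may differ):

```python
import numpy as np
from scipy.optimize import minimize

# log_theta0, t, y, build_K_and_dK as in the earlier sketches (illustrative names)
result = minimize(neg2_log_likelihood, x0=log_theta0,
                  args=(t, y, build_K_and_dK),
                  jac=True,                  # the objective returns (value, gradient)
                  method="L-BFGS-B",
                  options={"maxiter": 15, "disp": True})

log_theta_opt = result.x                     # optimal log hyperparameters
print(np.exp(log_theta_opt))                 # back to natural units (kg, days, ...)
```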
That's pretty good. Now we can ask the code which values the optimizer actually chose. These are the values it found, and these are the corresponding entries of the estimated inverse Hessian of the objective function at that point; none of that is all that important for us. The most important part is this output here: those are the values of the optimal hyperparameters. Remember that these are the logarithms of the actual hyperparameters, because that made the optimization easier, so instead let's print the exponentials of those values. Now we can look at them, recall what they all are, and interpret the numbers this thing has found. The hyperparameters were defined a bit further up; you know what, I'm just going to copy that down here so we can look at everything together.

So, the optimal hyperparameters that were found. This one is the length scale of the stationary noise process; it's actually less than a day. This one is the output scale of the mean-reverting noise process; it's on the order of hundreds of grams, about six hundred grams, half a kilogram or so, of up-and-down motion just in this noise process. This one is the scale of the diffusion process, in grams per day: we're talking about 89 grams a day of random-walk motion up and down. That seems relatively sane as well. And then these are the relative scales, in grams per day, of the individual causal processes. We see that they're no longer all the same, even though they were initialized to be the same: in particular, the one for running is quite large compared to the others, something like 29 grams per day, while eating vegetarian gets something like 0.9 grams per day, so almost zero, and so on.

But those are just the scales of the priors; they are not the actual values, the posterior predictions for the weights of these features. To get those, we now compute our Gaussian process posterior over the noise processes and the weights. We can do that relatively simply by applying the linear-algebra part of our model, basically the code we spoke about in lecture nine, and before that in lecture seven. We can now do this one step after the other.

The first thing I'm going to do is draw samples from the prior of this model, that is, the model that has been tuned so that its hyperparameters create a prior that is hopefully well suited to the data set. Let's check whether that's actually true by drawing a bunch of samples from this prior and comparing them, once again, with the true data. Somewhere hidden in the plot you see here is the true data, along with (I think it's five) other samples from the prior of our model. Now, the question to you is: do you still see the true data? Well, maybe you do, because by now you've memorized what the data looks like, and because you know that in the first phase it has to drop and not rise. But other than that, this is actually a pretty good prior distribution over potential outcomes. If I run this a few more times, you'll see there's nothing that really stands out.
There's no real reason why any one of these curves stands out over the others. That means we've done a good job: our Gaussian process prior is now a pretty good one for this data set, and we can compute a posterior. The posterior distribution is a joint distribution over all of these quantities. I do that here: I perform the actual linear-algebra part, those final five lines of the Gaussian regression code that we had in lecture seven and lecture nine, and then make a plot, and now you can see what the posterior looks like. This isn't all that interesting in itself yet, but it's already a much better model. You can see that we now have a band of posterior variance, these dashed lines, that actually contains the entire data set, so we don't have that overconfidence anymore, and we get an extrapolation that is relatively sane: it's actually constant. The reason for that is the Wiener process prior, which describes this Brownian-motion behavior into the future.

This is under the assumption that there are no further features active afterwards. Remember, the way I defined the features, I assumed that after the end of the data none of these lifestyle choices is active anymore. So what we see here is the model predicting what happens if I don't do anything: if I don't start running again, don't go to the gym, don't diet, don't overeat, nothing. This is the random-walk behavior that the model keeps predicting.

What, however, would happen if I continued to make one of these lifestyle choices? For that we have to look at the marginal posterior over the individual weights of these features. To compute it, we can use the fact that the model we've built is a sum over individual components, two noise processes and five different features, and that we want a marginal over one of these features. So we just have to project the posterior over the entire function onto the single marginal for this one quantity. The easiest way to wrap your head around this is to stare at it for a while: we compute this marginal vector over the weights. That means we compute the Gaussian process posterior almost as usual, multiplying the observations with the inverse of the kernel Gram matrix, but we take the covariance between the data and the individual components of the function, which are the feature functions scaled by their prior covariance. Again, it's easiest for you to look at this code later yourself; it wouldn't be good if I went through it line by line, because that would be tedious for some of you. So I compute the entries of this vector, and what comes out are five different scalars, over which we actually have a joint Gaussian posterior with a joint covariance matrix.
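A compact sketch of both steps, the standard posterior and the projection onto a single feature weight, might look like this. It assumes the model contains a term w_i * phi_i(t) with prior w_i ~ N(0, theta_i^2), and that `chol = cho_factor(G)` is the Cholesky factorization of the tuned data covariance G from the likelihood sketch above; all names are illustrative:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_posterior(K_zx, K_zz, y, chol):
    """Standard GP posterior at plot points z: the usual few lines of algebra."""
    mean = K_zx @ cho_solve(chol, y)
    cov = K_zz - K_zx @ cho_solve(chol, K_zx.T)
    return mean, cov

def weight_posterior(phi, theta2, y, chol):
    """Marginal posterior over one feature weight w_i ~ N(0, theta2) a priori.

    phi is the feature function evaluated at the training inputs. The prior
    cross-covariance is cov(w_i, y) = theta2 * phi, so the projection is the
    same linear algebra as above, just with this cross-covariance.
    """
    mean = theta2 * (phi @ cho_solve(chol, y))
    var = theta2 - theta2 ** 2 * (phi @ cho_solve(chol, phi))
    return mean, np.sqrt(var)
```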
I care about the individual elements, so what I'm going to do is print the means plus or minus one standard deviation for these individual effects. For that I just have a little bit of Python code that prints well-formatted sentences, and we get a, in quotation marks, scientific answer to the inference problem of how individual lifestyle choices affect my personal body weight. It turns out that if I go running the way I went running back in the day, I typically lose about 30 grams a day, plus or minus 7 grams. If I go on a diet, I lose a little less on average, about 25 grams a day, but within the error bars these two are almost the same value. You can see these are not extremely large values; they're not kilograms per day, which is not surprising, because these are actual systematic changes, not just random fluctuations. The three leftover lifestyle choices have different effects: eating too much has almost as large an effect, in the opposite direction, as going on runs or dieting; going to the gym has a relatively weak effect; and eating vegetarian has perhaps almost no effect.

That last line is actually the most interesting one. Why is this number so small, and why is the error bar on it also so small? The reason is that we have very little data for this particular effect; from these few observations it's basically impossible to infer the effect of eating a vegetarian diet or not. That's reflected in the mean prediction being basically zero: we don't know. What's interesting, though, is that the error bar is so small; the algorithm is relatively confident that this is the right explanation. Why is that? Well, up above we computed a maximum likelihood estimate for the hyperparameters of this model, and the marginal likelihood we maximized includes the Occam term we spoke about in lecture nine. The best way to make the model less complex in terms of this Occam factor is to make the variance of this particular additional component as small as possible, because it isn't needed to explain the data set. So the model is quite confident that it doesn't need this parameter to explain what's going on.

Maybe that's not the answer you're looking for. From a scientific perspective, you might want to say that you know there is an effect of eating vegetarian; you just don't know yet what its value is. To fix that, you would need a prior, that is, a penalty term in our optimization problem above that enforces that the size of this effect is non-trivial, not just zero. You can think for yourself about how you would do that; if you don't know how, ask me in the feedback or in the flipped classroom, and we can talk about it.
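If you want a hint at the shape such a penalty could take, here is one possible sketch (my own construction, not the lecture's solution): a Gaussian hyperprior on the log scale of that one component simply adds a quadratic term to the objective we minimized above, pulling the scale toward a nonzero value. The component index and the hyperprior parameters here are hypothetical:

```python
import numpy as np

def penalized_objective(log_theta, t, y, build_K_and_dK,
                        i=4, mu0=np.log(0.01), s0=0.5):
    """-2 log posterior: the likelihood term plus a hyperprior penalty.

    log_theta[i] ~ N(mu0, s0^2) a priori keeps the scale of component i
    (here, hypothetically, the vegetarian feature) away from zero.
    """
    value, grad = neg2_log_likelihood(log_theta, t, y, build_K_and_dK)
    value += ((log_theta[i] - mu0) / s0) ** 2
    grad[i] += 2.0 * (log_theta[i] - mu0) / s0 ** 2
    return value, grad
```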
So, we're almost done. The final thing I haven't shown you yet is the other two components of this stochastic process. We described this data set in terms of seven components, five of which are causal and two of which are noise processes. The noise processes are already included in the prediction in the plot above, but we can also look at them individually, just like we looked at the effects of the individual features. To do so, we can plot the posterior and prior distributions for these two noise processes separately. For that, you basically have to adapt the posterior function from above: instead of the covariance with the weights, we need the covariance with these noise processes, which is given by the individual kernels of those noise processes.

When I do that, I get a plot like this. Let me tell you what I'm plotting here: at the top is the full posterior that you just saw above, simply plotted again, and below it are plots of the two noise processes. This first one is for the mean-reverting noise process, the one that describes the ups and downs of daily life that get corrected automatically. Here I'm plotting in solid red the posterior mean, so what the model thinks actually happened in my life; in dark black posterior samples, so other possible explanations of what happened in my life; and in gold draws from the prior. Because these overlap and basically all just look like mush, like noise, that's actually a good message: it means the prior is a very good description of what's going on. This is interesting, because it means that, at least according to this model, there are no other underlying causes that are badly described by it. If there were something else more structured going on in this data set, we would expect the red posterior mean in this plot to show up clearly outside of the prior predictions. So that's good.
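As a sketch of that adaptation, reusing `gp_posterior`, `ou_kernel`, and the tuned hyperparameters from the earlier sketches (the index of the mean-reverting component, its length scale, and the plot grid `z` are assumptions):

```python
import numpy as np

z = np.linspace(0.0, 1600.0, 400)            # plot points in days
theta_ou2 = np.exp(log_theta_opt[1]) ** 2    # tuned scale of this one component

# cross- and self-covariance of the mean-reverting component alone
Ki_zx = theta_ou2 * ou_kernel(z, t, 1.0)
Ki_zz = theta_ou2 * ou_kernel(z, z, 1.0)
mean_i, cov_i = gp_posterior(Ki_zx, Ki_zz, y, chol)

# posterior draws (plotted in black) and prior draws (plotted in gold)
jitter = 1e-8 * np.eye(len(z))
post_draws = mean_i[:, None] \
    + np.linalg.cholesky(cov_i + jitter) @ np.random.randn(len(z), 3)
prior_draws = np.linalg.cholesky(Ki_zz + jitter) @ np.random.randn(len(z), 3)
```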
Let's look at the same plot for the other noise process, the one that is not mean-reverting, the random-walk noise process that is also in there. Here again we see in red the posterior mean, in black draws from the posterior distribution over this process, and in gold draws from the prior. Here you could argue that there's maybe a little bit of structure hidden in the data, insofar as this red line is a bit more regular than we might expect, but it's a very minor effect. The black posterior draws do seem a little more structured, going up and down on different timescales, than the golden draws from the prior, and this might mean there are still some individual effects in here that are not just totally random ups and downs at constant time intervals; for example, there might be some seasonal effects as well. But overall this is a very minor effect.

Having done this kind of posterior discussion, looked at the data, and done some sanity checks, these plots might convince us that what we've built here is actually a relatively expressive model for this particular data set. What we've learned from it, personally, is that if I make certain choices in my life, I can expect certain changes in my body weight over time, and that they are roughly the same regardless of whether I start to eat less or very seriously start to work out. That maybe allows me to take a better decision about how to structure my life.

At this point you might be wondering why you had to watch this perhaps overly intimate and personal analysis of my own lifestyle choices. Well, this exercise was essentially a placeholder example for a forecasting problem. What we've done here is look at a time series that evolves over time and is affected by certain causal processes, which can be chosen and followed in a particular way: I can choose to go on runs or to eat more or less, and a company could decide to run a promotion, switch suppliers, or lower or raise its prices, and try to predict whether these choices would have an effect on its sales.

To drive that point home: in 2016, a team at Amazon in Berlin led by Matthias Seeger, who is actually a Tübingen alumnus, made the interesting decision to publish a paper at NeurIPS that more or less reveals how they do their demand forecasting. Here is a large company with a large number of items in stock, not just one individual time series like in my example, but many thousands or hundreds of thousands of items, and they would like to predict how much of each item they are going to sell at particular points in time. There are underlying causal reasons why items sell. Some of them are controllable: you could decide to run a promotion, or to lower or raise prices. Others are not controllable: there may simply be seasonal fluctuations, or effects out in the world that cause certain items to become more or less popular. Whatever the case, one way or another you would like to describe these kinds of causal processes, and to do so you have to build more or less exactly the kind of model we built today for every individual item: you build a time series model and predict how it will evolve into the future. Of course, if you do this not for a single time series but for hundreds of thousands of items, you have to think a little harder about the computational details and how to implement everything efficiently.
That's what this paper is about, but the model underneath is very much the kind of model we constructed today. So what I've sneakily done, while showing you data about my own weight-tracking, is give you an introduction to how to do forecasting in general, in an industrial setting. With that we're at the end of today's lecture, and I'd like to summarize.

What we did today was a hands-on example of how to build a predictive Gaussian model that describes a particular data set. We saw first that if you just take a time series that doesn't come with further understanding of how it was created, it's usually very difficult to predict anything meaningful into the future: the space of possible explanations for the structure represented in the data is just so large that you can't expect to simply guess the right one. Unfortunately, that's something a lot of people try to do. A much more promising approach is to go and talk to the owner of the data; in this case that was myself, so we simulated this process a little. You try to extract knowledge about the underlying causal, generative process for the data and implement it in terms of parametric feature functions. We chose particularly simple feature functions today; in general, you would encode whatever you think you can encode about the data set. To do so, you have to strike a balance between what you would like to do and what the data actually allows you to do, given how much information it contains.

The remaining process is then a relatively mechanical one: you build a Gaussian process prediction model and optimize the hyperparameters of this model. You saw me do this in a quite hands-on fashion, where you keep looking at the data and keep doing unit tests and sanity checks. This process of actually looking at data and model in a meaningful way, comparing them with each other and seeing whether they work, is something that, unfortunately, very many people in industry don't do. If you want a well-working model, you have to do these kinds of sanity checks; they are essentially the unit tests of the machine learning domain.

The resulting model provided a meaningful, maybe semi-scientific, answer to the admittedly quite banal problem of predicting the dynamics of my own body weight. But that problem is very directly related to many real-world tasks in industrial and scientific settings that have real economic, scientific, even societal value: all sorts of forecasting problems, for example of demand and supply, predicting returns on stocks and other items you can buy and sell, decisions about what kind of promotions to run or which ads to show to users, and so on. The ability to build these models clearly goes beyond just using black boxes. To build such models really well, you need an understanding of the mathematical model that drives the entire process, and also a kind of hands-on grit that allows you to actually work with the data and not just wave mathematical formulas around. I tried to show you an example of how to do that today. It's obviously a very limited, simple example, geared to a compact lecture, but I hope I've given you an idea of how to build concrete probabilistic machine learning models for this kind of relatively simple, structured regression problem yourself.
With that, we're at the end. Thank you very much for your time, and see you in the next lecture.