Okay, welcome back everyone to the fourth lecture. This is the last lecture before the holiday gap, so you'll all be set free for a few weeks. I'll give you a very enjoyable homework to do over those weeks; I'll talk about that in the last slide. I have a lot to talk about today, and I'm excited about it, because there's some new material with wiggly functions at the end. That said, everything you will see today is actually a linear regression, despite the fact that there will be curves. Part of the mystery to understand is how that works.

Where were we last time? We had done our first linear regression; not the first for all of you, but your first in your born-again education in linear regression. We got a posterior distribution for the regression of height on weight, and we can take the a and b values straight from the precis output and draw a line with them, where a is the expected value of height when weight is at its average value, in this case about 155 centimeters; the average weight in this data set is somewhere around 45 kilograms. And b says that for every unit change in weight, the expected change in height is almost exactly one unit as well: 0.9. That's centimeters per kilogram, by the way.

This is insufficient, because we want to get uncertainty onto this graph. The posterior distribution is not a single line. Bayesian inference doesn't give you a point estimate; it gives you the posterior distribution, which contains an infinite number of lines, each ranked by its relative plausibility compared to all the other lines. It's calculus; wonderful. So we'd like to get some more lines onto this graph. Let me walk you through the idea of showing the uncertainty in the inference.

Here's the basic idea: we're going to sample from the posterior distribution. One reason to use sampling procedures is that they're easier to think with. But another very important one is that this procedure works for any model you ever want to fit, no matter what functions are inside it. Any model: you sample from the posterior and then push the samples back through the model itself to plot the uncertainty in its inferences. Linear regressions have analytical solutions for the compatibility intervals; lots of fancy, interesting models you'll want to fit in your life don't. This procedure works for anything, and that's why I teach it and not the special case.

We get our approximation of the posterior distribution from the quadratic approximation, assuming it's multivariate normal, and then we process the samples to effectively integrate over all the uncertainty in the posterior. You're doing calculus here, but it does not feel like calculus. If you'd like to do the calculus, you're welcome to do it for real; this is real calculus, it just doesn't feel like it. It feels so much nicer.

So when we sample from the posterior distribution, there's this function extract.samples in the rethinking package. You can look inside its guts if you want; all it does is take the quadratic approximation, treat it as a multivariate normal, and use the built-in random number generator to draw random values from that multivariate normal. You end up with a data frame.
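To make this concrete, here is a minimal sketch along the lines of the chapter's code: fit the height-on-weight model with quap and then sample from its posterior. The priors are the ones from last lecture.

```r
# Minimal sketch (R, rethinking package), along the lines of the chapter's m4.3
library(rethinking)
data(Howell1)
d2 <- Howell1[Howell1$age >= 18, ]   # adults only
xbar <- mean(d2$weight)
m4.3 <- quap(
    alist(
        height ~ dnorm(mu, sigma),
        mu <- a + b * (weight - xbar),
        a ~ dnorm(178, 20),
        b ~ dlnorm(0, 1),
        sigma ~ dunif(0, 50)
    ), data = d2)
post <- extract.samples(m4.3, n = 1e4)
head(post)   # each row is one (a, b, sigma) sample, that is, one line
```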
It looks like data, but it's not; it's random samples from the posterior distribution. Each row in this data frame is a line: the a and b values define a line, and the posterior distribution is full of lines, lots of different ones. But lines that are more plausible, lines that have more ways to produce the data you've actually seen, will show up more often in these random samples. So we can plot the rows of this thing and get a bunch of lines on our graph. They'll overlap more in the region where the plausible lines live, and from that we see the uncertainty. I'm going to show you pictures of this in a second, but that's all we're doing, and again it works for anything.

To see this work, and also to reinforce how Bayesian updating works, let's start with a reduced data set. Instead of the, what is it, roughly 350 adult individuals in the data set, let's start with 10 randomly sampled adults and fit our linear regression model to them. What does that mean? It means we get a quadratic approximation of the posterior distribution, sample from it, get some lines, and plot them. I think those are 20 lines from the posterior distribution: take the a and b values on one row, draw that line, then the next one, and the next. You'll see they're very different from the prior. You may remember the prior was all over the place; at first it was crazy, and then we made it less crazy. But now the lines are concentrated around the data the model saw, because the model learned from the data. You can also see there's a lot of scatter, because with only 10 individuals the model really isn't sure exactly where the line should be. This makes sense, and you can visualize it this way.

Let's increase the sample to 50: we add 40 more adults, update the model, and again plot 20 lines. You'll see they get more concentrated, and you'll start to see a phenomenon that is universal in regression models: the uncertainty at the ends is broader than it is at the mean. The mean of both variables is the pivot point that the regression lines pivot around, and you have a lot more certainty there. The line basically must pass through the mean of both variables; otherwise it's a terrible line. If I tell you there's an individual in this population at the average weight and ask for your best guess of their height, you should guess the average height. Any other guess is silly, I assert, and I'll let you think about that over the weekend. The model didn't know that; it had to figure it out. You could give it that information, but that's an intuition you might have. The ends, on the other hand, are more uncertain, because a tiny pivot creates a big change at the extremes. That's why you get what I call the bow tie phenomenon. Like bow tie pasta; people used to wear ties that were actually shaped like that, but now it's just pasta, right?
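Here is a sketch of that small-sample exercise, along the lines of the chapter's code; the only new moves are subsetting the data and looping over sampled lines.

```r
# Sketch: refit with only 10 adults, then plot 20 lines from the posterior
N <- 10
dN <- d2[1:N, ]
mN <- quap(
    alist(
        height ~ dnorm(mu, sigma),
        mu <- a + b * (weight - mean(weight)),
        a ~ dnorm(178, 20),
        b ~ dlnorm(0, 1),
        sigma ~ dunif(0, 50)
    ), data = dN)
post <- extract.samples(mN, n = 20)
plot(dN$weight, dN$height, xlim = range(d2$weight), ylim = range(d2$height),
     xlab = "weight", ylab = "height")
for (i in 1:20)
    curve(post$a[i] + post$b[i] * (x - mean(dN$weight)),
          add = TRUE, col = col.alpha("black", 0.3))
```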
At 150 points it gets very tight around the data, and if we use the whole adult data set of 352 individuals, very tight indeed. Conditional on wanting a line to describe these data, the lines are all in that narrow space. That doesn't mean the line is right; you constrained the model to pick out lines, and so it did, and it likes these lines very much. It hates all the other lines. The high-probability lines are all in that region. Does this make sense? So when you see in papers those confidence regions drawn around a regression line, those are just smooth, nice ways of showing the scatter of lines. You could just show the lines; it's aesthetic.

That said, I can show you how to draw those interval regions. Let me walk you through this. This section of chapter 4 moves very slowly, so I encourage you to read it afterwards and go through all the steps. I'm going to run through the recipe here; you will still have questions afterwards, but I think the chapter will answer them.

The basic idea is that at any particular value of the x-axis variable, weight in this example, there's a distribution of predictions: mu has some density. It's very confident about values in the middle, less confident about values outside. So let's pick any old value of the x-axis variable to use as an example, like 50 kilograms. We consider an individual of 50 kilograms and ask what the model expects the height of this individual to be. We get this from the equation for mu: we extract the samples and just push the samples back through mu, but with 50 fixed in the prediction equation. See it, the number 50, out there shining bright in the expression. mu at 50 is now a long list, with one value of mu for every sample from the posterior distribution, which means there's a distribution of mus now. That's what the model sees: it isn't sure how tall the 50-kilogram individual is, but it is sure the height is in the region defined by this density. So you get an expected value of mu in the middle, and you also get an interval, the compatibility interval: the range of values which are compatible with this model and these data. Make sense? Here's a sketch of that calculation.
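This assumes the m4.3 fit from earlier; dens and HPDI are rethinking helpers.

```r
# Sketch: the distribution of expected height (mu) for a 50 kg individual
post <- extract.samples(m4.3)
mu_at_50 <- post$a + post$b * (50 - xbar)   # one value of mu per posterior sample
dens(mu_at_50, col = rangi2, xlab = "mu | weight = 50")
HPDI(mu_at_50, prob = 0.89)                 # a compatibility interval for mu
```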
We don't want to do this just for 50, though; we want to do it for every x-axis value. For every one of those we can use the distribution of mu to compute an interval, and that draws the bow tie. It turns out to be the same calculation as just clustering a bunch of lines on your graph, the spaghetti approach, and this is what produces the smooth bow tie. Again, the chapter walks through this very slowly, and you'll have to sit down with your computer and do it for real. You make a sequence of x-axis values to calculate the intervals at; that's what I call weight.seq up top, running from 25 kilograms to 70 in steps of one kilogram. There's nothing special about those values: you decide how smooth you want this to be and how far you want it to go. You can make the model extrapolate out to 150 kilograms if you want; whatever you like. Then you send these in as fake data to a helper function in the rethinking package called link. And what does link do? Link calculates the value of any linear model inside a quap model, for any data you push into it, using the posterior distribution you've got for that model. There's a box in the chapter which shows you the guts of how link works, and I say a little about it on the next slide. It's not doing anything magical: it takes each value in the weight sequence, pushes it into mu, and stores the results. It just loops; that's all it does. But it's nice to have a function like link to save you a little time. That's why I wrote it; it wasn't for you, it was for me. But you're welcome to use it. What you end up with in this value mu is a matrix with a thousand rows; each row is a sample from the posterior distribution, and each column is a value of weight, so there are 46 of them. Then you can plot these up and make pretty pictures.

So this is the only thing I'll say about how link works, and again there's a very detailed box in the chapter. It samples from the posterior distribution; it takes the series of predictor values you give it (if you don't give it any, it uses the values that were in the data you passed when you fit the model); and for each of those values it sticks the value into the linear model and returns that big matrix. Our job is then to summarize that matrix. Rolling your own link takes only a few lines: we write our own function for mu, post$a plus post$b times (weight minus xbar), where weight gets passed in as any value. Then we use this cool function in R called sapply, which you might as well just call a loop. It takes each value in the first argument and passes it into the function in the second argument; the s means simplify, so it applies the function and simplifies the result. And then we use apply. I know, R is maniacal: there's sapply and there's apply, and how are they different? Don't ask that question; just keep moving. sapply simplifies the result and apply doesn't, which is what we want here. apply takes the matrix mu, goes along the second dimension, the columns, and calculates the mean of each. That gives you the average over all the posterior samples at each weight value, the central prediction. And when we pass the columns to HPDI instead, we get the highest posterior density interval at each weight value. Those are the values we plot, and that gives us the bow tie. Here's the whole recipe.
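A sketch of rolling your own link and summarizing it, assuming the m4.3 fit from before; shade is the rethinking helper that draws the interval region.

```r
# Sketch: roll your own link, then summarize into the bow tie
weight.seq <- seq(from = 25, to = 70, by = 1)   # 46 values to predict at
post <- extract.samples(m4.3)
mu.link <- function(weight) post$a + post$b * (weight - xbar)
mu <- sapply(weight.seq, mu.link)   # matrix: rows = samples, columns = weights
mu.mean <- apply(mu, 2, mean)                 # central prediction at each weight
mu.HPDI <- apply(mu, 2, HPDI, prob = 0.89)    # interval at each weight
plot(height ~ weight, data = d2, col = col.alpha(rangi2, 0.5))
lines(weight.seq, mu.mean)
shade(mu.HPDI, weight.seq)   # the light grey bow tie
```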
So think about what's going on graphically: at each weight we've got a distribution, and on the left I've just plotted those as points, so it's fuzzy; you can see the bow tie. And then, it's very hard to see projected in the room, but those of you watching at home will be able to see it: there's a grey bow tie that has the same shape as the points plotted on the left. You can almost see it if you're in the room; squint. Y'all have good eyes, can you see it? Yeah, okay, laughing faces in the room. It's there, very light grey, a very tight interval. It's the same information; it just comes from plotting the boundaries of that compatibility interval. That's all it is, and that's where the magic bow tie comes from. And this will work for any function, any shape. We're going to do it for polynomials and splines at the end of today, and you'll see the same procedure works exactly the same even when the additive model changes.

So again, there's a correspondence between the spaghetti plotting style, where you line up all your strands of spaghetti, all the different lines from the posterior distribution, plotted here for 10, 20, 50, 100, 200, and 350 individuals, so you see how the uncertainty gets narrower as you go; and the bow tie form, where it's the same information, but we computed the interval at every x-axis value and drew the shaded shape defined by those intervals. I'll oscillate back and forth a little so you can see spaghetti style and interval style. Same information, different visual presentation. Make sense?

One advantage of spaghetti style is that it makes clear there's no boundary with any meaning. One problem with these compatibility intervals is that it's easy for all of us to slip into the idea that the boundary you arbitrarily chose to draw on the graph has some scientific meaning, and it doesn't. Nothing happens at that boundary; there's a continuous change in uncertainty, and the values just inside and just outside the boundary are nearly the same. There's no magic event there. That's true of all of these intervals: probability is a continuous space, and nothing happens at the boundary.

I'm not going to spend a lot of time on this, but again there's a slow and patient section in the text: we can do the same procedure for sigma, for the prediction interval around the actual heights. What we've just done is only for the mean. There's uncertainty about where the mean is, but there's also uncertainty about the exact heights you'll observe. There's some envelope of values we expect the whole distribution of heights to fall in, and a good model will predict that range. I show you that here: that light grey region. You'll see our old bow tie in the middle, but now there's a light grey region which covers most of the actual heights, and that's what you get when you use sigma to simulate actual heights and calculate intervals from those as well. The code is in the chapter; there are no new tricks, it works exactly the same way. And there's a helper function in rethinking called sim which does the whole thing: it pushes the samples all the way onto the outcome space. There's a box that shows exactly how it works; here's a sketch.
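This continues the previous sketch, reusing weight.seq, mu.mean, and mu.HPDI; PI is the rethinking percentile-interval helper.

```r
# Sketch: the wider envelope for actual heights, using sigma via sim()
sim.height <- sim(m4.3, data = list(weight = weight.seq))
height.PI <- apply(sim.height, 2, PI, prob = 0.89)
plot(height ~ weight, data = d2, col = col.alpha(rangi2, 0.5))
lines(weight.seq, mu.mean)     # posterior mean line
shade(mu.HPDI, weight.seq)     # uncertainty in the mean (the bow tie)
shade(height.PI, weight.seq)   # uncertainty in individual heights
```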
I think the important thing for all of you now is to understand conceptually what this is doing and how it relates to the structure of the model. The coding details, if you're fuzzy on those, will come fine when you do the homework. It always does; you just have to sit down and do it. But you want to get the conceptual stuff sorted first.

Okay, that's linear regression. The funny thing about linear regression is that it's not linear; it's a maddening thing about the term. Conventionally it's used to draw lines on graphs, but really linear regression is additive. You have this equation for the mean, for mu, which is a sum of terms, each some parameter times some observed variable. It's an additive equation, and additive things are linear in mathematics; there the words are basically swappable. But for a human being, the words additive and linear are really different. So we should call these additive regressions, because you can use them to draw things that don't look like lines. That's what I want to do for you now: I want to spend the rest of today having fun drawing "lines". Sorry, people who aren't in the room can't see me doing scare quotes; anyway, you can hear the sarcasm in my voice. Lines. And curves. We're going to draw curves from lines.

First, I should say why you'd want to do this. There's no reason nature should be populated by linear relationships between two variables. Maybe over narrow ranges; that's fine as an approximation. But eventually it's pretty silly. We'll work through some data examples in a moment. We routinely have reason to think about curvilinear relationships between two variables. If you're a social scientist you're used to this, because if you study humans, age can't have a linear effect over the lifespan; we just live too long. Linear effects of age hold only over the narrowest little ranges. You don't even need a long span: in child development there are rapid changes in behavior, and linear functions don't work well even over three-year periods, say from five to eight. You need nonlinear relationships.

There are common strategies, and the two most common are what I want to introduce today. The first is polynomial regression, by far the most common. This is the familiar idea of adding a squared term to a linear regression; I'll show you the mechanics in a moment. It's extremely common, and it's also pretty bad, I think. There's nothing wrong with using it, as always, if you understand the golem you're using; then you can use it responsibly. Often polynomial regression is used irresponsibly, because of a lack of good training people have received. Again, it's never your fault; it's a vast sociological conspiracy. So I'm not trying to dissuade you from using polynomial regressions, but I want to caution you about them, and as I teach them I'll give you some reasons why I think they're badly behaved.

The second is splines. There are many different kinds of splines; I'm going to show you what are called basis splines, which are almost certainly the most common. If you use computer drawing software, you've used basis splines; drawing programs like Illustrator use Bézier curves and basis splines, and basis splines are better than Bézier curves. Splines are very flexible, much more flexible than polynomials, and they don't exhibit the common pathologies of polynomial equations, so they're often a much better choice. That said, both polynomial regressions and splines are geocentric strategies. There's nothing mechanistic about them; there's no science in the shape of the function. They're just approximations, like epicycles, right?
Ptolemaic devices to let us predict where things are. So when you receive information from your model, you have to keep that in mind and realize that there's nothing mechanistic about it, and therefore the predictions are not necessarily trustworthy when you extrapolate outside the range of the data; they can exhibit very strange behavior. If you keep that in mind, it's perfectly fine. Just like the geocentric model: if you want to build a planetarium, the geocentric model is good, really useful. But you need to use it responsibly.

Okay, what is polynomial regression? The name just rolls off the tongue. It's a descriptive strategy for drawing curved relationships between two or more variables. The idea is that there's nothing special about the line, which is a first-order polynomial. You can have second-order polynomials as well; you once learned these as parabolas. All we have to do to create one is take our x-axis variable, square it, and give the squared term its own coefficient: mu_i = alpha + beta_1 * x_i + beta_2 * x_i^2. That's the equation for a parabola you learned once upon a time. I appreciate that you may have blacked out and forgotten that part of your secondary school education, where you did geometry proofs and never did them again, trigonometry, conic sections. Sorry. It was a good time, wasn't it, a simpler part of your life. And we can keep going: there are third-order polynomials, where you add a cubic term, and fourth order, where we add, what is the fourth called, is there a fancy word? Quartic, thank you. And quintic is the fifth, and on and on. We'll push this to the level of absurdity at the beginning of chapter 7, where I have an absurd exercise you'll enjoy; I think we go all the way to sixth-order polynomials. We're not going to do that today; we'll just go up to third order.

The data we're going to use are the total sample from the !Kung height and weight data. Before, we were working with just the adults, which I've shaded in blue here; now we're going to include all the kids. Kids are people too, and we want to predict their heights from their weights. Just looking at this scatter plot, you can appreciate that this is not a line. You can fit a line to it, and we'll do that; I'll show you what it looks like. It just won't fit very well. The model will be very happy with its line; it'll be very sure about which line to use. But it'll be a terrible set of predictions. Instead let's fit a parabola, and this will be a curvilinear relationship. So how do we build this? Well, we can just take the previous model; it's a geocentric strategy.
So we just glue on an epicycle here, and we square the centered weight. Keep in mind that x here should be centered: you want to have already subtracted the mean value of x, so you have a centered variable. Why? So that alpha can be the mean, and you can set a prior for alpha that makes sense. Then you need to give the squared term a new coefficient, b2. Setting priors for these higher-order terms in polynomial regressions is really hard, and I've just made up a harmless one here. I encourage you to simulate from this prior and see what happens, and to try different values. It's hard because b2 has no biological meaning. It's the curvature of the thing, but the total curvature of the parabola depends on both b1 and b2, so neither can be interpreted in isolation, and the only way to understand what the prior means is to simulate from it and look at the predictions. This is super awkward. It absolutely is; it's one of the drawbacks of parabolic models that the parameters don't have individual meaning. The shape of the curve is determined jointly by all the parameters. It's a horrible problem of interpretation, and once you fit the model, the same thing is still true: you'll have summary statistics for these betas, and you just can't look at them and figure out what's going on. You have to plot the predictions. That's not so bad, though; for any reasonably complicated model, plotting the predictions is the only way to understand it anyway, so you should get used to it. That sounds like I'm trying to scare you off; I'm not. You just have to use it responsibly. It's like driving: driving is fine, everybody should do less of it, but you should do it responsibly.

Otherwise, it's the same model. This is a linear regression in the sense that linear means additive, but when you plot the relationship between x, which is weight here, and height, it's not going to look like a line. And as I said, the only way to understand priors like these is to simulate from them; here's a sketch.
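A minimal sketch of that prior simulation. The particular priors here are illustrative, my stand-ins for "harmless" values; the point is the procedure of drawing parameters and plotting the implied curves.

```r
# Sketch: what curves do the polynomial priors imply?
library(rethinking)
set.seed(7)
n <- 50
a  <- rnorm(n, 178, 20)    # prior for the intercept (mean height)
b1 <- rlnorm(n, 0, 1)      # prior for the linear term
b2 <- rnorm(n, 0, 1)      # illustrative "harmless" prior for the squared term
plot(NULL, xlim = c(-2, 2), ylim = c(0, 300),
     xlab = "weight (standardized)", ylab = "height (cm)")
for (i in 1:n)
    curve(a[i] + b1[i] * x + b2[i] * x^2,
          add = TRUE, col = col.alpha("black", 0.3))
```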
This is a good thing to do. Okay, so we fit these models with standardized predictors; I'm going to add one more step from before. Before, I said we want to center x so that we can interpret alpha, and you nearly always want to do that. But for the sake of just getting the machine to work, it's also very useful to standardize the predictor variables, which means you center them and then divide by the standard deviation. We take weight, subtract the average weight from each weight value, and then divide each of those zero-centered values by the standard deviation of weight. This turns weight into a set of z-scores, if that's language familiar to people here. It's nice because a value of one is then one standard deviation of weight away from the mean. The fitting software also works better on standardized variables, because it doesn't have to guess the scale of the thing. It doesn't have to deal with giant values like 200, which give it trouble searching the space. So it's nearly always a good idea; it should be your default behavior when fitting regressions, unless you have some good scientific excuse not to, which may happen.

So here's the code. The first two lines do the standardization. There's a function in R called scale which will do this for you, so you don't have to write it out, but I want to show you what it's actually doing: all scale does is subtract the mean from each weight value and then divide by the standard deviation. You end up with a new variable, weight_s, which is the standardized weight, the z-scores. Then we construct the squared version by squaring it; I know, exciting science. And then we stuff both into the quap model formula as before. I know there's a lot of extra stuff here, but really it's just one little epicycle we're gluing in, plus the data manicure we have to do to make the fitting work well. If you don't standardize, what happens? I encourage you to try this at home: leave the raw weights, don't center them, don't standardize them, and try running this model. It will fit less efficiently; it may complain a lot about not being able to find good starting values; all kinds of stuff may go wrong. It's just about rescaling things to help the fitting work well. Put together, it looks like this.
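A sketch of the standardization and the parabolic model, along the lines of the chapter's m4.5.

```r
# Sketch: standardize weight by hand, then fit the parabola
library(rethinking)
data(Howell1)
d <- Howell1   # the full sample, adults and children: 544 rows
d$weight_s  <- (d$weight - mean(d$weight)) / sd(d$weight)   # z-scores
d$weight_s2 <- d$weight_s^2                                 # the epicycle
m4.5 <- quap(
    alist(
        height ~ dnorm(mu, sigma),
        mu <- a + b1 * weight_s + b2 * weight_s2,
        a ~ dnorm(178, 20),
        b1 ~ dlnorm(0, 1),
        b2 ~ dnorm(0, 1),
        sigma ~ dunif(0, 50)
    ), data = d)
precis(m4.5)
```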
Okay, let's redo the spaghetti thing, but now the spaghetti strands are not straight; they've been cooked for 30 seconds, so they're flexible. This metaphor is really not helping; I should think these through before I use them. Now we've got parabolas. The posterior is not full of lines anymore; it's full of parabolas, an infinite number of them. Every parabola that could exist is inside the posterior distribution. Magnificent, isn't it? We sample from the posterior distribution, we get a sample of the high-probability parabolas, which are a tiny slice of that infinite space, and we draw them up.

To see what's going on, and to help you understand how these parabolic models work, let's start again with only 10 individuals. I take the first 10 individuals in the Howell data set, and they happen to all be adults, just because of the way the data set is ordered, which helps you see what happens in that case over the full range. Now we get parabolas that fan out wildly outside the range of the observed values. The parabola fits the observed adult values, which have higher weights. Notice that weight zero means the average weight in the whole sample, not zygote; it means average, because it's standardized. So all the adults are above average weight, because they're heavier than children. These parabolas end up relatively straight in the range of the data, because they need to be; the relationship is pretty linear in that range. Outside that range the functions are allowed to do anything they want, and they just flail about on the left over a huge range of scientifically impossible values. Of course, the model doesn't know that; it's fine. This is a phenomenon that's always present in polynomial regressions: outside the observed range of the data, the function can do anything it wants, and it will. It just flails about, so the uncertainty intervals at the edges of the observed data range always fan out and expand. This is a big problem with the predictions of these sorts of models: they get necessarily more uncertain at the edges of the observed range. This is not true of splines, as I'll show you when we get to them. It arises because every parameter of the polynomial, the parabola in this case, affects the shape at every point. Every parameter acts globally on the shape. It's a super frustrating thing: you can't tune one parameter of a parabola and adjust only one small region of its shape. Again, splines don't have this problem, which is why I'll teach them to you. Does this make sense?

Now let's add some more people. With 20 people, we add the next 10, and now we've got some children, and the flailing stops, because there are points in the lower range to inform the model how to behave at lower weights. Now the curves fall within a much smaller region of parameter space; we've got some nice parabolas. At 50 individuals it's getting concentrated; the model grows more and more confident that if you want a parabola, these are your parabolas. They pile up on top of one another. With 100 individuals it gets darker still, and with 300 and then all 544 individuals it's basically a thick dark band: a bunch of parabolas superimposed on one another. The model is very confident. So what is this model saying? It's saying: conditional on wanting a parabola to describe this relationship, here are your parabolas. And it's really, really happy with these parabolas. It doesn't mean a parabola is correct. Does that make sense?

To drive that point home, let me show you some other polynomials. We could do the cubic, which is on the right here, but first look at the linear one in the lower left. You know how to fit this model, and the model loves these lines. The linear model loves its lines just as much as the parabolic model loves its parabolas, because for each of these models you instructed it to consider only that shape, and it found the shapes consistent with the data, compatible with the data in that range. And there's a lot of data, so it becomes very, very sure, conditional on the shape, that these are the parameter values you want. But that is not an endorsement of the model; it's the model endorsing those lines. This is always true. It's the small world, large world distinction: the model only thinks in small-world confidence, and you have to supervise it and be critical about the overall structure, because the model will never do that for itself. It's just not responsible enough.

Yes, a comment from the room: those lines don't fit the overall data very well, but there's almost no uncertainty in where they are. Yeah, this is the funny thing: the confidence intervals on the position of that line are tiny, but it's still a bad model. By the way, the plots in these slides come from the same recipe as before, just applied to the parabola; here's a sketch.
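This reuses link, sim, and shade exactly as in the straight-line example, with the one new wrinkle that both weight_s and its square must be passed in as prediction data.

```r
# Sketch: plot the parabolic fit with mean interval and prediction interval
weight.seq <- seq(from = -2.2, to = 2, length.out = 30)   # standardized scale
pred_dat <- list(weight_s = weight.seq, weight_s2 = weight.seq^2)
mu <- link(m4.5, data = pred_dat)
mu.mean <- apply(mu, 2, mean)
mu.PI <- apply(mu, 2, PI, prob = 0.89)
sim.height <- sim(m4.5, data = pred_dat)
height.PI <- apply(sim.height, 2, PI, prob = 0.89)
plot(height ~ weight_s, d, col = col.alpha(rangi2, 0.5))
lines(weight.seq, mu.mean)
shade(mu.PI, weight.seq)       # uncertainty in the mean
shade(height.PI, weight.seq)   # uncertainty in individual heights
```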
You're familiar with the parabola. I've extended the data range out a tiny bit here, showing you the quadratic relationship, and you can see it tends to curve down up top. This is another problem with polynomial models: they cannot produce what are called monotonic relationships. What does that mean? A monotonic relationship never changes direction; it always increases or always decreases, and there are lots of things in nature like that. Yes, really old people do get shorter, but in this data set that's not what's happening. Why is the parabola curving down? Because it has to; there has to come a point where a parabola curves down. They're rainbows. Rainbows are actually circles, you just can't see the bottom; so it's a bad example, unless you're in an airplane. Has anybody ever seen a rainbow from an airplane? You can see the whole circle. So this thing must curve down someplace, it just has to, and since the downward-curving part doesn't have any data in it, it's free; it doesn't hurt the fit at all. Cubics do the same thing: they turn twice, but they have to turn. A quartic equation has to turn three times. They can never be monotonic, and this is a problem. There are other functions which are monotonic, and if you need one, let me know; I have lots in my desk drawer.

For the cubic polynomial, we add the cubic term: take the standardized x, cube it, and add it to the model with another coefficient, b3. The model fits fine, just as before. The code for this is in the book, and you should definitely run it yourself; a minimal version follows.
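A sketch along the lines of the book's cubic model; the priors on the higher-order terms are again just harmless placeholders.

```r
# Sketch: the cubic model, one more epicycle glued on
d$weight_s3 <- d$weight_s^3
m4.6 <- quap(
    alist(
        height ~ dnorm(mu, sigma),
        mu <- a + b1 * weight_s + b2 * weight_s2 + b3 * weight_s3,
        a ~ dnorm(178, 20),
        b1 ~ dlnorm(0, 1),
        b2 ~ dnorm(0, 10),
        b3 ~ dnorm(0, 10),
        sigma ~ dunif(0, 50)
    ), data = d)
```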
When we plot that up, it fits even better than before. Why? Because it can turn one more time, so now it really goes through the center of the data. But notice that it's now extrapolating upwards, probably to infinity, on the far end. Bad behavior, typical behavior of a polynomial model. That may not be a problem if you're responsible and you expect that behavior at the boundaries of the observed data, but you have to think about it.

So, to summarize what I've just told you, what I call polynomial grief. Polynomials always make absurd predictions outside the range of the data. In this example the absurdity was really outside the range, below the minimum value we had seen, but it can also happen internally to the range. When we get to chapter 7, I'll show you that if you've got a gap in your observed values and a very flexible polynomial, it'll do silly things in the gap, because again there's no data there, so it's free. I'll show you this with some hominin brain evolution data, where there's a big gap between humans and the other species, and in that gap the function is free to do anything it likes so that it can fit the data. The parameters, I think, are the bigger problem: each parameter of the polynomial affects every point on the curve. They all jointly determine the overall shape, so the model, as it tunes through the posterior, can't adjust them independently to create local fit in different regions. This is a big problem, and it's actually the reason you get the absurd predictions; it's all related. And I mentioned the monotonicity problem: polynomials aren't actually that flexible. They have to turn, and they always turn a fixed number of times, one less than their degree. A second-order polynomial turns once; that's a parabola. A third-order polynomial must turn twice; it can't avoid it. And so on and so forth. So they can do strange things.

So what to do instead? You have a number of options. For the rest of our time today, let me tell you about a common one that I think is really useful. It's just as geocentric; it's not a mechanistic model, and that's not necessarily bad. I like geocentric models; you just have to use them responsibly and not over-interpret them. This approach is also satisfying because it's born from a physical system that was previously used to do the same job. They're called splines. What is a spline? It's a very strange word. A spline is the metal bar you see in this picture of a draftsman's table. There are weights attached to the bar to bend it, so that drafters and architects could draw smooth curves in controlled ways, back when people used paper; drafters still do this, actually, and so do boat builders. Has anybody here made a boat? No? Okay. If you want to make a boat, it has to be the right shape, and a spline is great for cutting out the shapes of things. Otherwise you have to have a really good eye, and you want the boat to fit right; otherwise it leaks, and that's bad times in a boat, bring a bucket. These things still exist; you can buy them in art stores. The spline is the bar, and the weights are like anchors; we're going to call them knots. They're places where there's a potential pivot point for the shape of the spline. The splines we're going to use are called B-splines, and because they're anchored in these local places, the parameter action is local too: the parameters act locally, not globally. But you can get a globally very wiggly (that's a technical scientific term, wiggly) function by putting together a bunch of locally wiggly functions. I'll show you how this works. It's a very effective strategy, good for interpolation and for detrending things, but you have to remember that it's geocentric. It's describing relationships, not explaining them, so you have to exercise some responsibility in how you use it.

A little terminology, to give you an idea of how these splines work. I'm going to use what are called basis splines, and I'll tell you what a basis is in a moment. A basis function is a local function, and the whole spline is made by smoothly gliding between these local wiggly functions. There are a number of them, and each is called a basis function; basis you can think of as just a term for component in this mathematical context. So we're going to build a big wiggly function from smaller, less wiggly functions, and each of those smaller functions has a local parameter which describes its importance. You can tune them individually, so you don't get the wild swings that polynomials exhibit. That's what's nice about them. And if you've used computer drawing software, chances are you've used a basis spline: there are anchor points you put on a curve and drag around. Either it's a Bézier curve or it's a basis spline, and basis splines are better than Bézier curves; they can do more stuff.
They're more wiggly. Okay, one last thing: basis splines are often called B-splines, and if you have a Bayesian B-spline, it's called a P-spline. It's not my fault; I take no responsibility for this, but I have to tell you so that you recognize it. The P stands for penalized, because priors are often thought of as penalties in the non-Bayesian context. Machine learning people use models that look analytically like Bayesian models, but they call them penalized models, and that's what the P stands for: penalized spline. You can think of the penalties as priors. As for why the word penalty makes sense, wait until chapter 7; I'll explain it there.

Okay, so let's go local. Think global, act local, right? The global justice movement, yeah. So how do we do this? Again, it's a linear model: there's just an additive equation for mu. But the predictor variables we put into it are not things we've observed; they're synthetic data, quote-unquote data that we build, and they define the range in which each parameter acts. I know this sounds like madness: the actual predictor of interest will not appear in your model. Nevertheless, you will get a fantastic approximation of the relationship between the predictor of interest and the outcome. It's weird, but it works really well, and I'll explain it step by step; I left a lot of time in the lecture for this, and all the code is in the book, so you can master it.

So look at our equation for mu here. It looks just like a linear regression with a bunch of predictors: mu_i = alpha + w_1 * B_i1 + w_2 * B_i2 + w_3 * B_i3 and so on. These look like our polynomial equations, where the terms would be separate variables, and they are going to be separate variables, but we're going to make them up. We make them up in a very special way, so that they define the ranges of the x-axis variable that each parameter is relevant to. There's going to be one parameter, called a weight, for each of the B variables; B is for basis, and these are the basis functions. The weights affect predictions in the ranges defined by the B variables. The B variables just turn parameters on over different ranges of the x-axis variable, and the parameters affect the shape. I'll say that again: the B variables turn parameters on over different finite ranges of the x-axis variable, and then the parameters affect the shape. There are going to be pictures; you know me, there are going to be pictures. These basis function variables are synthetic, and they can be super wiggly or they can be linear. I'll show you a simple linear example first, and then we'll go full wiggle.

Okay, I want to use a new example; the height example is not really all that wiggly. So let's use something really wiggly, like climate data. You like climate data. Here's a data set I'm very fond of: the historical Japanese cherry blossom festival data. The date that the first cherry blossom blooms in Japan has been recorded for about 1,200 years, and I'm giving you the data to play with. There's a matching climate record, and it turns out there's a very interesting relationship between the date of first bloom and the March temperature in the same year, because the trees respond to temperature; that's how their phenology works.
There's a big signature, as you can imagine, of climate change in these data. We're just going to look at the temperature data today; later in the course we'll look at the relationship between the two variables. But when you load this data, feel free to run a linear regression of the date of first bloom on temperature. I assert there will be a very strong relationship, and you will understand why. We have cherry blossoms here in Germany as well, so they're a familiar thing, but they really have a lot of them in Japan.

What do these data look like? Load the data set; we're just going to look at the temperature record, as I said. We've got 1,215 observations, which are years; the earliest is a little after the year 800. There are a few gaps early on, because there were fires and records were lost. Some of the temperatures in the earlier period are reconstructed from things like tree rings, so they're approximations; eventually they're real recordings from the devices we now call thermometers. It's a fantastic data set. You can see it wiggles a lot. I've drawn the temperature trend with some transparency because there's some overlap: sometimes the temperature is more constant and you get less movement, and other times it moves fast. And you'll recognize the end of this trend; everybody knows what's going on at the end part there.

Our goal is to detrend this temperature record, meaning we want to fit a spline to it, so that we have some description of the average trajectory and can then look at micro-deviations and what they do. That's the general idea. You can do lots of things with this shape, but if you want to ask about the impact of shorter-term phenomena, you need a trend at a certain scale to compare them against, and there's just lots of wiggling. So let's work on getting an approximation of this at some arbitrary quality. We'll start with a terrible approximation, and then we'll make it wigglier. We'll do both approximations with a B-spline, technically a P-spline, since it's penalized by its priors. Here's a sketch of loading the data.
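This uses the cherry_blossoms data bundled with the rethinking package; since the lecture works with the temperature column, the sketch keeps only the years where temperature is recorded.

```r
# Sketch: load the cherry blossom data and keep complete temperature records
library(rethinking)
data(cherry_blossoms)
d <- cherry_blossoms
precis(d)                           # 1215 rows; temp is March temperature
d2 <- d[complete.cases(d$temp), ]   # drop years with missing temperature
```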
How do these splines work? Here's the recipe. First, you choose what are called knots. Knots are the locations of those heavy pieces of metal; they're not actually heavy pieces of metal in mathematics, they're the points where the basis functions pivot, and they determine the gaps between, and the widths of, the different basis functions. There will be pictures coming up. Then you choose a degree for the basis functions. Degree here is polynomial degree; it determines how wiggly each local function can be, but only locally now, and again there will be pictures. And then, as always in this business, since this is a Bayesian course, you find the posterior distribution of the weights. There's an infinite number of combinations of weights, and they make an infinite number of splines. You could sample from the prior and view them, but in the posterior there will be a much smaller range of weights compatible with the data, and those define the spline we'll show. Makes sense? You run this just like any old linear regression, because it is one: it's an additive predictor. It runs just like any old linear regression, but it will not look linear.

Okay, so let's do this with the Japanese temperature record. There are no cherry blossoms in this trend yet; this is just the temperature record for one locale in Japan. Let's start with a simple example where we choose just five knots at even quantiles of the data. There's a big literature about how you choose knots. One convention is to place them at evenly spaced quantiles, which is nice because it gives you more knots where there's more data; it makes sense, but there are other algorithms as well, and if you really get into splines, you should read a book about them and get some sense of how this works. Packages typically do something magical for you, determining the knots automatically. As always, my book is not that nice, and I make you own every choice; so here we're going to own this one, but you should go into the code and fiddle around with it and see how it changes things. So we'll start with these five: one at the median, an anchor at each extreme, and two evenly spaced anchors in between. These are like the heavy weights on the piece of metal, the places where the function can pivot, and they determine how many basis functions we get.

Now let's draw up the synthetic variables. Push the data off the table and just think about the construction of these variables; they're constructed over the range we want to interpolate over. Think of year as a variable: we're never going to use the actual year values again. We just use them to define the knots; the knots are anchored at year values, and we choose them that way, but we'll never use the year values again, except for plotting. And now we make some basis functions. I'm going to start with a degree-one basis, which means straight lines: we're going to construct a wiggly function which is composed of lines. Then we'll do something even wigglier, but I thought this would be the easiest place to start. By the way, R has a built-in function for making these basis functions, because this is a very old strategy; here's a sketch of constructing the basis.
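A sketch using bs() from the splines package that ships with R; note that bs() adds the boundary knots itself, so only the interior knots are passed in.

```r
# Sketch: 5 knots at even quantiles of year, degree-1 (piecewise linear) basis
library(splines)
num_knots <- 5
knot_list <- quantile(d2$year, probs = seq(0, 1, length.out = num_knots))
B <- bs(d2$year,
        knots = knot_list[-c(1, num_knots)],   # interior knots only
        degree = 1, intercept = TRUE)
dim(B)   # one row per year, one column per basis function
```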
So what is going on in this graph? When I first tried to learn about splines and came across these graphs, my reaction was WTF, the international scientific term for confusion. What you're looking at is a plot of all of those synthetic variables. There are five of them, five basis functions. These are new synthetic variables, and each of them just turns on a weight parameter over a finite range of the x-axis variable. Work with me here; let's focus on the one in red. It looks like a tent, right? That's basis function four. Basis function four has its maximum value at the fourth knot, and at the fourth knot it's the only basis function with a non-zero value. So its weight determines the position of the spline at that point. As you move away from that knot, there are two neighboring basis functions turning on, because their ranges are becoming relevant, and this is how the interpolation works. I know this looks like madness, but it makes some beautiful curves. So the fifth basis function turns on as you go towards the fifth knot, the third gets stronger as you approach the third knot, and then the second, and then the first. You've got five functions here that gradually turn on and off and overlap one another, and each has its own precious parameter that determines the value of the spline when you're in its range. There will be more pictures; you don't have to get it all right now. When we make these higher degree, they'll be curves; I'll show you a picture of that in a little bit, but I thought it would be nicer to start with lines. I hope I was correct.

I've drawn the year 1306 there to show you that at any particular point, except when you're right at a knot, two of these basis functions are relevant, so two parameters are active and determine the value of the prediction for temperature. Which two are active changes as you slide across, and this is how splines are local: the parameters act locally, but you get a globally wiggly thing of beauty. Plotting the basis is just a little loop, by the way.
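A quick sketch of the basis plot, assuming the B matrix from above.

```r
# Sketch: plot the basis functions, one curve per synthetic variable
plot(NULL, xlim = range(d2$year), ylim = c(0, 1),
     xlab = "year", ylab = "basis value")
for (i in 1:ncol(B)) lines(d2$year, B[, i])
```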
So what do we do with these things? The equation for mu is a thing you're familiar with; the only new trick here is that I'm going to use linear algebra, matrix multiplication, to deal with the long sum. You could have a lot of basis functions; we're going to have 15 in a second, and we've only got five right now. You can imagine writing out five terms: it would be w1 times the first column, the first basis function variable, which is a variable that's zero everywhere except where it's relevant and rises to one there, plus w2 times the second basis function, then w3, w4, w5. With five you could write it out, but if you have 15 you don't want to keep writing. The way I've written it here, using matrix multiplication, we just have a matrix B, like a data table, where every column is a basis function and every row is a year; it's a synthetic data frame. If we matrix-multiply B by a vector of weights w, we get the linear predictor mu. Those of you who've done some linear algebra are like, of course you do; those of you who haven't are like, WTF again, the international response of confusion. That's okay; you can learn matrix multiplication at your leisure. The important thing to understand is that it's just notational compression. There's no new math in linear algebra; it's all the same boring math you learned in secondary school, just expressed in a very convenient way that makes calculations convenient, so that if the dimensions of B and w change, you don't have to change your code. That's really nice, and it's why I want to use it here. It's actually a really common way to write linear regression models.

So this code will run in quap and it works fine, because it is a linear regression; nothing new is going on. We've got a prior for each of the weights, and if you make that prior tighter, the curve gets less wiggly. This is the penalty we've been talking about. You're going to have to have some patience with me until chapter 7, where we'll talk about those kinds of priors a lot: what they do for us, and why you don't want flat priors there. You want penalties that reduce the wiggliness so that you don't do something called overfitting. Overfitting will ruin your day; or your tomorrow, more accurately. Overfitting makes today great and tomorrow terrible.

So what happens when we plot this? In the top part of the slide I'm just repeating the equation for mu, up to the dot dot dot; you know what dot dot dot means, it just keeps going. Then I repeat the plot of the basis functions from before; those never change, their data is frozen in time. And then I show you the posterior mean weights multiplied by those basis functions, because that's what the predictor at top is: you take each basis function and multiply it by its weight. I'm focusing on the posterior mean after we run the quap model; there's uncertainty here too, and I'll show you what that looks like in a little bit. What happens now is that at any point on the horizontal axis, say at the year 1306 again, you're just adding those values together where the lines are, so two lines get added together and that determines the prediction for temperature. And here it is, the resulting spline. It's not super wiggly; I told you, it's made of lines. This is a piecewise linear approximation of the temperature trend. You might say, well, this is awful. That depends on what you want to do: if you want to detrend on a very long time scale, this is perfect, and then you can look at, say, hundred-year oscillations in temperature using known data about causes, like ocean cycles and things like that. Here's a sketch of the model and the plotting recipe.
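A sketch of the whole thing. The model structure follows the chapter's spline model; the specific priors for this temperature version are my guesses at reasonable values, so treat them as placeholders to experiment with.

```r
# Sketch: spline regression with quap, using matrix multiplication for mu
m_spline <- quap(
    alist(
        temp ~ dnorm(mu, sigma),
        mu <- a + B %*% w,
        a ~ dnorm(6, 10),   # placeholder prior: roughly mean March temperature
        w ~ dnorm(0, 1),    # tighter prior on w = stiffer, less wiggly spline
        sigma ~ dexp(1)
    ),
    data = list(temp = d2$temp, B = B),
    start = list(w = rep(0, ncol(B))))

# each basis function times its posterior mean weight
post <- extract.samples(m_spline)
w_mean <- apply(post$w, 2, mean)
plot(NULL, xlim = range(d2$year), ylim = c(-2, 2),
     xlab = "year", ylab = "basis * weight")
for (i in 1:ncol(B)) lines(d2$year, w_mean[i] * B[, i])

# the spline itself, with its compatibility interval
mu <- link(m_spline)
mu_PI <- apply(mu, 2, PI, prob = 0.89)
plot(d2$year, d2$temp, col = col.alpha(rangi2, 0.3), pch = 16)
shade(mu_PI, d2$year, col = col.alpha("black", 0.5))
```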
It's all about what you want to detrend and at what scale. So again, you add the basis-times-weight terms, because that's what the equation for mu says to do, and you get the spline at the bottom. The spline at the bottom has the uncertainty interval in it; that's why it's fat and has little bow tie shapes. That's for the full posterior; I think that's the 89% compatibility interval shown at the bottom. Does it make enough sense that you can go into the chapter, run the code, and figure it out? That's really all the lectures can do for you, I'm afraid.

Okay, let's do more wiggles. Most people think of splines as things much wigglier than this, so let's do a higher-order spline, a third-order one. Cubic splines are probably the most common splines; they're flexible, but they're not crazy. That's a good consensus choice. So here's an example with 15 knots; the more knots you have, the more times the thing can bend. I encourage you, when you go home, to play with this code. There's a place in my code where the number 15 appears; make that any number you like, smaller or bigger, and look at how it changes the prediction. You can keep rerunning the code, changing the number of knots and the degree of the basis. What you'll see, and this is nearly always true with basis splines, is that you can have too few knots to fit the curve, but at some point you've got enough knots, and there's almost no change in the curve after that. I think with these data that happens around 40; I haven't done a really systematic study, but go ahead, 40 is no problem. You can have 40 parameters, and quap can handle it. It'll do it; you might have to wait a minute or something. This is what the basis functions look like in that case: we have a bunch of local functions, and now they're wigglier, since they're cubic. But they also overlap more, so at any particular point you can have as many as four of these curves overlapping. More parameters are turned on at any particular point on the x-axis, so you can get more complicated shapes, but each parameter still only acts locally. This is the magic of it, and it's why you don't get that bizarre polynomial phenomenon. Constructing the wigglier basis is the same one-liner as before.
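The same bs() call, with more knots and degree 3; everything downstream of B is unchanged.

```r
# Sketch: the wigglier version, 15 knots and a cubic (degree 3) basis
num_knots <- 15
knot_list <- quantile(d2$year, probs = seq(0, 1, length.out = num_knots))
B <- bs(d2$year, knots = knot_list[-c(1, num_knots)],
        degree = 3, intercept = TRUE)
# refit the quap model with this new B, then plot exactly as before
```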
You can see there's a lot of scatter going on. In a second, a lot of that uncertainty is going to collapse, and this is a phenomenon you'll get used to with Bayesian models: at the parameter level, there's typically more uncertainty than there is at the prediction level. And the reason is that the parameters combine to make a prediction. So you can be uncertain about the exact value of each parameter, but be really certain about their sum. And these models typically think that way. If you ask them, okay, tell me exactly the value of weight parameter three, it's like, well, it's between this value and this value. And then you say, okay, now tell me the position of the spline at the point where that parameter is active, and it can be super confident about that. Because if you make w3, the third weight, bigger, you have to make something else smaller to keep the prediction in the range of the data, and the model is handling all that. It's just using Bayes' formula to rank relative plausibility, so it handles the mechanics of all those trade-offs for you. So when you plot the lower-level representation, that is, the parameters of the model, there's typically going to be more uncertainty than there is about the predictions of the model. It's a weird thing, but you're going to get used to it. You'll see it over and over again, and I'll remind you of this phenomenon over and over again, because I'm annoying.

Okay, I want to draw your attention: of course the trend has to go up at the end, right, because of, you know, carbon. And so you'll see what the 15th basis function ends up doing with its weight. It has a very high weight. It really pops up like a dandelion on the end over there, and you can get a sense of how these things combine. So when we add them, now we've got a much smoother, wigglier spline of the temperature trend. Now, of course, this could get wigglier still, and again, it's about your scientific purpose. How much do you want to detrend? In what scale of fluctuations are you interested scientifically? Right, because the spline will let you subtract out the main variation, and any deviations from this spline are some other phenomenon you could study with other data. Yeah, like ocean cycles or whatever. Does this make some sense?

So again, you need to go home and run this code and play around with the wiggliness: how many knots, what degree of basis functions you like. You can see how this thing responds. Change the priors too, because there's some penalty active here, and I want you to discover those things.
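And for the final figure, the smooth spline with its compatibility interval, the recipe is roughly this, again reusing the hypothetical m_spline fit and placeholder names from above:

    mu <- link( m_spline )                        # posterior samples of mu at each observed year
    mu_mean <- apply( mu , 2 , mean )
    mu_PI <- apply( mu , 2 , PI , prob = 0.89 )   # 89% compatibility interval

    plot( d2$year , d2$temp , col = col.alpha(rangi2, 0.3) , pch = 16 ,
          xlab = "year" , ylab = "temperature" )
    lines( d2$year , mu_mean )
    shade( mu_PI , d2$year , col = col.alpha("black", 0.3) )

Compare how narrow this shaded band is to the scatter in the weight samples from before: the individual parameters trade off against each other, but their sum, the prediction, stays pinned down.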
Okay, I should wrap up. I'm about a minute over here. So there are lots of different kinds of spline strategies. I just showed you the most common one, the basis spline strategy. You have to view the knots and the basis degree as choices. They're like priors, right? But they're not probability distributions. They're choices you make that define the model, and the model will not question them. Yeah, so you have to deal with that. There are ways of doing Bayesian splines where the number of knots is a parameter that you can get a posterior distribution for. I haven't shown you that, but you'll see it if you follow this literature for very long. The point of all these things is that when we get to Chapter 7, I'll be able to say something more substantive about these choices and why we'd worry about them. You know, why not just have a knot at every year, for example? Your computer can handle that, by the way. You'll end up with something like 1200 parameters, but you can do it, I guarantee you. Twelve hundred parameters sounds frightening, but your computer's like, okay, I'll get to work, right? And it just starts working through the parameters, and eventually it'll get through them all. There are other types of splines that don't require knots, and they have other strategies for placing them automatically. When we get to Chapter 14, we're going to look at another smooth approximation technique called the Gaussian process, and we'll be able to do comparisons between the two. Gaussian processes are related to phylogenetic regression, for those of you who are evolutionary biologists here, and I'll draw that connection for you when we get there.

Okay, homework. I haven't finished the homework. I made the third problem too hard, I realized this morning. So, you're welcome. I'm going to revise it, and I'll post it in the afternoon, after I've had some coffee and become a nicer person. It still needs tuning, and new homework problems are very dangerous, so I'm very cautious about these things. But it'll go up this afternoon. You've got until the second week of January to do it. Don't wait until the last minute, please. It's not super hard homework, but it's going to get more complicated, because there are more pieces now.

When you come back in the new year, I will welcome you back to many joys, including multiple regression. We will have a bunch of different predictor variables in there, and that opens up the possibility for us to look at causal inference. These models aren't causal, but we want causal inference from them, so we're going to confront that problem. We're going to look at causal graphs, things called colliders, and many other joys to come.

Question from the audience: we used a Gaussian prior for a and b, but a uniform prior for sigma. Since this is not a conjugate prior, we make some kind of error in approximating the posterior; is there a way to quantify that error? Oh yeah, that's a great question, and a complicated one. The answer is yes, absolutely. When we get to Markov chains, that would be the time to ask that question again, and then we can do the comparison. Is that good? Does that make sense? Okay. All right. Thank you all, and have a great break.