So, this is the second lecture on data assimilation methods, and the one where I want to introduce numerical methods for doing data assimilation. And so, this pairs very nicely with, again, the lesson on uncertainty propagation, where in that lecture we discovered analytical methods for propagating uncertainty: the linear tangent approach, the exact transformation of the distribution, and the analytical approach to transforming moments. And we saw when I introduced the analytical approaches to data assimilation yesterday that we could map the analytical approach to uncertainty propagation to the Kalman filter and the linear tangent approach to the extended Kalman filter. And so, just to refresh, again, we're thinking about this idea of a forecast cycle. We have some initial state of the system. We project that uncertainty into the forecast. We have some new observations, and we want to update that state in order to update our forecast. Again, the big picture idea is that we always want to be updating our forecast as new information becomes available. The implicit part of this is we want to update our forecast when new information becomes available without refitting the whole darn model to every observation we've ever had. Fundamentally, that becomes a problem that gets perpetually harder. If you have a super-simple model, you may be like, whatever, I'm just going to ignore this because I don't mind refitting the model. I know Tom Hobbs, who does some of these population assessments. He does them annually, and he's got 10 years of data, 15 years of data. He just refits the whole darn thing every year, because so what if it takes a few hours to run? It's once a year, because it's a small model with an infrequent update. By contrast, if I was trying to update something more complex with a larger volume of data, I don't want to be refitting everything. I just want to be incrementally updating as new information becomes available.
Again, we mentioned yesterday this idea that in the multivariate case, we also want to be able to borrow strength and fuse multiple sources of information. As I was saying earlier, we can map the Kalman filter and the extended Kalman filter onto those two analytical methods for propagating uncertainty. Today, I want to introduce our two numerical methods for doing iterative forecasting, or data assimilation, that map onto the two numerical methods we know for propagating uncertainty: the first being the ensemble Kalman filter, which uses an ensemble-based uncertainty propagation, and the particle filter, which uses a Monte Carlo-based uncertainty propagation. Okay, so I'm going to refresh where we left off yesterday, which was on the extended Kalman filter. We started with the simplest possible model: normal likelihood, normal process error, linear model, and tried to address that assumption of linearity in the forecast. The extended Kalman filter did that by assuming that the forecast mean would be whatever nonlinear thing comes out of our forecast model when we put in the analysis mean, which we know technically violates Jensen's inequality, and then we update the uncertainties in the forecast using our Taylor series expansion. So we need this Jacobian matrix of derivatives, which essentially gives us the instantaneous slope at that point rather than using a constant set of slopes as in a linear model. So this had two potential problems. One is that we're violating Jensen's inequality, so this mean is technically biased. The other is that we're assuming normality, but if we have something that starts normal and do a nonlinear transform on it, it can't technically be normal anymore. But as I said yesterday, those assumptions might be something you can live with or they might be fatal. It has a lot to do with the actual shape of the error in the real world and a lot to do with the degree of nonlinearity in the problem. But I'm going to jump back to my title slide.
In my mind, the contrast between the extended Kalman filter and the ensemble Kalman filter is summed up well in this quote from John Tukey, which wasn't about Kalman filters at all: an approximate answer to the right problem is worth a good deal more than the exact answer to an approximate problem. I view the extended Kalman filter as the exact, analytically tractable answer to the approximate problem. By contrast, I view the ensemble Kalman filter as an approximate answer. It's not analytically tractable; it's done through numerical sampling, so it's an approximate answer, but to what I think is the right problem. So why do I think that? First of all, I'm going to point out that in the ensemble Kalman filter, as in the extended Kalman filter, the analysis step is identical. We're going to keep the assumption of normality in the forecast, and we're going to keep the assumption of normality in the process error and the observation error, and therefore we have this conjugacy and an analytical solution to the analysis problem. But we're going to do the error propagation using an ensemble-based approach. Specifically, we're going to draw some number of samples from the analysis posterior, and then we're going to run the process model forward, adding whatever process error needs to be added to that model. And so what we get out is an ensemble of M forecasts. What you do after that is actually relatively straightforward. I have an ensemble of M forecasts. I need to plug that into an analysis step that assumes that those M forecasts follow a multivariate normal distribution. Well, how do I get from M forecasts to a multivariate normal distribution? I just use the sample mean and the sample variance. It's really that simple. When we introduced ensemble approaches to uncertainty propagation earlier, I always found them fairly conceptually straightforward.
If you can sample from something, you can transform those samples through your model, and you get samples as the output. And remember, when we transform samples, we're dealing with the nonlinearity correctly. So this is why I think we're getting the approximate answer to the right problem: we're dealing with the nonlinearities in our model as they are. The approximation comes in from the fact that we're just using the sample mean and covariance, and again, assuming normality. And the degree of approximation here improves the larger your ensemble size. If you use a really tiny ensemble size, you get a very approximate answer. If you use a larger and larger ensemble size, you get a better and better description of this prior mean and prior covariance matrix. To show that graphically with an ensemble size of five, which I would never, ever do, but it makes for a clearer picture: you have some initial conditions. You draw some discrete number of samples from your initial conditions. You simulate those forward. They give different predictions. Here, this one started high and ended up in the middle. This one started in the middle and ended up high, whatever. So that's my distribution of predictions. I fit a mean and variance to that. I then do the math of the analysis, which remember is just a weighted average between the data and the model, each weighted by its precision, to get the updated analysis. So I have an observation. I shrink towards that observation relative to its uncertainties. I sample again, I forecast again, update again, forecast again. That's really actually not that complex. So here is a simple figure comparing a Kalman filter and an ensemble Kalman filter. The Kalman filter, the solid line with dots, is the exact same figure I showed yesterday when I was introducing the Kalman filter. Here's an ensemble Kalman filter with an ensemble size of 10.
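As a minimal sketch of that sample-forecast-analyze-resample cycle: the snippet below is a univariate toy, and the process model, observation, and error variances are all made-up numbers purely for illustration, not part of any particular system.

```python
import numpy as np

rng = np.random.default_rng(42)

def enkf_cycle(ensemble, y_obs, r_obs, process_model, q_proc):
    """One forecast-analysis cycle of a univariate ensemble Kalman filter.

    ensemble : current analysis samples
    y_obs    : new observation, with observation error variance r_obs
    q_proc   : process error variance added during the forecast
    """
    # Forecast: push every ensemble member through the process model,
    # adding process error as stochastic noise.
    forecast = process_model(ensemble) + rng.normal(0, np.sqrt(q_proc), ensemble.size)
    # Summarize the ensemble by its sample mean and variance
    # (this is where the normality approximation enters).
    mu_f, p_f = forecast.mean(), forecast.var(ddof=1)
    # Analysis: precision-weighted average of forecast and observation.
    p_a = 1.0 / (1.0 / p_f + 1.0 / r_obs)
    mu_a = p_a * (mu_f / p_f + y_obs / r_obs)
    # Resample the analysis posterior to start the next cycle.
    return rng.normal(mu_a, np.sqrt(p_a), ensemble.size)

ens = rng.normal(0.0, 1.0, 500)  # initial-condition ensemble
ens = enkf_cycle(ens, y_obs=2.0, r_obs=0.5,
                 process_model=lambda x: 0.9 * x + 1.0, q_proc=0.2)
```

Note that the analysis mean lands between the forecast mean and the observation, pulled toward whichever is more precise, exactly the weighted-average logic described above.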
And this is just making the point that even with a fairly small ensemble size for a univariate problem, they give you virtually the same results. And asymptotically, as I increase the ensemble size, it will converge to the same results. And it does so without having to take derivatives. So, some of the pros and cons of the ensemble Kalman filter. My opinion and observation is that if we look at all the trade-offs across the different methods we're going to use, this is probably the most commonly used approach to data assimilation in ecology right now, because I think it balances a lot of these trade-offs fairly well. It's something that is intellectually tractable, and the implementation is usually fairly straightforward. It has computational demands, but not huge computational demands. So, pro: it deals with the nonlinearity correctly. You have no violation of Jensen's inequality. Your model can be as crazily nonlinear as you want. If your model is hugely, crazily nonlinear and you throw a whole ensemble in, the ensemble might give you spaghetti output, but that's actually the correct spaghetti output. That's just the crazy stuff your model does. The filter doesn't make any approximation to that. It uses your existing code. You don't need the Jacobian. You don't need a linear tangent version of your model. What you do need, and you need this anyway for all of the data assimilation variants, is the ability to restart your model. With a lot of the process-based models I work with, the biggest limitation for data assimilation is that you need to be able to stop the model, muck around with its current state, and then restart it under the analysis state. One thing that I'm going to mention here is that a really important test to perform when you're building a data assimilation system is what's called the hop test.
So envision the hop test like this: imagine I have time and some state, x. I start my model at a specific initial condition and run it forward. Then I try this in data assimilation mode, but without any data. I run it forward, I stop the model. I then restart the model, feeding it back its current state. It should give you the same result. If it goes off and gives you different results, your model does not have the capacity to restart correctly. So the hop test is just verifying that if you stop the model and restart it, it gives you numerically identical results to what you had previously. Failing the hop test is usually because your model didn't write out all of its state variables before it stopped, or it didn't read in all of its state variables when it restarted. With more complex models, that's really easy to do. You have some sort of latent state where you assume that when you start the model you just initialize it to zero, but when you restart the model, it needs to be initialized to what it was when you stopped. You miss one, and suddenly you can't restart the model correctly. If you're doing the hop test on a stochastic model, you also need to write out and read back in the seed on your random number generator, or else it will not be numerically identical. But it's a good test to perform, that your model doesn't give different results when you restart it, because when we do data assimilation, we're going to restart it under different conditions. I'm only mentioning it because I've yet to work with a process-based model that passes the hop test the first time. OK. But like I said, you need to be able to do that to do the Kalman filter or the extended Kalman filter or the particle filter. All of these approaches to data assimilation are based on the idea that we're stopping the model, grabbing new information, and restarting. Back to the ensemble Kalman filter. It is simple to implement.
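The hop test is easy to sketch against a toy model. Everything here (the `ToyModel` class, its state dictionary, the growth equation) is hypothetical, just to illustrate the save/restore pattern, including writing out the random-number-generator state:

```python
import numpy as np

class ToyModel:
    """Toy stochastic growth model whose full state is (x, RNG state)."""
    def __init__(self, x0, seed=0):
        self.x = x0
        self.rng = np.random.default_rng(seed)

    def step(self):
        self.x = self.x + 0.1 * self.x + self.rng.normal(0, 0.01)

    def save_state(self):
        # A restartable model must write out ALL of its state, including
        # the RNG state for stochastic models.
        return {"x": self.x, "rng": self.rng.bit_generator.state}

    def restore_state(self, s):
        self.x = s["x"]
        self.rng.bit_generator.state = s["rng"]

# Hop test: a continuous 10-step run vs. a run that stops at step 5,
# saves state, restarts, and finishes. They must match exactly.
m1 = ToyModel(1.0, seed=42)
for _ in range(10):
    m1.step()

m2 = ToyModel(1.0, seed=42)
for _ in range(5):
    m2.step()
state = m2.save_state()        # "stop" the model
m2 = ToyModel(999.0, seed=7)   # fresh instance with wrong defaults
m2.restore_state(state)        # "restart" from the saved state
for _ in range(5):
    m2.step()

assert m1.x == m2.x            # numerically identical: hop test passes
```

Drop the RNG state from `save_state` and the final assertion fails, which is exactly the failure mode described above.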
You just need to be able to sample things and run your model. It's simple to understand: I just need to calculate the ensemble mean and covariance. How big should your ensemble size be? That's really a basic power analysis question. If I have some amount of variability in my ensemble and I want some specific amount of numerical precision in my estimate of the mean and covariance, what sample size do I need to get a specific standard error? That's all it is. This is really stats 101. I have some observed variability, I want a specific standard error, so I need to solve for the sample size that gives me the standard error I want. Just do a test run to figure out how variable your ensemble is. And remember that in many ecological problems, the process variability in the model is fairly non-trivial. We have a substantial amount of variability, such that you don't necessarily need to estimate your mean down to, like, 12 decimal points if you only trust the model to one decimal point. Maybe you only need to get this down to one decimal point. So that means that you can get by with reasonable sample sizes. The con of the ensemble Kalman filter is that the computational cost is larger than the analytical methods. With analytical methods, you only run the model forward once, and the uncertainty propagation is all done analytically. One other thing that's, I think, another advantage of the ensemble Kalman filter is that it is particularly easy to add additional sources of uncertainty. If I wanted to add uncertainty in the drivers, I just sample the uncertainties in the drivers as well. If I wanted to add uncertainty in the parameters, I just sample the parameters as well. This is exactly what we did in the uncertainty propagation activity. We wanted to propagate five sources of uncertainty, so we just sampled from all five. If you're doing an ensemble Kalman filter, that's all you have to do.
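That power-analysis logic is just inverting the standard error formula, SE = sd/sqrt(M). A back-of-envelope sketch (the pilot spread of 2.0 and the target of 0.2 are made-up numbers):

```python
import math

def ensemble_size_for(se_target, ensemble_sd):
    """Smallest ensemble size M whose standard error sd/sqrt(M) <= target."""
    return math.ceil((ensemble_sd / se_target) ** 2)

# Pilot run suggests an ensemble spread (sd) of 2.0 units. If we only need
# the forecast mean pinned down to about 0.2 units, a modest ensemble does it.
m = ensemble_size_for(se_target=0.2, ensemble_sd=2.0)
print(m)  # -> 100
```

Because the required size grows with the square of the precision you demand, relaxing the target from 0.2 to 0.4 units would cut the ensemble to a quarter of that.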
If you can sample from it and put it in your ensemble, you are propagating that source of uncertainty. Done. By contrast, if I did this with the extended Kalman filter, every time I add a term, I need to add the derivatives with respect to that term. So now I don't just need the Jacobian of all the state variables and how they relate to each other in the model, but I also need all the derivatives of the forecasts relative to every input, and I need the derivatives with respect to every parameter. That can be really messy. And again, unlike the linear tangent approximation of the moments, we are good on Jensen's inequality, because we are transforming individual ensemble members, not the distribution all at once. Like with the extended Kalman filter, we are making the assumption that the forecast is normal, but we know technically it can't be. That's the assumption, but again, that's the assumption we make with all ensemble-based methods, and we have to make it in order to work with a small ensemble size, relative to the Monte Carlo methods, where you want thousands and thousands of iterations so you can draw the full posterior distribution just from the samples. If I'm working with tens to hundreds of samples, I need to make some sort of distributional assumption. The ensemble Kalman filter, in order to do the conjugate math in the analysis step, assumes that's normal. In many cases, it's not a huge deal, but it's worth checking. I will say that as you start running ensembles, it is definitely worth looking at a histogram of your ensemble projections to ask: does this look approximately normal or not? If the projections of your ensemble look bimodal and you stick them into something that assumes a normal, the mean parameter no longer represents your ensemble at all. And it's actually not hard to generate that situation.
Let's say I have some x, and the histogram of x looks like, actually, I'm going to draw it exactly how it happens in my models, where x is often biomass: everything is happy, or everything dies. "Everything dies" could be a choice of parameter combinations that are unviable, a met ensemble member that imposes some severe stress, or a driver that includes a disturbance event. So this could just be no fire versus fire, no hurricane versus hurricane. Now, if I fit a normal to that, the mean doesn't represent any of my ensemble members at all. So I'm putting in my mean forecast as something that never actually happens. Those are the sorts of cases where you go, well, that's a bad assumption, I need to do something a little bit more sophisticated. But if, on the other hand, you only had the one mode, you're like, well, sure, that's approximately normal, I can live with that. And that's the last key point here: the analysis is not hard to generalize, but it just won't have an analytical solution. It's when you're dealing with problems so big that the computation in the analysis step really has to have an analytical solution that you're stuck with these assumptions. If you look at weather forecasting, where every six hours they have to update the analysis, to put this in perspective: NOAA's current operational weather forecast system assimilates 10 terabytes of data per day. They can't MCMC their way through 10 terabytes. That's two and a half terabytes every six hours. I can't even download two and a half terabytes every six hours, let alone assimilate it. They have more computers than I do, by a lot. NOAA's next generation analysis system is being designed to assimilate 10 terabytes an hour. So anytime you think that ecologists have entered the era of big data, you talk to our brethren in the physical environmental sciences and go, oh, we're small potatoes, really.
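The fire/no-fire bimodality above is easy to mock up; the 30% disturbance fraction and the biomass values here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Disturbance ensemble: ~30% of members burn (biomass near 0.5),
# ~70% do not (biomass near 10).
burned = rng.normal(0.5, 0.2, 300)
unburned = rng.normal(10.0, 0.5, 700)
biomass = np.concatenate([burned, unburned])

# The normal approximation's mean sits between the two modes,
# a state no ensemble member is actually anywhere near.
mu = biomass.mean()
frac_near_mean = (np.abs(biomass - mu) < 1.0).mean()
print(mu, frac_near_mean)
```

Here the fitted mean lands around 7, and essentially no ensemble member falls within a unit of it, which is exactly the "mean forecast that never actually happens" problem.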
We can afford to do things the hard way for quite a while still. OK, a couple of things I wanted to introduce that are additional flavors on this theme. In the standard version of the ensemble Kalman filter, when you get that analysis posterior, you just sample from it. Someone at some point came up with the idea that there are a number of cases where completely re-sampling from it might not be exactly what I want to do. Instead of re-sampling, can I just nudge the current ensemble toward the analysis? This is actually what I visually showed earlier: I have some set of ensemble members coming in from the forecast, summarized by this distribution, and the analysis says they need to have a new mean and covariance. The standard ensemble Kalman filter just draws fresh samples from the analysis, and those have no connection to the forecast samples. Alternatively, you might say, what I want to do is just nudge the ensemble members so they stay in the same order, they represent the same quantiles, but they've shifted to the new mean and the new covariance. If this is a univariate problem, this is really easy; all of you could figure out the math for doing that. This is just the matrix version of that math. The matrix version essentially boils down to: here are the residuals between the ensemble members and the forecast mean. I'm going to standardize those to be z-scores, which is just a bit of matrix math that essentially means dividing by the standard deviation. I then take those z-scores, multiply them by the new standard deviation, add on the new mean, and get them back into the new domain. To do that, you need to be able to calculate a singular value decomposition, which is something that's in every programming and statistical language in the world.
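That divide-by-the-old-sd, multiply-by-the-new-sd logic can be sketched as a matrix transform; deterministic ensemble adjustment and square-root filters work along these lines. The function names, target mean, and covariance below are all made-up illustrations:

```python
import numpy as np

def matrix_sqrt(P):
    """Symmetric matrix square root via singular value decomposition."""
    u, s, vt = np.linalg.svd(P)
    return u @ np.diag(np.sqrt(s)) @ vt

def nudge(X, mu_a, P_a):
    """Shift a forecast ensemble to a new mean/covariance without resampling.

    X : (m, n) ensemble, one row per member. Members keep their identity
    (and any associated parameters/drivers); only mean and spread change.
    """
    mu_f = X.mean(axis=0)
    P_f = np.cov(X, rowvar=False)
    # Standardize the ensemble deviations ("z-scores"), then rescale to the
    # analysis covariance and recenter on the analysis mean.
    T = matrix_sqrt(P_a) @ np.linalg.inv(matrix_sqrt(P_f))
    return mu_a + (X - mu_f) @ T.T

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.5], [0.5, 1.0]], size=2000)
P_a = np.array([[0.5, 0.1], [0.1, 0.3]])
Xn = nudge(X, mu_a=np.array([1.0, -1.0]), P_a=P_a)
```

The transformed ensemble has exactly the analysis mean and covariance, while each member keeps its position relative to the others, which is what lets you carry parameter or driver identity along with it.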
It's in base R, it's in NumPy, it's built into everything. It's very straightforward and a bit of math. So when would you want to do this? A couple of key places. One would be if you have latent states that you did not write out. If I have 12 x's in my model, but there are actually a few more things in the model that I need to restart but am not choosing to nudge, then I might want to keep those consistent from one step to the next. The other place where it's particularly useful is when I'm propagating other uncertainties and I want to keep them associated. Imagine that part of why I got this ensemble spread is that I have one set of parameters here and a different set of parameters there. The parameter set that caused this member to be high and the parameter set that caused that member to be low, I want to keep the high one here and the low one there. That would be a reason to keep this ancillary information with each ensemble member as I move it from one step to the next. It's similar with the drivers: if I have particularly wet or dry ensemble members from my drivers, I might want to keep them associated with their previous ensemble members. Another thing that actually applies to all Kalman filter variants as well: I mentioned when I introduced the Kalman filter that the real computational cost of the Kalman filter is that matrix inversion. As the number of states in my model gets large, the size of that covariance matrix gets very large, and the math for the analytical solution of combining the two involves a number of places where you have to invert matrices, particularly that covariance matrix. The vanilla flavor for how computer scientists invert a matrix is what's called an order n cubed algorithm, meaning the time it takes to do the operation scales with the third power of the matrix dimension.
So imagine I had a matrix that was 100 by 100, and let's say that took one second to invert. If I then switch to something that's 1,000 by 1,000, I made n 10 times bigger, so now it takes 10 cubed, a thousand times, longer to do that computation. You can see, even if the starting point is a millisecond or smaller, when these matrices start getting bigger, the computational cost of inverting them ends up being really non-trivial. Enter into this the applied mathematicians, who have a number of non-vanilla algorithms for inverting matrices. But they work when matrices have very specific structures. A diagonal matrix I can invert in order n, because if the matrix is diagonal, the inverse is just 1 over everything on the diagonal. It's trivial. You wouldn't actually want a purely diagonal covariance matrix, because then we've lost that whole borrowing-strength thing. But if your matrix has, say, block diagonal structure, or basically anything where you can make the matrix sparse in very specific structures, sparse meaning you put in a lot of zeros, then you can employ more sophisticated approaches to matrix inversion. The key way we do this is an approach called localization. The idea of localization is that you assume the correlations truncate to zero beyond some distance. We know that correlations decay as a function of distance; that's implicitly generated by the models. And we just lop them off. That creates potentially a lot of zeros, often in a nice block diagonal structure. So if you imagine that your x's were different locations in space, if I truncated the correlations after a certain distance, I would end up with a more diagonal structure, which is something that I can invert more easily. That increases the computational efficiency, and it can also help us avoid spurious correlations.
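A minimal sketch of localization with a hard distance cutoff (the site coordinates and covariances are made up; in practice smooth tapers such as Gaspari-Cohn are common rather than a hard cutoff):

```python
import numpy as np

def localize(P, coords, cutoff):
    """Zero out covariances between states farther apart than `cutoff`."""
    d = np.abs(coords[:, None] - coords[None, :])  # pairwise 1-D distances
    return np.where(d <= cutoff, P, 0.0)

# Four sites along a transect; the last one is far from the others.
coords = np.array([0.0, 1.0, 2.0, 10.0])
P = np.full((4, 4), 0.5) + np.eye(4) * 0.5  # variance 1.0, covariance 0.5
P_loc = localize(P, coords, cutoff=3.0)
# The distant site keeps its variance but loses its long-range covariances,
# making the matrix sparser (here, block diagonal) and cheaper to invert.
```

The "distance" fed into `localize` need not be physical, as discussed below; any metric you trust to rule out real correlations will do.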
If I have thousands of state variables and hundreds of ensemble members, I will get strong correlations every once in a while just by chance. And if I leverage those, I can get wacky things happening. Andy Fox, who was running data assimilation at NEON before that was descoped, had one example where, for some phenological transition or some flux, there was what he strongly believes to be a spurious correlation, a strong correlation between Alaska and Florida. Data in Alaska suddenly changed the predictions for Florida very strongly, because the correlation was 0.95. And it's like, that makes no sense biologically. It just happened because, with that number of correlation coefficients being calculated, it's going to happen every once in a while. So you lop things off and say, even if I do find a correlation beyond this distance, it's probably not real. Also note that that distance doesn't need to be physical. You can drop correlations beyond some distance in whatever way you want to define distance: community similarity, some sort of environmental space, Manhattan distance, whatever. However you want to define distance, the idea is you have some rule, some metric, where you say, I'm going to start chopping things off and setting the correlations to zero. Another thing to be aware of in all forms of data assimilation, not just ensemble ones, is this idea of filter divergence. Here I've set up a simple data assimilation example; I don't even remember if this was an ensemble or a Kalman filter or whatever. But the idea is I started off two versions of the assimilation at the same starting point, same initial conditions, pretty high uncertainty, and I am assimilating the same observations. What's different between these two variants is what I've assumed about the process error. Remember, in the standard Kalman filter variants, you have to assume that process error.
In one case, I said the process error is fairly high. In the other, I said it's fairly low, so the model doesn't have much unexplained variability in it. In the model with higher process variability, the forecast follows the data. There was a regime shift in the data at time 10, and the forecast shifts to the new mean. With low process variability, the model observes data, gains information from that data, and becomes more and more precise. Then the regime shift occurs, and the model says: I'm very confident in the model, so I give it most of the weight, because its uncertainty is small and the data has higher uncertainty than the model. So the model continues to get most of the weight. Filter divergence is essentially what happens when you become falsely overconfident in your model, causing the model to stop following the data and diverge from reality. When we do the analysis, we're summing the precisions of the two, so we always get more precise, and then I'm weighting them by their relative uncertainties. What happens is that your uncertainties reach an equilibrium in many cases, because you have the precision from the data making the analysis tighter, and then the process error making it larger again. If you keep doing this cycle of the analysis pulling you in, process error pushing you out, analysis pulling you in, process error pushing you out, you essentially hit a steady state. And now you've hit a steady state where the model is falsely overconfident. This isn't just an issue of how you tuned your process error; it's largely a reflection of having ended up in a place where you are falsely overconfident in your model. There's some uncertainty about your model that you're not propagating appropriately.
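That pull-in/push-out cycle of the analysis variance can be iterated directly. This is a sketch of the variance bookkeeping alone (the process and observation error values are made-up), showing how a small assumed process error equilibrates at a tiny analysis variance, the precondition for filter divergence:

```python
def variance_cycle(p0, q, r, n=100):
    """Iterate forecast (process error q inflates the variance) and analysis
    (precisions of forecast and observation, error r, add) to steady state."""
    p = p0
    for _ in range(n):
        p_f = p + q                      # forecast: process error pushes out
        p = 1.0 / (1.0 / p_f + 1.0 / r)  # analysis: data precision pulls in
    return p

# Small assumed process error -> tiny equilibrium analysis variance:
# the model gets nearly all the weight and can ignore a regime shift.
print(variance_cycle(p0=1.0, q=0.01, r=1.0))
# Larger process error -> a healthier equilibrium that keeps tracking data.
print(variance_cycle(p0=1.0, q=0.5, r=1.0))
```

The steady state solves p = (p + q)r/(p + q + r), so it depends only on the assumed process and observation errors, not on where the uncertainty started, which is why tuning q is such a tempting (if unprincipled) knob.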
On this filter divergence, I will say, the folks I know who do data assimilation in the atmospheric sciences spend a lot of time worrying about it: this idea of the model variance collapsing to zero, or some small amount, whereby the model ignores the data and diverges from it. Their solution is to tune the process error to stop that from happening. It does stop it from happening, but it has no theoretical justification, so I label that as not necessarily the best solution to your problem: tuning knobs to get the answer you want. It's also worth noting that ecology is far less chaotic than the atmosphere. So if this happens in atmospheric science, where they have a chaotic system, it is even more prone to happen in ecological systems, where we have stabilizing feedbacks. But that also means that occasionally, convergence of your ensemble members to one answer is the right answer. If I'm trying to predict phenology, my prediction for what percentage of leaf-on we're at right now, in the middle of summer, is 100%. If I do this as an ensemble forecast, I'm fairly confident every ensemble member will say it is 100%, and it will ignore any data that says it's not, because what I'm going to see in the data right now is just sampling noise. In that case, all of my ensemble members converging to one answer is not the wrong answer, even if it causes the model to start ignoring the data. It's a problem if the model in August doesn't say, well, actually, there's some probability that we're going into an early fall. If we do go into the fall, the process model needs the ability to recreate that ensemble spread again. So convergence by itself is not a wrong answer if that convergence represents true understanding of the system and its stabilizing feedbacks. But in many other cases, it can really indicate misspecification of the process model or mispartitioning of the process error.
Very often it means that there's some source of uncertainty in the real world that you did not put in the model. This is an easy way to get that: if I take one of our models and run it in an ensemble Kalman filter with just initial condition uncertainty, I'll get filter divergence fairly easily, because I'm ignoring uncertainty in the parameters and I'm ignoring uncertainty in the drivers. If I start ignoring sources of uncertainty, then I become falsely overconfident in the model. Or, like the example I had before: if in the real world there's disturbance and in the model I'm ignoring that disturbance can occur, all of my ensemble members will converge on a nice, late-successional forest. And when I get observations that say, no, that forest just burned down, my ensemble is going to ignore it if I didn't put in that there's some probability of that disturbance occurring. On the flip side, that's actually one of the strengths of data assimilation in my mind: there are stochastic events in the real world, like disturbances, and I sure as heck want to update my model's forecast the second I observe one. So if I observe that a fire has occurred, I want the data assimilation system to say, oh no, our forecast shifts to a completely different state right now; I need to reconcile that with reality. Even if you had a model that was perfect in its representation of process, it can't be perfect in its representation of stochastic events, and so you'll always want to be assimilating those. Reiterating that process error is important, but none of our Kalman filter variants can estimate it explicitly. For example, a simple random walk state space model has priors on the process and observation variances; the Kalman filter variants don't have priors on those things because they're not treating them as unknowns. If you put priors on them and treat them as unknowns, you can do that, but you lose the conjugacy and the analytical solution.
We'll come back to that again later in the lecture, but what I want to do now is move on to thinking about: what if we forecast with a larger ensemble? What happens if we use a large enough number of ensemble members that we can approximate the distribution just with that sample? We go back to the Monte Carlo approach to propagating uncertainty, because the only real disadvantage we saw in the ensemble Kalman filter was that we were making this normality assumption, and it might not be appropriate. We don't have to: we know from MCMC that we can approximate any distribution just with samples from it. Why can't we do the same thing in data assimilation? We can eliminate these distributional assumptions, we can eliminate this normal-normal analysis, we can start dealing with estimating the process error and things like that, but how do we do it? The problem with the Monte Carlo approach to uncertainty propagation isn't in the forecast step, because in the forecast step, you just run a bigger ensemble. You run a big enough ensemble, you approximate the distribution with samples from that ensemble. The problem comes in the analysis step. My prior is now just a pile of samples. I'm not writing down a distribution, because the whole point of this was to not write down a distribution. So how do I do the analysis? How do I update the system? The idea behind what's called the particle filter is: imagine I have some set of initial conditions, I sample from it, and I run an ensemble. Here I've got a finite, small number just for visualization purposes, but remember, for a Monte Carlo approach I'm going to want to run something like 10,000 simulations, because I want to be able to approximate the probability distribution at all times. So this is my set of samples, and the particle filter essentially calls each ensemble member a particle. That's where it gets its name: it's envisioning these as particles moving through state space. I now observe some data. What do I do?
So I look at that and I go, intuitively, these ensemble members just became a lot less likely, and these ensemble members became more likely. The likelihood tells me that. How do I convey that information to these ensemble members, that I like these ones better and these ones less? The key way I tell these particles that I like this one more and that one less is I give them weights. When I start out, I sample from the prior, and since I'm sampling randomly from the prior, all samples from the prior are equally likely: they all have the same weight, weight equals one. But then I observe some data and I can say, well, this value has some likelihood; it's non-zero, but not particularly high. This one has a higher likelihood. If I had one here, it would have the highest likelihood, and these guys out here have a very, very low likelihood. So I give the values out here very low weight and the values over here much higher weight. The weight I give them is the likelihood. Okay, so now every particle has two pieces of information: its state and its weight. What do I do? How do I use that? I panic: whoa, okay, wait, this has made things harder. But then I go, okay, no, this isn't actually that bad. I know how to calculate a weighted mean. So if I want the mean of my ensemble, I just calculate a weighted mean instead of an unweighted mean. It turns out I can also calculate a weighted variance; that's not actually that hard to do. I can calculate weighted quantiles: instead of building the cumulative distribution by counting all particles equally, you accumulate their weights in order. Basically, anything I want to calculate from my posterior, I calculate the weighted version of that thing. So a confidence interval is done with a weighted quantile; mean, variance, all the weighted versions.
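To make the weighting idea concrete, here's a minimal sketch in Python. The Gaussian prior and observation here are toy numbers of my own choosing, not the lecture's example:

```python
import numpy as np

rng = np.random.default_rng(42)

# Particles: samples of the state drawn from the prior.
# Random draws from the prior, so they start with equal weight.
particles = rng.normal(loc=10.0, scale=2.0, size=10_000)

# A new observation arrives; assume Gaussian observation error (sd = 1)
y_obs, obs_sd = 12.0, 1.0

# Analysis: each particle's weight is the likelihood of the observation
# given that particle's state, normalized to sum to one
weights = np.exp(-0.5 * ((y_obs - particles) / obs_sd) ** 2)
weights /= weights.sum()

# Any posterior summary is just the weighted version of that summary
post_mean = np.sum(weights * particles)
post_var = np.sum(weights * (particles - post_mean) ** 2)
print(post_mean, post_var)  # conjugate answer for these numbers: 11.6, 0.8
```

For this normal-normal toy case we can check the weighted estimates against the known conjugate posterior, which is exactly the point: the weighted summaries recover the analysis step without ever writing down the distribution.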
So we're actually good. As long as I remember to use those weights (I can make a weighted histogram perfectly fine), I have a way of carrying that information forward. Now I make a second observation, so I get a weight from that second likelihood, and the total weight is now the product of those two weights. And if I get a third observation, I'm just continuing to accumulate weights. The guys that are consistent with the data keep getting more and more weight; the guys out in the tails keep getting lower and lower weight. So in the basic vanilla flavor of the particle filter: weights are provided by the likelihood, our posterior is our likelihood times prior, and any estimates are based on those weighted versions. In the literature, especially if you're doing this in a non-time-series model, this is often also referred to as sequential Monte Carlo. Problem: the weights can converge on one or a small set of ensemble members. If I keep doing that process, the values near the middle keep getting lots and lots of weight, and the values out in the tails keep getting less and less. That causes two related problems. One, the whole idea of a particle filter is that I'm approximating my distribution with samples from my distribution. But if I'm giving a lot of weight to a small subset of those and very little weight to the others, the effective sample size gets very small, so I'm actually doing a very bad job of approximating that histogram. If I kept doing this and gave 100% of the weight to one ensemble member and zero weight to the other 9,999, my histogram is just a point estimate, which is a really bad histogram. At the same time, those other 9,999 that have virtually no weight: I keep paying all of the cost of carrying them forward, and I'm getting almost nothing out of it.
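Accumulating weights over successive observations can be sketched as below. The numbers are again invented, and accumulating in log space is a standard numerical trick (not something specific to this lecture) to keep the running product of likelihoods from underflowing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Particles for a static state; no resampling yet, so weights just accumulate
particles = rng.normal(0.0, 2.0, size=10_000)
log_w = np.zeros_like(particles)      # log-weights start equal (log 1 = 0)

for y in [0.5, 0.7, 0.4]:             # successive observations, obs sd = 1
    # Total weight is the product of the likelihoods, i.e. a sum of logs
    log_w += -0.5 * (y - particles) ** 2

weights = np.exp(log_w - log_w.max())  # subtract max for numerical stability
weights /= weights.sum()

post_mean = np.sum(weights * particles)
```

After three observations most of the weight has already concentrated on the particles near the data, which is exactly the degeneracy problem described next.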
Like, I incur the cost of running the model and the cost of comparing the model to data, but they contribute essentially nothing to my posterior. The idea of the resampling particle filter is to deal with both of those problems, and it's exactly the intuition you'd expect: I want to resample from my posterior, and when I resample from that posterior, things that have more weight are resampled more, because I'm resampling in proportion to the weights. So things with high weights get resampled more, things with low weights get sampled less often, and things with very low weights may get dropped from the ensemble completely. Let's look at what this might look like graphically. I might start with a set of forecasts coming out of different ensemble members, and imagine this line represents the likelihood surface. These guys get low weights; these guys get higher weights. And here's my posterior in terms of that weighted histogram. Obviously it's a very bad posterior, because I didn't use nearly enough particles, but the idea is still that these guys get more weight, these get lower weights, and these get very little weight. And I can work with that, in terms of its weighted mean, weighted covariance, et cetera. Then, during the resample step, if I resample in proportion to these weights, this one with a high weight might get cloned three times, these guys with intermediate weights might get cloned twice, these guys might stay, and these guys with very low weights might just get dropped. Once I resample, all the weights get set back to one and I just have a histogram again. So I'm now approximating the posterior as a histogram; I don't have to calculate a weighted mean or weighted covariance, I just have the sample mean, sample quantiles, whatever.
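The resample step itself is tiny. A sketch with made-up states and weights:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical particle states and normalized weights from an analysis step
particles = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
weights = np.array([0.05, 0.10, 0.50, 0.25, 0.10])

# Resample with replacement in proportion to the weights: high-weight
# particles tend to get cloned, low-weight particles tend to get dropped
idx = rng.choice(len(particles), size=len(particles), p=weights)
resampled = particles[idx]

# After resampling, every particle's weight resets to equal
new_weights = np.full(len(resampled), 1.0 / len(resampled))
```

Everything is carried by `rng.choice` with the `p` argument; the reset of the weights is what turns the weighted sample back into a plain histogram.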
Now, the key for the resampling particle filter to work is that there has to be process error in the model, such that these three cloned ensemble members, when I run them forward to the next step, don't just give me the exact same answer. Because if they did, I'd still have this problem of degeneracy; I would just be doing the calculation three times to get the exact same result, and I could have saved myself the trouble. You need process error causing these ensemble members to spread back out over time and produce different predictions. And so I can go through this cycle of forecast, analysis, resample; forecast, analysis, resample. This is the same idea in a different graphic, now pointing out that this easily deals with multimodal distributions and non-normal distributions, and again reinforcing the idea that we're carrying forward both the states and the weights. So: forecast; analysis, which gives us weights; resample. Forecast again, analysis gives us weights, and then we could resample. This is the part where I will say, if you've never heard of this thing before, your brain may be going, what the hell is going on? This is so different from MCMC. It's not actually that different, but it is definitely a different algorithm than the other forms of data assimilation we've talked about. So, when to resample? There's a balancing act. If I resample too often, I lose particles simply through drift. Imagine I took a sample from my prior, there's no data, and at every step I just keep resampling from that ensemble. Each time I resample, I'll lose some ensemble members just by chance. So if I kept resampling with no information, I would still converge on one ensemble member, just by a process of drift. So it's not a given that I want to resample every single time step. On the flip side, if I don't resample enough, the filter converges on too few ensemble members.
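Putting the pieces together, the forecast-analysis-resample cycle for a toy random-walk state-space model might look like this. All the numbers, including the "observations," are invented for illustration; note that the process-error draw in the forecast step is exactly what spreads cloned particles back apart:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000
proc_sd, obs_sd = 0.5, 1.0

particles = rng.normal(0.0, 2.0, n)   # sample initial conditions
observations = [0.8, 1.1, 1.6]        # hypothetical data stream

for y in observations:
    # Forecast: random-walk step; process error re-spreads resampled clones
    particles = particles + rng.normal(0.0, proc_sd, n)

    # Analysis: weight each particle by the likelihood of the observation
    weights = np.exp(-0.5 * ((y - particles) / obs_sd) ** 2)
    weights /= weights.sum()

    # Resample in proportion to the weights; weights implicitly reset to one
    particles = particles[rng.choice(n, size=n, p=weights)]

print(particles.mean(), particles.std())
```

Because this toy model is linear and Gaussian, the particle cloud should track the Kalman-filter answer; the payoff of the particle filter is that nothing in the loop above actually relies on those assumptions.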
We get this degeneracy problem: we don't approximate the distribution well, and we're spending a lot of time on computation we don't need. There are no hard rules for when to resample. It's kind of like the jump distribution in MCMC: you tune it to get something that behaves well. But we typically do that using rules of thumb, such as resampling when the effective sample size drops below some threshold. A very common one is to resample if the effective sample size drops by more than half. And the effective sample size is not hard to calculate: if you standardize the weights so that they sum to one, then one over the sum of the squared weights gives you the effective sample size. So if every ensemble member is weighted equally, the effective sample size is just the sample size, and it gets discounted when some members carry high weight and others low. And again, a reminder: when you resample, set those weights back to one. Okay, so the pros and cons of the particle filter. The biggest con is the computation. You have to have a problem where you can maintain enough particles to keep a good approximation of your full distribution, building in the fact that you know you're going to encounter this degeneracy problem. So if I'm resampling whenever the effective sample size drops to, say, n over two, then instead of 5,000 particles I need at least 10,000, probably more. There's going to be a curse of dimensionality here, too. When you sample parameters, for example, you only get that initial sample. If you have a very high-dimensional space, you could easily be missing parts of parameter space that are actually quite likely. If you have a very wide prior and a very small ensemble, the odds that you had good ensemble members in that initial ensemble may be very low.
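The effective sample size calculation just described, as a short sketch; the N/2 threshold is the rule of thumb from the lecture, not a universal constant:

```python
import numpy as np

def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2), with weights standardized to sum to one."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

def should_resample(weights, n_particles):
    # Rule of thumb: resample when ESS drops below half the particle count
    return effective_sample_size(weights) < n_particles / 2

# Equal weights: ESS equals the actual sample size (up to float precision)
ess_equal = effective_sample_size(np.ones(1000))

# One dominant particle: ESS collapses toward 1
ess_skewed = effective_sample_size([0.97] + [0.0001] * 300)
```

With equal weights `ess_equal` comes out at 1000, while the skewed case collapses to barely more than one effective particle, which is the degeneracy the resampling trigger is watching for.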
So again, the less informative your priors, the more samples you need to explore parameter space; and the bigger the dimensionality of your problem, the more samples you need to make sure you've explored parameter space. Because all you can do in the particle filter is upweight or downweight what you already have. Pros: it's actually not hard to implement. Other than that weighting bit, it's really not that hard. It's fairly general and flexible, and it is parallelizable. All those ensemble members (and this is also true of the ensemble Kalman filter): if you have a big computer, you can put each of your model runs on different nodes or different CPUs. And you can also evaluate all of your parameters. So you can update the parameters, the process error, the observation error, all those things. There is a but, though, which is: remember that the particle filter only works because, when I resample specific values, there is process error that causes them to diverge back out, right? That's true of your states, but it's not true of your parameters. When I resample a parameter and I run my model forward, the parameter that comes out of the model isn't different from the parameter I put into the model. So because parameters lack process error, they are subject to the degeneracy that we avoid for the states by resampling. There are a couple of ways out there to try to deal with this, and they basically come down to finding an approximate way to resample the parameters, often by using some sort of kernel smoother. So just step back: when we work with particles, the same as when we work with MCMC, we are working with a discrete approximation to a joint probability distribution. So we have a set of discrete parameter values, just a list of numbers.
To be able to resample from that, we need to generate some sort of continuous approximation to that joint distribution, because we need to be able to generate parameter values that aren't in our current parameter set. That's often done by putting some sort of smoothing kernel over it, and smoothing kernels are the sorts of things we see all the time. Like, if you run an MCMC and you make a smooth density plot at the end instead of the raw histogram, that's just a kernel smoother. This is nothing more or less complicated than that. But when you do that smoothing, you have to make choices about how you're doing it, like the bandwidth. So here's a very simple example, a kind of global Gaussian smoother, which is essentially: fit a multivariate normal to the current posterior parameter distribution, and then update the parameters with some weighting between no smoothing and complete smoothing. In this example, if you set h to zero, the first term goes away and you're essentially just redrawing parameters from the multivariate normal approximation to the posterior. If you set h to one, you're essentially not redrawing anything, and your updated parameters are just your current parameters. And you can tune h to give you something in between. Even better than just tuning h to give you something reasonable is to build a formal Metropolis-Hastings accept-reject step into these proposed moves. So you might, for example, split every particle in two, one with its current parameter value and one with a new proposed parameter value, run both forward, and when you get to the next time step, accept or reject which one you keep. It does require doubling your number of particles, but then, rather than a tuned heuristic, you have a formal Bayesian approach.
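A sketch of that global Gaussian smoother idea. This is my simplified reading of the scheme described above: h = 1 keeps the current parameters, h = 0 redraws them from a multivariate normal fit to the current particle cloud; named variants like the Liu-West filter also correct for the variance inflation this simple blending introduces:

```python
import numpy as np

def smooth_parameters(theta, h, rng):
    """theta: (n_particles, n_params) array of particle parameter values."""
    mean = theta.mean(axis=0)
    cov = np.cov(theta, rowvar=False)
    # Fresh draws from the multivariate normal fit to the current particles
    draws = rng.multivariate_normal(mean, cov, size=theta.shape[0])
    # Blend: h = 1 -> keep current parameters, h = 0 -> complete redraw
    return h * theta + (1.0 - h) * draws

rng = np.random.default_rng(3)
# Hypothetical 2-parameter particle cloud
theta = rng.normal([1.0, 5.0], [0.2, 1.0], size=(2000, 2))

theta_kept = smooth_parameters(theta, h=1.0, rng=rng)   # unchanged
theta_half = smooth_parameters(theta, h=0.5, rng=rng)   # partially refreshed
```

The point of the smoother is that `theta_half` contains parameter values that were never in the original particle set, while still preserving the cloud's overall location, which is what lets resampled parameters escape degeneracy.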
Putting that resampling together with this Metropolis-Hastings step is called the resample-move particle filter. I'm not going to go into its implementation; this is now just giving you buzzwords to Google. If you ever decide to dive into it, you get to papers that are more equations than text, and that's where I am right now: I get the general idea, but I'm still scratching my head over some of the equations. Getting close to wrapping up. So again, big picture: we have a bunch of different options for how we iteratively assimilate data into models and forecasts, and they largely map onto the different ways that we propagate uncertainty. What about that analysis step? I wanted to come back to that. Why don't we just do the whole analysis step by MCMC? Option one: just refit the full state-space model every time step. As I said earlier, that's exactly what Tom Hobbs does. If I've got 50 observations over 12 years and I refit once a year by hand, who cares? That's still an option. We can always refit the full state-space model and then just make a new forecast out of that. Option two: what if we just update the forecast coming out of the state-space model? If we treat the prior as a set of samples, we can do that update as a particle filter. In many ways, the MCMC can run seamlessly into the particle filter, because we're just treating things as particles. Alternatively, we can approximate those samples with distributions, and then we're essentially doing some generalized form of ensemble filtering where we don't have to assume the same conjugacy that Kalman did. Because remember, like with classical statistics, where I said everything was done so that there was an analytical solution, Kalman likewise had to do things so there was an analytical solution, because MCMC wasn't a practical option when the Kalman filter was derived.
No one was doing that. So here's a very simple example of generalizing this. Here I've got a simple dynamic linear model, and I actually started by doing it with a standard ensemble Kalman filter, assuming everything was known. Then I incrementally relaxed those assumptions about what was known, refitting each analysis step using MCMC. So I didn't refit the full model; I just did each analysis step by MCMC. The first thing I said was, well, the assumption that the process error is known is the least tenable of all these assumptions, so I want to fit that. And it turns out that if I treat it as an unknown, I get something that converges to the same thing as the Kalman filter so quickly that they're essentially on top of each other. I then said, well, what if the process and observation error are both unknown? Well, it turns out I have no idea at all what I'm doing for the first five time steps, because as I'm going through the analysis step, the model is trying to reconcile these two parameters and come up with estimates of them. But then they start converging, and by 10 or 20 steps I'm largely getting the same answer that I did without that. Then I said, well, what if I treat the parameters as unknown? Now it takes me longer to sort out the three parameters I need to estimate. But still, I'm trying to estimate three parameters with 10 observations; it's not unreasonable to expect that to take a while. By the time I have 20 observations, I can actually get decent estimates of those three parameters. And what I see after that is largely similar, not identical, because there's always uncertainty in these parameters that causes the posteriors to be a little bit wider.
Example code for this is not in the GitHub repository for the course examples, but it is in the GitHub repository for all the code that goes with the book. So if you ever want to look at it, you can see this process of running the forecast step with an ensemble, stopping, running a JAGS model to do the analysis, and sampling from that JAGS posterior to run the next forecast. We've generalized this even further in some of the work I've been doing with Ann Raiho; I mentioned a number of these features in discussion over the last few days. This is an example of some of Ann's work. The data here are estimates of aboveground biomass and woody biomass increment that come from fusing tree-ring data with forest inventory data. The model itself is a forest gap model that's projecting the forest dynamics in time. So here we have the tree-ring data, we have the model prediction at each time step, and then the pink bar is the analysis at each time step. That's the fusion of the model and the data, staying tight to the data and following the same general mean trend line, but in many ways suggesting that there's interannual variability there that might be missed by the tree-ring reconstruction. When we did this, we generalized it so that we treated the process error as an unknown: we assumed a Wishart prior on it, which we update at every analysis time step. And from that, I can actually estimate the process error covariance matrix. What's shown here is just the process error correlation matrix, because, let's be honest, none of us can actually interpret what a covariance matrix looks like, but a correlation matrix we're all pretty good at. The diagonal is one because it has to be. So how do we interpret this?
How I interpret this is: yellow birch and cedar have a positive interaction that is not being captured by the current process model. By contrast, yellow birch and cedar have a negative interaction with hemlock, beyond the interactions that are already in the process model. And that's not something that's going to come out obviously from the data alone, because if I look at the data alone, this stand is undergoing succession: hemlock biomass increases and yellow birch biomass increases. If I just compute the correlation between yellow birch and hemlock, it's positive. I can't detect a negative competitive interaction between those species from the raw data, but I get it out of the process model. And more importantly, what I'm getting out of the process model is that the interaction is stronger than what the gap model already thinks, because the gap model already represents competition for light and water and nutrients, but there's something stronger going on that's not captured, and that shows up in the process error. By contrast, maple just does its own thing. So that either means maple is ignoring the other trees in the stand, or that the process model is already accurately capturing the interactions between maple and the other species in the stand, which is probably the better way of thinking about it. And then again, as I mentioned yesterday, instead of using a multivariate normal in the analysis step, we use a multivariate Tobit, which allows us to zero-truncate and zero-inflate the probability distributions. That zero truncation and zero inflation doesn't really show up as important for these two species, but when you look at rarer species in the stand, it's very easy to have cases where specific ensemble members predict a species to go extinct, so you have zeros in the model forecasts, or, vice versa, the model predicts a species to be in the stand that isn't observed, so the data have a zero.
So you get more zeros in both the model and the data for the rare species than you would have predicted by chance, and there it's really handy; for the common species, the multivariate normal would have been fine. So, some take-homes. The iterative forecast-analysis cycle of data assimilation allows us to continually confront models with data. All of these data assimilation variants are forward-only. They're special cases of the state-space model: the state-space model is the parent of all of these, and these are all just special cases. In the forecast step, the standard data assimilation methods map onto our uncertainty propagation options. In the analysis step, don't feel constrained by Kalman's normal-normal assumption; take those assumptions into your own hands. MCMC is a perfectly viable way of doing the analysis step except for very large compute problems. And that's all I had on numerical approaches to model-data fusion.