This is really my last lecture, and the thing I want to talk about is fusing data: how we combine multiple sources of information. In the book and in the grad version of this course, I actually cover this earlier in the semester, right after state-space models but before uncertainty propagation and data assimilation, because I want you to get that material early enough to start applying it. But we can talk about how you deal with multiple sources of information a little later, too. Some of this will jump back to issues that have more to do with calibration, but they definitely come into play during data assimilation as well.

One thing I think is important about data fusion is that fusion is fundamentally about synthesis, about bringing multiple sources of information together, and synthesis is a key complement to forecasting. We need to do good forecasts, and we need to do good synthesis; like prediction, synthesis is a complement to reductionist science. It deals with the fact that for virtually every system we study, there's no single data set that provides the complete picture of how that system works. Very often the ecological processes we're interested in have multiple parts, and we may have information on each of those parts. If I'm trying to project a population that has age or stage structure, I might have different information on different life-history stages, some at the individual level and some at the population level. If I'm working with a carbon-cycle model or a biogeochemical model, I might have different pools and different fluxes, and different ways of measuring each of them. That brings me to an important point about data fusion.
Data fusion is far more than just the informatics challenge of bringing different data sets together. It involves much more than concatenating files; in fact, if you have data that were collected differently, the last thing you want to do is put them all in one file and lose the fact that they were collected differently. The other challenge is that we often have data observed at different scales, with different uncertainties associated with different scales and different measurement types. So we can't just plop a single observation error down on top of all the different data sources.

It's also worth talking about data fusion because, if you look in the literature, there are a lot of naive approaches out there that just drop the uncertainties and drop the covariances. There was a time, not that long ago, when a lot of folks thought that the right way to build a model was to build models for each subcomponent, put those subcomponents together, and expect the assembled whole to make the right prediction. That would be great, but when you calibrate each part of a model independently, you don't estimate the covariances between the parts, and those covariances are often essential. A great example comes from the literature I work with. Imagine I'm trying to predict the composition of a forest; specifically, let's say I'm trying to predict the leaf area in a forest. I could do this by predicting the leaf area of every individual species in the forest and then summing those independent predictions. But if I sum independent predictions, the range of variability in the total is going to be huge, because each individual species is fairly variable and fairly uncertain, so the uncertainty in the overall prediction will be very large.
But does that actually reflect our understanding of how canopies behave? No. Canopies have emergent phenomena whereby it doesn't matter which species are there; the LAI in a given biome is fairly predictable. That's because the species-level predictions aren't independent; there are covariances between them, and if you don't capture those in the process model, you have to capture them in the error structure. They have to be part of the calibration.

So I'm going to go through a couple of ways that we think about synthesis, starting with the simplest: meta-analytical methods. These have become more popular in ecology as ways of synthesizing information. One thing that's improving these days is the ability to access raw data from previously published studies, but there's a lot of legacy research out there where all you get from a published study are the numbers in the paper: summary statistics. You might have a mean, a sample size, and a standard error, or some other summary statistic, from each study. When you do a meta-analysis, you often have some effect size that you're interested in, such as a difference between means, a correlation coefficient, or a regression slope, that you're trying to combine information about. In the context of predictive synthesis and forecasting, the quantities of interest in the work we've done are often the ones that map onto model parameters, the things that constrain specific processes. It's always worth noting that when you do a meta-analysis, you face the challenge of reporting bias: neutral or negative results are less likely to be published than positive results, and that can give you skewed estimates of parameters.

The fact that meta-analysis has become very common in ecology is a source of inspiration to me. About a year ago, Frank Davis, one of the former directors of NCEAS, came to BU to start a sabbatical, and he's been here all year.
I was telling him about some of the forecasting work we're doing and how we're trying to advance ecological forecasting, and he told me a story: when NCEAS started, you could spot an NCEAS paper in the literature very easily. They knew they had essentially achieved their mission when you could look through the literature, find lots of synthesis papers, and not be able to tell which ones came from NCEAS working groups and which came from the rest of the field. That's my inspirational thought for ecological forecasting at this point. For the next few years, I expect to see you guys publishing on ecological forecasting; when folks who haven't come through here are the ones publishing on it, I'll think we're starting to make a difference.

There are lots of ways of doing meta-analysis. Figures like this really highlight, in my mind, the connection between meta-analysis and some of the things we've been talking about over the past few days about updating our inference as we go along. A meta-analysis might take a series of effect sizes, each of which may have a lot of uncertainty, and synthesize them into an overall aggregate story that combines the information across all of these individual studies, often reaching the conclusion that we can be quite confident in the sum of a whole bunch of fairly unconfident results. What I like is this figure here, the cumulative meta-analysis, which asks: what do we get from the first study alone? Then what do we get if we combine the first study with the second? Then the third, then the fourth, then the fifth? What you're seeing here is essentially the same Bayesian updating process that we use in forecasting.
Every time there's a new study, they're adding that information. They're doing this all retrospectively, but the thing I find amazing is that you can see that, as a community, you were often confident quite early that you could distinguish your hypothesis from some null model; after that you're just gaining confidence, and even huge-sample-size studies don't really change what was already a clear picture. But I like the iterative version of this.

I want to touch on one specific version of a meta-analytical model, one we've been using in some of our work, which is meta-analysis as a hierarchical Bayes model. One of the things that separates a meta-analytical model from other models is that in the data model you don't have the raw observations; you have summary statistics like sample means, sample standard deviations, and sample sizes. So what we do is write down, as our process-model stage, a relationship between, for example, the latent true mean of a study and the observed sample mean and other summary statistics. We have a layer that tries to infer the true mean of each study, because what's reported in the study is not that mean itself but the sample mean, an estimate of it. We might say, for example, that the sample mean is normally distributed around the latent true mean with a standard error that depends on the within-study variability and the actual sample size. We might similarly write down a model saying that the observed within-study variance is related to the true variance and, again, the sample size. One of the important things about a meta-analysis is that when you combine information across multiple studies, they don't all count equally. Studies with larger sample sizes tend to count more.
Studies that are more precise tend to count more. At the hierarchical level, we can then write down a model, for example just a simple mean, describing the mean across these individual studies, along with the cross-study variance. These are unknowns, so we need to put priors on them. Likewise, we might write down a model describing the study-to-study variability in the variance itself, and again we need priors on its parameters. So we could end up with a hierarchical model for the site-to-site mean and a hierarchical model for the site-to-site variance. In practice, we often make the simplifying assumption that the within-study variability is fairly similar across studies, which turns that part of the model into a prior and takes the extra hierarchical layer away.

In our system, we've operationalized this into a meta-analytical model that runs on trait databases in a largely automated way. Any time we run our ecosystem models, the system goes into the trait databases we're connected to, pulls the latest trait data down, and updates the parameters. So we start with uninformative or expert-elicited priors on model parameters and then update those with the trait data. This is just a graphical version of what we're talking about: we fit an overall across-site mean and a random site effect. We also include a random treatment effect, because we're often synthesizing information from experimental studies that may have different treatments. And since we're dealing mostly with plants, we have a fixed effect for any greenhouse or potted-plant study, knowing that the traits that come out of those studies may be systematically biased relative to those you find in natural systems. We then combine those to get study-specific means and constrain them with the observed means, observed sample sizes, and observed variances.
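To make that weighting concrete, here's a minimal sketch in Python of the fixed-effect version of this kind of combination: each study's sample mean gets weighted by the inverse of its squared standard error, so larger and more precise studies count more. This is just my illustration with made-up numbers; the full hierarchical model adds the cross-study variance and the priors and would typically be fit with MCMC.

```python
import math

def pooled_estimate(studies):
    """Precision-weighted (fixed-effect) combination of study summaries.

    studies: list of (sample_mean, sample_sd, sample_size) tuples.
    Each study's weight is 1/SE^2, so larger-n and lower-variance
    studies count more, exactly as described above."""
    terms = []
    for m, s, n in studies:
        se = s / math.sqrt(n)          # standard error of the study mean
        terms.append((m, 1.0 / se**2))
    total_w = sum(w for _, w in terms)
    mean = sum(m * w for m, w in terms) / total_w
    return mean, math.sqrt(1.0 / total_w)

# Three hypothetical studies, each reported only as (mean, sd, n)
m, se = pooled_estimate([(10.0, 2.0, 25), (12.0, 4.0, 100), (11.0, 1.0, 4)])
```

Note that the pooled standard error is smaller than any individual study's, which is the whole point of the synthesis.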
The next thing I want to introduce is: what if we want to assimilate all of the data at once rather than iteratively? It's worth pointing out that, mathematically, the two are equivalent. If posterior 1 is proportional to likelihood 1 times the original prior, posterior 2 is proportional to likelihood 2 times posterior 1, and posterior 3 is proportional to likelihood 3 times posterior 2, then that's mathematically equivalent to saying posterior 3 is proportional to likelihood 1 times likelihood 2 times likelihood 3 times the original prior. So whether you fit your data all at once or iteratively, you should get the same answer. In practice, what comes out of the posterior is a set of samples, and you then either have to run a particle filter on them or assume a distribution, so you can lose a little information when you work iteratively. On the flip side, if you work iteratively, you don't have to go back and refit the whole model every time you get new information. Those are the trade-offs.

Either way, whether iteratively or all at once, consider the following idea: suppose I have some process model making a prediction (it doesn't matter what it is; I'm predicting a mean given some covariates and parameters), and I have K different types of observations relevant to that prediction. I'm predicting something, and I have a bunch of different ways of observing that thing, so I might end up with K different observation models, which translates to K different likelihoods describing the probability of each type of data given the model's predictions. So how do we combine them? I'm going to give a simple example using regression, because I feel like regression is something people can get their heads around.
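Here's a small numerical check of that equivalence, using the conjugate normal model for an unknown mean with known observation variance (a toy example of mine, not anything from the slides): updating batch by batch, with each posterior becoming the next prior, gives exactly the same posterior as assimilating all the data at once.

```python
def update(prior_mean, prior_var, data, obs_var):
    """Conjugate normal update of an unknown mean with known obs variance.
    Posterior precision = prior precision + n/obs_var."""
    post_prec = 1.0 / prior_var + len(data) / obs_var
    post_mean = (prior_mean / prior_var + sum(data) / obs_var) / post_prec
    return post_mean, 1.0 / post_prec

# Three "studies" assimilated iteratively, each posterior feeding
# in as the next prior...
batches = [[4.8, 5.2], [5.5], [4.9, 5.1, 5.0]]
m, v = 0.0, 100.0                      # vague original prior
for batch in batches:
    m, v = update(m, v, batch, obs_var=1.0)

# ...versus all of the data at once against the same original prior.
m_all, v_all = update(0.0, 100.0, [y for b in batches for y in b], obs_var=1.0)
```

The two answers agree to machine precision, which is the iterative-equals-batch equivalence in action.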
Imagine I have two types of data that both tell me about the relationship between X and Y. Let's also assume they're actually measuring the same X and Y, which is an important assumption, because when you have two different approaches to measuring what's supposed to be the same thing, it's not uncommon for them to actually measure slightly different things. But let's assume here that they measure the same thing, and that what I'm seeing is a trade-off between the blue method, which is cheap but more uncertain, and the green method, which is more precise but more expensive, so I have less of that data. So: precise data, but less of it, and less precise data, but more of it. I know lots of people who would just say, let's throw out the blue data, it's less precise, let's just use the green. Well, that's throwing out information, and the total information contribution of the two data sets might actually be similar, because with enough low-quality data you often still end up with a good constraint.

Here's what I get if I fit a Bayesian regression to these two data sets independently. You can see I get slightly different slopes, but they're not incompatible with each other. I can fit each independently by writing down a simple model; this is just the JAGS code for a regression. I have a prior on the slope and intercept, a prior on the standard deviation, I loop over all the data, and I have my process model and my data model. How do I expand this to fit both data sets at the same time? Because that's what I want: the synthesis fit across both data sets. Here's an example of how I might do that. First, I have one set of priors, because I'm fitting one line to both data sets, and I have the exact same process model, with the same parameters, in both places.
But when I fit the first data set, it has its own observation error, and when I fit the second, it has a different observation error. So I have two likelihoods, each with its own uncertainty, but fitting the same underlying process model. What we get out of that is what we'd expect given everything we've discussed about updating forecasts: the resulting line is more precise than either individual fit, because we've combined both pieces of information. Obviously, if the two data types were not measuring exactly the same thing, your observation models might end up being more complicated. For example, if you took one type of observation as truth, you might need an observation model on the other that involves some sort of calibration for it being a proxy; I might end up with a linear model relating the latent state to the proxy variable, or some other relationship between them, which is completely fine and valid to do.

I'm going to give a few more, and more complicated, examples of combining multiple pieces of information. This one comes from the plots I worked on for my dissertation in North Carolina. Shannon has lost almost as much blood to these plots as I have; there are lots of things in them that like to bite you. This is aerial imagery at fairly high resolution, where you can pick out individual tree crowns, and a colleague of Shannon's and mine, Mike Wallace, was digitizing the individual tree crowns for his dissertation to estimate how much light those trees received. But we realized that we had multiple pieces of information, not just the remotely sensed image, for understanding how much light each tree receives. In the end we had three pieces of information that entered through four different likelihoods.
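Before moving on, here's a numerical sketch of that two-likelihoods-one-process-model idea. Instead of JAGS and MCMC, I'm using weighted least squares with known observation errors, which for Gaussian errors and flat priors gives the same point estimates as the two-likelihood Bayesian fit; the data are simulated and all the names are mine. The point is simply that the combined fit is more precise than either fit alone.

```python
import math
import random

def wls(points):
    """Weighted least squares for y = a + b*x with per-point weights
    1/sigma^2. Equivalent to maximizing two (or more) Gaussian
    likelihoods that share one process model but carry separate
    observation errors. Returns (a, b, var_b)."""
    Sw = Swx = Swy = Swxx = Swxy = 0.0
    for x, y, sigma in points:
        w = 1.0 / sigma**2
        Sw += w
        Swx += w * x
        Swy += w * y
        Swxx += w * x * x
        Swxy += w * x * y
    det = Sw * Swxx - Swx**2
    b = (Sw * Swxy - Swx * Swy) / det
    a = (Swy - b * Swx) / Sw
    return a, b, Sw / det              # var(b) given known sigmas

random.seed(1)
true_a, true_b = 1.0, 2.0
# Blue method: cheap and noisy, lots of points
blue = [(x, true_a + true_b * x + random.gauss(0, 2.0), 2.0)
        for x in [i / 10 for i in range(100)]]
# Green method: precise but expensive, few points
green = [(x, true_a + true_b * x + random.gauss(0, 0.5), 0.5)
         for x in [0, 2, 4, 6, 8, 10]]

_, _, vb_blue = wls(blue)
_, _, vb_green = wls(green)
_, _, vb_both = wls(blue + green)      # one line, two observation errors
```

The slope variance from the joint fit is always smaller than from either data set alone, because the two Fisher information matrices add.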
So, first of all, what I'm trying to estimate here is lambda: a tree-by-tree estimate of the light environment, how much light is reaching each individual tree. I have an estimate coming from the remotely sensed imagery: I can measure the size of a tree's crown and translate that to a light estimate, which I think is actually expressed in terms of crown area, measured with some uncertainty. But we also found that remote sensing doesn't see every tree in the forest; it mostly sees the ones at the top. So we ended up with a logistic regression describing the probability of a tree being observed in the images versus not, and that probability was itself a function of light availability. Whether or not we saw a tree in the imagery was itself information about how much light it was getting, because if we didn't see it, we know it was receiving less. In that case, in some sense, the missing data was itself information. Then we have the field data that traditionally gets collected when you're wandering around the woods doing a forest inventory: canopy status. Was this tree in the canopy, the mid-story, or the understory? From that we get estimates of light via a multinomial regression on canopy status. And then we had a mechanistic model: we had mapped every tree in the stand, so we could put the stand into a 3D ray-tracing model to predict each tree's light environment. But those models are also imperfect. So the overall estimate of the light environment is the synthesis of information coming from the remote sensing, from the field data on canopy status, and from the process-based model of the light environment, each of which has the same latent variable entering it as a predictor.
Four likelihoods, all constraining the same latent variable. Up to now, these examples have mostly been about combining multiple sources of information at a specific point in time, a snapshot, which is very common for regression-style analyses. Next I want to think about how we combine information across space and time: bringing this idea of data synthesis into our state-space models. So imagine we have our state-space model: some latent X evolving through time according to our process model, and some observations Y. One of the things we learned from Shannon's lecture is that the state-space model is fairly robust to missing data. I might not have an observation of Y here, but I might have observations here and here, and that's fine; I'm borrowing strength from those to make an inference about the missing one. But we can also apply here what we did with the simple regression. I might have a second set of observations, Z, that give me an estimate of X, with their own observation model and their own observation error, just like that first regression example with two likelihoods tied to the same underlying model. The nice thing about the state-space model is that it's very flexible: maybe I have just one data type, maybe just the other, maybe both. Where I have both, I have the most constraint on X, because I literally have four pieces of information constraining it: the previous X, the next X, Y, and Z. In other cases I have less; here I might have only two, the previous state and one observation, with no future observation. Likewise, at the start of the time series I might have two. So you have different levels of constraint in proportion to the number of likelihoods.
If all of these are Gaussian, the combined constraint just ends up being the sum of their precisions; if they're not Gaussian, it's conceptually similar. So it's straightforward to extend the state-space model to fuse multiple types of observations. But what do we do if they come from different scales? There are basically two options: scale the process model, or aggregate the data model. Let's look at what that might mean. First option, thinking in terms of time: imagine Y is observed on a coarser time scale than Z, which is at a fine time scale. I might choose to write my process model at the time scale of Z, and then, when I write the observation model for Y, a single Y may map onto multiple latent states. It's worth noting that if Y is instantaneous but just coarser resolution in time, then I'm simply mapping it onto individual latent states infrequently. But often you have types of measurements that integrate over space or time: the cumulative discharge through a weir, the cumulative flux observed, the cumulative number of individuals that fell into a pitfall trap. Such a measurement integrates over the whole time the sensor was active; it doesn't map to a specific discrete time. That's actually not hard to handle: you might say that the sum or the average of a set of latent states is related to the observation through a likelihood. The advantage of this option is that we're working at the full resolution, taking advantage of the high-resolution data. The cons are computation, because at high resolution you have many more states to estimate, and potentially identifiability: if I had a period where I didn't have the Zs and only had the Ys, there are potentially an infinite number of combinations of those Xs compatible with their sum.
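Here's a minimal sketch of what that kind of joint likelihood looks like, with hypothetical names and a Gaussian error model: fine-scale observations z map one-to-one onto latent states (and may be missing), while each coarse observation y constrains the sum of a block of latent states. It also illustrates the identifiability caveat: with only the coarse data, rearranging the latent states within a block leaves the likelihood unchanged.

```python
import math

def gaussian_loglik(x, mu, sigma):
    """Log density of a Normal(mu, sigma) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def joint_loglik(latent, z_obs, y_obs, sigma_z, sigma_y, block=4):
    """Joint log-likelihood for fine-scale latent states constrained by
    fine-resolution observations z (one per state, None if missing) and
    coarse observations y, each integrating a block of `block` states."""
    ll = 0.0
    for x, z in zip(latent, z_obs):
        if z is not None:                       # fine-scale likelihood
            ll += gaussian_loglik(z, x, sigma_z)
    for j, y in enumerate(y_obs):               # coarse likelihood on sums
        total = sum(latent[j * block:(j + 1) * block])
        ll += gaussian_loglik(y, total, sigma_y)
    return ll

# With only the coarse observation, rearranged states are indistinguishable:
same_sum_a = joint_loglik([1, 2, 3, 4], [None] * 4, [10.0], 1.0, 1.0)
same_sum_b = joint_loglik([4, 3, 2, 1], [None] * 4, [10.0], 1.0, 1.0)
assert same_sum_a == same_sum_b

# A single fine-scale observation breaks that symmetry:
with_z_a = joint_loglik([1, 2, 3, 4], [1.0, None, None, None], [10.0], 1.0, 1.0)
with_z_b = joint_loglik([4, 3, 2, 1], [1.0, None, None, None], [10.0], 1.0, 1.0)
assert with_z_a > with_z_b
```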
Again, that's not a problem if you do have the Zs, and it's also not a problem if you have a strong process model linking the latent states; if I was observing a trend and then saw an aggregate over that trend, it's not hard to disentangle. But if all you're seeing is coarse-resolution data, it can be. As an extreme case, if I only had coarse-resolution data and decided to model the process at very high resolution, I might not have any way of disaggregating it; I might just end up with four latent states that are highly correlated with each other and trade off against one another. I can also do the opposite: I might choose to model at the coarse resolution, in which case I write down a likelihood that compares each coarse time step to, for example, the sum or mean of my high-resolution data. That's computationally more efficient, but it potentially involves a loss of information, because I'm taking high-resolution data I have and aggregating it to a coarser scale.

In this example I talked about doing this in time, but everything here works in space as well. Here's an example, not mine, of applying this concept to irregular polygon data, such as GIS layers, where the observations might be township- or county-level summary statistics and you're trying to make an inference about the latent continuous process underneath them, that is, to disaggregate them.
Again, that's not strictly identifiable if every single high-resolution pixel is independent. But if you make some assumption about spatial smoothness, and it doesn't have to be a strong assumption; for example, if you assume a spatially smooth process with some spatial autocorrelation parameter that has to be estimated, rather than asserting what the surface is, then you can actually do that disaggregation, and it works even better if you also have other data sets at other scales contributing information. If you need to dive into this literature, these problems are often referred to as change-of-support problems in the statistics literature, because you're dealing with data that integrate across different scales. The spatial-statistics literature also has a whole series of examples of, for instance, combining point data with areal data, whether polygon or raster.

The other takeaway is that the way ArcGIS does this is completely wrong: it does all of its upscaling and downscaling without any accounting for uncertainty. Arc will just smooth the data for you. The nice thing about the state-space approach is that when you ask it to disaggregate information, you get the full posterior of the disaggregated quantity, not just an interpolated surface. That can be really important, because if I take this map and feed it into some other analysis as a covariate or input, I don't want to treat each of those disaggregated values as if they were data; that can produce a lot of false confidence. I could set this up at some fine spatial resolution and have millions of "data points" entering the next stage of the analysis, but I don't really have millions of data points; I have something like 30 townships.
So the downscaling process, this change of support, this change of scales, can create a false impression of more information than you actually have. By contrast, if we do it in a Bayesian way, not only do we have the uncertainty, but remember that when we draw things from posteriors, we draw them jointly: you might draw a whole map at a time, which accounts for the uncertainties appropriately.

The next challenge I've seen a lot in data fusion is identifiability. In my line of work, the classic example is eddy-covariance towers. If you've not seen these before, they're a cool bit of technology: you set up a scaffold in an ecosystem, and it carries instruments that measure wind speed using sound waves and gas concentrations using lasers. It's neat because it can measure the net flux of gases between the atmosphere and the land surface, most commonly the CO2 flux and the water flux. So it might give me the net carbon flux in and out of the system and the net water flux in and out of the system. But that net is made up of a whole bunch of underlying processes, and there are infinitely many ways to get the same net flux from trade-offs among those processes. If all you observe is the net, it's very hard to disentangle them. So this is, again, a case where synthesis is very valuable: if I have the net flux but also combine it with detailed information about specific processes, then I have a way to start disaggregating it. This is an example where I might have a model with multiple processes, in which some observations constrain the aggregate output and others constrain the parts.
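Here's a toy version of that identifiability problem, with made-up numbers: if the net flux is, say, uptake minus respiration, then the net alone cannot distinguish parameter combinations that trade off against each other, but adding a second likelihood on one of the component processes can.

```python
def net(gpp, resp):
    """Net flux as the difference of two component processes
    (hypothetical decomposition for illustration)."""
    return gpp - resp

# Two very different component combinations, identical net flux:
assert net(10.0, 4.0) == net(12.0, 6.0)

def loglik(params, net_obs, resp_obs, s_net=0.5, s_resp=0.5):
    """Two Gaussian likelihoods (constants dropped): one on the net flux,
    one directly on the respiration component."""
    gpp, resp = params
    return (-(net_obs - net(gpp, resp))**2 / (2 * s_net**2)
            - (resp_obs - resp)**2 / (2 * s_resp**2))

# A direct respiration measurement (e.g. a chamber flux) breaks the tie:
assert loglik((10.0, 4.0), 6.0, 4.0) > loglik((12.0, 6.0), 6.0, 4.0)
```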
A population analog would be something like an age- or stage-structured model, where I might have detailed information about specific transitions, perhaps individual-level data, and also population-level data constraining the net overall behavior of the system, and I combine both pieces of information. This study, done by Trevor Keenan a few years ago, was, I thought, a neat example of exploring the challenges of fusing multiple pieces of information and trying to understand the information contribution of different data sources. Each of these is a different data set; this is an attempt to constrain a carbon-cycle model at Harvard Forest, an LTER site about an hour and a half due west of us in central Massachusetts. What we see here is the posterior distribution of the error estimate on a log scale; when he starts with just the first data set, it's essentially off the scale. Here are the posterior estimates of individual parameters, plotted as violin plots, and here is his carbon-cycle prediction from 2000 out to 2100. We can see a good bit of uncertainty in the forecast, a lot of uncertainty in the parameters, and a lot of process error. What he did was assimilate one data set at a time, that iterative process: take the posteriors from one step and feed them in as the priors for the next. You can see, as he added data sets, how the overall error was constrained, how individual parameters were constrained (some of which never ended up well constrained), and how that affected the forecast error in the end.
Trevor was working with a fairly simple model, so he was able to do something I rarely can, which is to treat this as a forward-selection problem, like you would in regression. At the first stage he literally fit every single data set by itself and asked: which data set gives the best constraint on the model? The first one at the top is the single data set that best constrained the whole system. He then took the n-1 remaining data sets, fit each of them conditional on the posterior from the first, and asked which was the second most important, and so on; that's why it's called "rate my data." He could actually rank the data sets in order of importance. As a nice contrast to the idea of the value of multiple constraints, Trevor will also point out that when you get toward the end, there's the reality of redundancy: two pieces of information may be providing essentially the same constraint on the system, and while they still constrain parameters, they're not nearly as valuable as additional independent axes of constraint. For example, I think he found that some of the explicit phenological measurements were not as valuable, because a lot of the information about phenology was already embedded in the flux data.

That's not the last thing I want to focus on, but it sets me up for what I think is one of the biggest and most underappreciated challenges in data synthesis: what happens when you combine data sets that are not equal in size? If I have the same number of samples from ten different data sets and combine them, I get a decent constraint and they all contribute roughly equally. But consider something like this: this is the NEE data, half-hourly data throughout the year, and the Harvard Forest tower has been running for over 20 years. That's roughly 17,500 observations per year, times more than 20 years.
This is a very large amount of data. By contrast, soil carbon: how many times are you gonna go out with a soil core and hammer it into the ground? You're definitely not doing that every half hour. Unless you wanna kill your undergraduates, or be killed by mutinous undergraduates, you're never gonna ask someone to measure soil carbon on a half-hourly timescale. So what happens when you combine something that's measured manually with a small number of samples with something that's measured in an automated way? On one hand, you definitely want to include both, because usually the reason you're bringing in that additional information is that it gives you an additional axis of information; you're trying to use it to tell you something you didn't already know from the big data. But often your likelihood will end up dominated by the larger data set, such that your model calibration will just ignore that other high-quality, manually collected data, simply because it's getting overwhelmed by the large volume of data. If you look at the existing literature, at least the literature that I've found, it's full of ad hoc solutions: things like multiplying the likelihoods by arbitrary, expert-chosen numbers to make one data set more important and the other less important, or subsampling or averaging the high-frequency data to bring the data sets into balance. Well, that works, but if you're subsampling or averaging the data, you're essentially throwing out information that you actually have, which is painful. And the first approach is technically invalid, because if you multiply your likelihood by arbitrary numbers, not only does it affect your mean, but it can have a really big impact on your confidence intervals. If I have 10 observations but I multiply my log-likelihood by 100, my confidence intervals look great.
It's like, well, yeah, it doesn't mean I actually know what's going on; it just means I artificially pretended I had 100 times more information than I actually do. And with averaging we have loss of information. Again, this is somewhat arbitrary: you can in some sense get the answer you want by tuning the degree of averaging. If I don't like this answer, I can try monthly, or daily, or weekly, and at some point I'm going to say, OK, I've decided that I like the answer I got because it seems to balance these two pieces of competing information. But again, it's arbitrary. So how might we do this in a less arbitrary way? Oh, and also: avoid double-dipping. I've seen people do things like, actually, I've seen Trevor do it, assimilate daily, monthly, and annual aggregates of the same flux data. Are those independent pieces of information? No, that's the same data put in three times as three different constraints. I love Trevor, but I don't always agree with all of his choices. And he's not the only one; I've seen lots of people do this. So this animation is meant to give an example of something that I think is intuitive to a lot of people, which is that when we have high-frequency automated measurements making thousands and thousands of observations per year, it doesn't mean that we actually have thousands and thousands of pieces of information. Here I'm starting with a full, unthinned data set and thinning it by half each time, and looking at how the autocorrelation changes. And you can also just visually see that when I first start thinning, I'm not actually losing much information, even though I'm halving the sample size every single time. So the point is that when we put in high-frequency automated measurements in space or time and we ignore their autocorrelation, we're giving them too much weight, because the observations are not independent.
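The thinning experiment can be sketched in a few lines. This simulates a strongly autocorrelated AR(1) series, the rho = 0.95 value is arbitrary and just illustrative, then repeatedly halves it and watches the lag-1 autocorrelation fall; early rounds of thinning discard little information because adjacent samples are nearly redundant:

```python
import numpy as np

def lag1_autocorr(x):
    # sample lag-1 autocorrelation of a series
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

# simulate an AR(1) process: x[t] = rho * x[t-1] + noise
rng = np.random.default_rng(1)
n, rho = 2**14, 0.95
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):
    x[t] = rho * x[t - 1] + rng.normal()

series = x
for step in range(5):
    print(len(series), round(lag1_autocorr(series), 3))
    series = series[::2]  # thin by half each round
```

Each halving squares the effective lag-1 correlation (rho, rho², rho⁴, …), so the autocorrelation decays slowly at first: the thinned series keeps most of the information of the full one.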
So here, for that simple simulated example, I'm plotting log number of observations against log autocorrelation, and there's a strong trade-off as I thin: the autocorrelation drops, such that the effective sample size in the data does not shrink nearly as quickly as the raw sample size. And in some sense, it's the effective sample size that's closer to a real measure of the information contribution of a data set. So the point is: treat the uncertainties in data appropriately when you try to combine multiple pieces of information. When you have repeated-measures data, model the fact that it is repeated-measures data. If you ignore the autocorrelation in space and time, you will give a lot of these automated data sets much more weight than they deserve: treating each observation as independent inflates its information content, and that's one of the reasons these automated measurements can swamp other measurements. The other thing to think about is that when someone puts up a flux tower, or any other sort of automated measurement, and I do have data loggers running out at lots of places, they usually put out one sensor. So I may have a lot of time-series information, but I have a sample size of one. And if you think about it, if I have sampling uncertainty, how do I even estimate my sampling uncertainty if I only take one sample? That's another reason we need to think about the uncertainty in these sorts of measurements appropriately. If you have high-temporal-resolution information that's unreplicated, how do you deal with the fact that you have no replication?
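For a concrete sense of the numbers, here's a short sketch using the standard AR(1) approximation for effective sample size, n_eff = n(1 − ρ)/(1 + ρ); the half-hourly sample count matches the flux-tower example above, and the ρ values are illustrative:

```python
# Effective sample size under AR(1) autocorrelation:
# n_eff = n * (1 - rho) / (1 + rho)
# With ~17,500 half-hourly observations per year, strong autocorrelation
# leaves far fewer independent pieces of information than raw observations.

def effective_sample_size(n, rho):
    return n * (1 - rho) / (1 + rho)

n = 17520  # half-hourly observations in one year
for rho in (0.0, 0.5, 0.9, 0.99):
    print(rho, round(effective_sample_size(n, rho)))
```

At ρ = 0.99, a year of half-hourly data carries roughly the information of a hundred independent observations, which is exactly why treating every half hour as independent lets the tower swamp the soil cores.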
And in fact, if you believe that there's some overall mean that you're sampling from, and you don't know where in that distribution you've drawn from, where you happen to have set up your data logger, you can essentially treat that as a bias. Think about it this way: if I had unlimited money and set up a dozen flux towers, I could get a good estimate of the sample mean and the variability, and it's really that overall landscape mean that I'm interested in. Having put up just the one tower, all of the information I'm collecting at that tower has some bias associated with it, and I don't know what that bias is because I don't have any replication. If I have any way of estimating that sampling uncertainty, I sure as heck should put an informative prior on it. With an uninformative prior, essentially the whole thing blows up; I've never even tried that. But in some sense, that's something you would want to think about: how do you account for that? And then the other thing I want to focus on for the rest of this lecture is systematic errors, and this gives a hint of it. By chance, I have set up some place that's slightly different than average, and every place I set up is slightly different from average. But there are also lots of measurement techniques out there that themselves have errors associated with them. The important thing about systematic errors is that they don't average out. We talked about how treating data as independent overinflates its information, and how accounting for autocorrelation can reduce that, but random errors still have the property that they average out. If there's also a systematic error, though, I can have 17,000 observations a year, and while the random component will essentially go to zero by the end of the year, any systematic error in those observations persists no matter how many observations I take.
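A tiny simulation makes the point: random error averages out, systematic error does not. The truth, bias, and noise values below are all made up for illustration:

```python
import numpy as np

# Simulate a sensor with random noise (sd = 2) plus a fixed systematic
# bias of 0.5 around a true value of 10, and watch what happens to the
# error of the mean as sample size grows.
rng = np.random.default_rng(0)
truth, bias, sd = 10.0, 0.5, 2.0

for n in (10, 1000, 100000):
    obs = truth + bias + rng.normal(0.0, sd, size=n)
    print(n, round(obs.mean() - truth, 3))  # error converges to the bias, not to 0
```

No matter how many observations you take, the error of the mean converges to the bias, not to zero.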
Even if I could sample a process every single second, if the way I'm taking that observation is biased, I'm still getting a biased estimate. Here's an example from the first time I tried to constrain a process model, something that happened to me when I was a postdoc. I was working with data from Bartlett Forest up in New Hampshire. I knew that I couldn't just use the net carbon flux because of the identifiability problem, so I also had soil respiration fluxes. But in the raw data, the soil respiration numbers were bigger than the total ecosystem respiration numbers, and I didn't realize that until after I tried the calibration. And I can tell you, the optimal solution when you try to make a mechanistic model produce higher soil respiration than ecosystem respiration is a crazy ecosystem. It basically says there is no autotrophic respiration, there's a whole lot of heterotrophic respiration, and plants grow absolutely insane, because the model has been given a constraint that is physically impossible. The data sets had systematic errors. In this case, you didn't actually know which one was wrong, but you knew they were incompatible with each other. In examples like that it's obvious, but it's not always obvious, and you can get really funky results because of these systematic errors. So don't forget that systematic errors are present even when we're only calibrating with one data set; they just often become much more obvious when we're trying to do synthesis. If you're calibrating a model against one data set, you can often get the model to fit that data quite well. But when you're trying to calibrate a model with multiple data sets, you can get to a point that really highlights that there's no way to make the model compatible with both sets of observations.
So the last bit here comes from some work I did with David Cameron, the Scottish ecologist, not the prime minister. A few years ago I started working with a cross-EU COST Action; COST Actions are kind of similar to what we have as RCNs here in the US. And I found a kindred spirit in someone else who was also obsessed with the challenge of synthesizing multiple sources of information and with these problems of systematic errors, inconsistencies, and data imbalance. One of the things David and I did was a set of pseudodata experiments with a very simple ecosystem model; in fact, we named it the Very Simple Ecosystem Model. It's very similar to what you guys played with in the particle filter exercise: we just have an NPP process that allocates carbon to leaf and wood, that turns over to the soil, we have some turnover of the soil, and the amount of leaves feeds back on the amount of NPP. Pretty simple model. What we did is we took that model, simulated data from it, then added different types of errors, systematic and random errors in the data and systematic and random errors in the model, and explored trade-offs with balanced and unbalanced data. First, if you have a perfect model, we didn't put any errors in the model, and very balanced data, you get a good fit of the model to the data and you recover the right parameters. We simulated measurements of net carbon flux, soil carbon, and above-ground carbon, and we made the same number of observations of each of those things. So this avoids the problem of unbalanced data I was talking about before, where I have a lot of data on one thing and a little bit of data on another.
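The model structure just described can be sketched in a few lines. This is not the actual Very Simple Ecosystem Model; the functional forms, parameter names, and values below are invented to illustrate the same structure (NPP allocated to leaf and wood, turnover to soil, soil decomposition, and a leaf-area feedback on NPP):

```python
import numpy as np

def step(leaf, wood, soil, p):
    """One time step of a VSEM-style carbon model (illustrative only)."""
    lai = leaf * p["sla"]                              # leaf area from leaf carbon
    npp = p["npp_max"] * (1 - np.exp(-p["k"] * lai))   # light-limited NPP feedback
    d_leaf = p["alloc_leaf"] * npp - p["tau_leaf"] * leaf
    d_wood = (1 - p["alloc_leaf"]) * npp - p["tau_wood"] * wood
    d_soil = p["tau_leaf"] * leaf + p["tau_wood"] * wood - p["tau_soil"] * soil
    return leaf + d_leaf, wood + d_wood, soil + d_soil

# invented parameter values, chosen only so the toy model behaves sensibly
params = {"sla": 0.2, "npp_max": 5.0, "k": 0.5, "alloc_leaf": 0.3,
          "tau_leaf": 0.1, "tau_wood": 0.01, "tau_soil": 0.005}

state = (10.0, 500.0, 1000.0)  # leaf, wood, soil carbon pools
for _ in range(100):
    state = step(*state, params)
print([round(s, 1) for s in state])
```

A pseudodata experiment then amounts to running this "true" model forward, adding random and/or systematic errors to simulated observations of the fluxes and pools, and calibrating a (possibly perturbed) copy of the model against them.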
OK, next, we created that unbalanced data situation, which we had observed led to some of these problems with real data and real models: the model would fit the high-volume data set very well and just ignore the others. Well, it turns out that if you have a perfect model, you still get a good fit and the right parameters, because you're fitting the right data to the right model. That highlighted that the problems of unbalanced data are not inherent in the fact that they're unbalanced. We then introduced errors in the model. In practice, that meant that the model we fit to the data was slightly different from the model used to generate the data; the model we fit was an approximation of the true model. When we did that with balanced data, we could get the model to fit the data well, but we did not actually recover the true parameters. And when we fit to unbalanced data, we didn't recover the true parameters, and we couldn't get the model to fit the low-volume data well. Again, this is exactly what we see in the real world: with unbalanced data, you fit the high-volume data and you kind of ignore the low-volume data. So that highlights that it's not the unbalancedness itself that's the problem when you're synthesizing data; one problem is errors in the model itself. You may ask, OK, great, why don't you just fix the model? You can't actually ever do that. You can improve the model, but models are always approximations of reality. I can improve the model to deal with some of the known errors, but the model will never be perfect; the model will never be reality. And therefore, if I have high-volume data, and in some parts of ecology we are getting to big data, I will always create a conflict between the model and reality at some point, because the model is imperfect.
So it always becomes effectively unbalanced at some point. Here's a simple example of what this looks like: model error plus unbalanced data. The high-volume data we fit great. For the low-volume data, here's the truth and here's what the model tried to reconstruct, and that reconstruction is driven by the high-volume data much more than by the true observations. This is the classic thing we see a lot: models ignoring low-volume information. Then we asked, well, what if the data isn't perfect either? I highlighted that possibility earlier. So we ran a couple more simulations. First, the model is perfect and the data sets are balanced, so we have the same volumes of information, but now there's a bias in the data. We can get a good fit, but we can't recover the exact true parameters, because the data itself has a systematic error in it. Still, we can get a good fit that we can use to make good predictions. On the flip side, if the data is unbalanced and there's some bias in the data, even if we're fitting the true model, the same model used to generate the truth, we can't get it to fit right once it's unbalanced. And then, obviously, if we have errors in the data and errors in the model, it doesn't work, especially when it's unbalanced. Here's another example: errors in the model, unbalanced data, errors in the data. We hit the high-volume data well, but jeez, we don't even have the direction of what's going on in this term correct. So what we then asked is: what if we build corrections for the systematic errors in the data or in the models into the likelihood itself? Here, "linear model" means that we've put a linear bias-correction model into the likelihood. The data might be additively biased; it might be multiplicatively biased. So we just put both in and explored.
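A minimal sketch of what such a linear bias correction looks like: assume the observations relate to the true state as y = a + b·truth + noise, and estimate the additive bias a and multiplicative bias b rather than assuming the data measures the truth directly. The numbers below are invented; here a and b are recovered by ordinary least squares, which is the maximum-likelihood fit under Gaussian noise, whereas a full calibration would estimate them jointly with the process-model parameters inside the likelihood.

```python
import numpy as np

# Simulate biased observations of a known "true" state:
# obs = a_true + b_true * truth + Gaussian noise (all values illustrative)
rng = np.random.default_rng(7)
truth = np.linspace(0, 10, 200)   # stand-in for model-predicted states
a_true, b_true = 1.5, 0.8         # invented additive and multiplicative biases
obs = a_true + b_true * truth + rng.normal(0, 0.3, truth.size)

# Estimate the bias parameters by least squares (ML under Gaussian noise)
X = np.column_stack([np.ones_like(truth), truth])
(a_hat, b_hat), *_ = np.linalg.lstsq(X, obs, rcond=None)

corrected = (obs - a_hat) / b_hat  # bias-corrected observations
print(round(a_hat, 2), round(b_hat, 2))
```

This mirrors the result in the pseudodata experiments: with the bias model in the likelihood you can recover the state well, even though the raw observations are systematically off.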
All of these are cases where the data is unbalanced, but we explored cases where there was an error in the model, where there were errors in the data, and where there were errors in both the model and the data. In all of these cases, we can actually get good fits to the data once we include the fact that there are systematic errors as part of the data model itself, as part of the likelihood. We never recover the true parameters when there are biases, but we can get good fits. So here's an example: NEE, soil carbon, vegetation carbon, low-volume data, biased data. We can get things to behave right, and we can actually get the bias correction to recover the state correctly. So some of the take-homes were that, while information content is important to fusing data, the errors and biases in the models and the data were actually much more important. Perfect models can deal with unbalanced data, though if you don't account for the autocorrelation you can still end up with overconfident estimates of your parameters. And building the bias correction into the calibration can lead to good performance of your model, so you can make decent projections, but you're never actually recovering the true system.