So, if you've had a lot thrown at you already, I'm not going to make that any better. I'm going to talk about the emerging world of remote sensing data for biodiversity in the context of the larger space of biodiversity data, and about some of the basic issues associated with big data that emerge in trying to do global ecology. I'm going to talk about some of the roles that open data plays; I'm going to try to argue that it's not just a nice thing that lets the odd graduate student write a cool paper by mining data from the internet, but actually essential to the enterprise. And then I'm also going to talk about some of the different ways in which data can be big. I'm specifically going to follow up on what Nathan talked about with imaging spectroscopy, because that's really the primary type of data we use directly in the emerging science of global biodiversity.

Just as a little bit of background, JPL is an extremely curious place. It's driven generations of management scientists mad trying to understand how it's run. I've been there for four years and I don't actually understand the difference between my job and my role, but my function is to supervise our carbon and ecosystems group. NASA has a science focus area called carbon and ecosystems that includes biogeochemistry, terrestrial ecology, marine ecology and biogeochemistry, and biodiversity, and I manage most of the people at JPL, about 40 of them, who work primarily in those areas. And by manage, I mean I approve their travel, I try to keep them from competing against each other for the same grant competitions, and I make sure that we have seminars on a weekly basis. But the work that I'm going to talk about is very much the function of our group. We look at the planet as a whole. We have scientists at the lab who work on biology and on atmospheric and ocean physics, and we don't have any of the traditional disciplinary boundaries. As I mentioned, the terrestrial and marine ecologists are integrated together, both in NASA's focus area and in the group at JPL. I'm going to focus, however, primarily on our work in terrestrial ecology and terrestrial biodiversity, not because I don't want to go in the marine direction, although it's clearly less relevant here at NEON, but just so I can give you a coherent example. I'm going to focus on biodiversity, which is really an emerging major topic for us because of recent and quite dramatic advances in technology, specifically the ability to make strong inferences about plant diversity, and habitat diversity for other types of organisms, from imaging spectroscopy data.

Our concern, in the NASA context, is very much because we see biodiversity as part of the Earth system. The functioning of the planet depends on the distribution of organisms with different functional characteristics, organisms that are now experiencing a really unprecedented rate of change, both in terms of habitat and atmospheric composition and in terms of climate change. This figure from Scott Loarie and colleagues from a number of years ago is a nice representation. The way to think about it is as a map of how fast climate zones are moving: if you grew up in a town in Minnesota where the mean annual temperature was 15 degrees, how much further to the north would that 15-degree isoline be 20 or 25 years later?
Over much of the world, those temperature zones are moving at 10 kilometers a year. Now, 10 kilometers a year is 10 times the rate of climate change at the last deglaciation, so that's a rate of climate change an order of magnitude larger than anything the organisms and ecosystems around us today have ever experienced. Even the yellow, one kilometer a year, is as fast as the changes that occurred going into or out of an ice age; you might think of one kilometer a year as the upper limit of the natural rate of climate change over the last 800,000 to 2 million years. The fact that the rate is relatively low in the mountains, shown in blue (and by the way, these colors are off; what looks sort of greenish yellow is actually supposed to be blue; sorry, Andy, I don't know how that happened), is not encouraging, because if climate zones in the mountains move a kilometer over, they're going to move 500 or 600 meters up, and there's only so much up you can go before you run out of up and slide off the top of the mountain. In fact, in South America, about 25% of the threatened or endangered species are in these Andean foothill habitats, where temperature zones are rapidly pushing organisms into very fragmented and dissected landscapes where the soils aren't even suitable, and probably won't be for 10 or 20 thousand years, for many of the plants trying to move thermally up into those regions. So NASA clearly doesn't have a biodiversity mission, but it does have an Earth system mission, and that's the dimension of biodiversity we tend to focus on.

So why big data? There's been brilliant work on biodiversity done in experiments led by an individual investigator or by a group of jointly funded investigators, and there are networks of investigators that have shared and collaborated on data. Why do we need big data? There are basically several reasons. The first is that the questions we're asking now, for example what will happen to the macro-zonation of biodiversity, to the distribution of species over hundreds or thousands of kilometers, and how those changes to organisms will affect the functioning of the ecosystems they live in, simply can't be addressed with the amount of data that an individual investigator, or even a group of jointly funded investigators at typical funding levels, can collect. They simply cannot collect enough data to address the time and space scales involved. So the community has evolved a number of mechanisms for pooling data, and I'll go through some of these and the roles they play. The other issue I want to raise, and this moves into the discussion of open data, is that the answers ecologists are finding as they begin to look at large-scale patterns and threats to biodiversity are often embedded in very contentious management and policy debates. As I'll try to illustrate, once scientific data become a basis for decision making, whether the data were collected for that purpose or not, transparency becomes absolutely essential. In a multi-party decision-making process that might be adversarial, it's simply no longer acceptable in most cases for one group to say, well, I've published the results but you can't see the data, because it's been peer reviewed and that's good enough. That used to be good enough; it's no longer considered good enough.
On transparency: the other role that I play is as editor-in-chief of the journal Ecological Applications, where we confront this with typically one or two papers in nearly every issue, and certainly multiple times a year. We have a policy that says that, as a condition of publication, all primary data must be archived. Now, there are exceptions to this. For example, some scientists purchase data from national or private organizations under a do-not-redistribute clause, and that's acceptable; we have a way of accounting for it by requiring that the metadata needed to make exactly the same query be archived. Some data are legally protected, about human subjects or endangered species, and you have to take care of that. But otherwise, archived data need to be sufficiently complete that the analyses in the publication can be repeated by an independent scientist.

Now, you probably have heard, and will hear, a lot about the science-driven reasons for open data: reuse, meta-analysis, synthesis, so the data aren't lost, so people can go back and look at patterns as they change over time, for use in education. There's a whole host of reasons that drive most of the open data movement. But in global ecology and in applied ecology, the overwhelming reason why we make people go to the trouble of archiving their data and require them to open their data, which in many cases people are reluctant to do because it's actually a significant task, is transparency. That is, if those data at some point become involved in a decision process, they have no value if they're not accessible to all parties. So, again, you'll hear a lot in the context of NEON about the value of NEON's open data policy for researchers, but the other really overwhelming reason for environmental data to be open is to make it available for use in decision processes.

Now, my experience goes back to NASA's early experiments with open data. I signed up to an open data policy on my first NASA project in 1984, and NASA Earth Science formulated a discipline-wide Earth Science data policy in 1991. So the agency has had an absolutely rigorous open data policy for over 25 years and was experimenting with it for nearly a decade before that. The purpose of the program is to make data available to the scientific community and for the management of our home planet; NASA promotes the full and open sharing of all data. NASA was the first federal agency in the US, and the first agency in the world, to institute an open data policy. So we have a lot of experience with it, and it's been tremendously successful. It's also unequivocal: NASA is committed to the full and open sharing of science data from all NASA missions, orbital, suborbital, and in situ, including field campaigns. There is no period of exclusive access to the data. The data are released as soon as they are QA/QC'd, that is, as soon as putting them in the hands of the community isn't going to create more work than it saves. In this regard, the NASA Earth Science policy is substantially more open than those of the other science disciplines within the agency, many of which still permit, although they don't encourage, a period of exclusive access.
And it's a very strong data policy. It requires the source code for algorithms, the coefficients used in the conversion of primary data to data products, and any ancillary data, for example cloud cover, atmospheric water vapor, or surface albedo, used in generating the products, so that the entire process of data product generation can be independently repeated. The number of cases in which this data policy has created any kind of problem for practicing scientists, a PhD dissertation or a scientific result being scooped and so on, you can literally count on the fingers of one hand. It's been incredibly successful in promoting collaboration, and incredibly successful in promoting the credibility of the data. NASA is the only federal agency whose climate change program is not under direct fire from a conservative Congress. It's been an enormously successful program.

And the data are very, very big, but they build on a number of really important additional data sets. So I'm going to start walking through some of the ways in which we've approached understanding global terrestrial ecosystems, beginning with the mobilized data sets, data sets that are fundamentally built up from measurements collected by individuals or small groups. For example, the FluxNet sites, which use eddy covariance to measure carbon dioxide, water vapor, and energy exchange. TRY, which is a database of measurements collected by, in this case, tens of thousands of scientists measuring the traits of individual organisms, mainly plants, things like leaf area, nitrogen concentration, or root-to-shoot ratio; again, data and metadata available online. And finally the Map of Life, which is a compilation of species distribution and species occurrence data sets. It integrates field studies; it integrates floras and faunas, that is, range maps developed by field scientists or by expert judgment; and it includes museum specimens together with their geographic information, which varies widely. The geographic information associated with a museum specimen from, let's say, the 1840s might say "collected in Australia," whereas a specimen collected in the 2000s will typically have coordinates determined by GPS, perhaps to the meter. So it's a very heterogeneous data set.

But when you look at these data sets, what we learn is that we have an enormous data gap. And I just have to say, as a slightly humorous aside, that this is one of the few ecological problems that NEON will actually make worse, not better, because the overwhelming patterns are geographic. The areas of highest diversity and carbon storage, the most life-rich areas in the world, are in the humid tropics. When we look at the carbon cycle, the other region that's extremely carbon rich is the high latitudes, the Arctic-boreal zone, the boreal forest and the Arctic tundra, where there are huge stores of carbon in soils. So let's look at two measures of ecosystem dynamics. This plot shows, in the gold line, the best current estimate of evapotranspiration as a function of latitude, and in blue, the best estimate of photosynthesis, gross primary productivity. The red bars are the number of eddy covariance sites per degree of latitude, not per unit area, although we can convert easily to that.
What you see, and you can see it on this map as well, is that most of that type of research is done in the mid-latitudes, which is neither where diversity is at a maximum, nor where carbon storage is at a maximum, nor where carbon fluxes are at a maximum. In fact, from the point of view of global planetary metabolism, most of our data come from a rather uninteresting part of the world. Of course, NEON is going to add a little bit of intensity to these bars; it'll put one site out here and one or two sites up in the high latitudes, and those sites are crucial because they'll be among the most comprehensive sites in those regions.

Now, when we look at carbon storage, we see something even more dramatic. This gold line shows our best estimate, by degree of latitude, of global terrestrial carbon storage, with the blue line showing the amount of that carbon stored in above-ground biomass, primarily as wood and secondarily in foliage and so on. Here, the red bars show the distribution of global operational forest inventory plots. These are plots primarily maintained by forestry agencies, where forests are censused every five or ten years, depending on the region and nation. What you can see is that almost all of this data is in a relatively low-biomass, low-carbon-storage part of the world, the mid-latitudes. These plots are primarily in wealthy countries where forest management is a developed industrial practice. The units here are plots per thousand square kilometers, and you can see that at the peak, in the northern hemisphere mid-latitudes, we might have 15 to 25 plots per thousand square kilometers. In the tropics, even with the recent expansion of the Brazilian forest inventory, we're down at a few tenths of a plot per thousand square kilometers. So the vast majority of global biomass and global carbon fluxes are not sampled. Now, we do have data there: if you plot the points in climate space, you'll see points everywhere in climate space. But when you think about how heterogeneous these environments are, and you realize that most scientists think even the mid-latitude densities are inadequate, you realize that we actually know almost nothing, from direct observation, about what's going on at high and low latitudes.

Now, how about diversity? What does the data gap look like for diversity? This plot is from a recent paper in one of the Nature journals, and it looks at the distribution of species, the number of species as a function of degree of latitude. But instead of showing a simple curve, it shows the range of species richness per unit area, alpha diversity, as a function of latitude. You can see that at high latitudes in both hemispheres, not only is the number of species low, but the variability of that number between different one-degree grid cells is rather low. These data essentially come from the metadata, not actually even the data, of the Map of Life. And one parenthetical point I want to make here is that an awful lot of insight can be gained working with good metadata; without even diving into the data, in some cases, we've already learned a tremendous amount. Now, these lower panels show, in that same space, per degree of latitude but on a one-degree gridded basis, the number of species for which there is at least one trait measurement.
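Just to show the bookkeeping behind a statistic like that, here is a minimal sketch built from a toy occurrence table and a toy trait table. The column names and numbers are hypothetical; the real analysis draws on the Map of Life and TRY compilations.

```python
import pandas as pd

# Toy occurrence table: one row per species occurrence, with coordinates.
occurrences = pd.DataFrame({
    "species": ["A", "B", "C", "A", "D"],
    "lat": [4.2, 4.7, 4.1, 51.3, 51.8],
    "lon": [-70.1, -70.5, -70.9, 10.2, 10.6],
})
# Toy trait table: a species counts as "covered" if it has any trait measurement.
traits = pd.DataFrame({
    "species": ["A", "D"],
    "trait": ["leaf_N", "SLA"],
})

occurrences["lat_bin"] = occurrences["lat"].floordiv(1).astype(int)   # one-degree latitude bands
occurrences["has_trait"] = occurrences["species"].isin(traits["species"])

# One row per (band, species), then count species and trait-covered species per band.
cell_species = occurrences.drop_duplicates(["lat_bin", "species"])
coverage = cell_species.groupby("lat_bin").agg(
    n_species=("species", "nunique"),
    n_with_trait=("has_trait", "sum"),
)
print(coverage)
```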
So what those lower panels mean is that, down here, only a handful of species per cell, 5, 10, 15, have even one plant trait measured, and it might be nitrogen in one case, root-to-shoot ratio in another, chlorophyll concentration in another, and flowering habit in yet another. This just tells us that for a very small fraction of the species present, we know at least something more than taxonomic identity. And of course, the biggest gaps in information are, again, in the tropics, followed in this case by the high latitudes. For most plant physiologists, plant demographers, and botanists, you can actually construct a distance metric, and what you find is that most ecologists collect most of their data within a 500 kilometer radius of their home institution. So we're really homebodies. In an ecological sense, we're optimizing our scientific foraging behavior by not traveling too far to collect data, so that airline miles per publication is minimized. And yet, what that does is lead to the most ecologically significant parts of the world being grossly undersampled. Now, some of that reflects genuine logistical challenges: if you think about trying to get into the interior of the Amazon basin, or Borneo, or central Siberia, it's actually much easier to go botanize in Ohio or Florida, and that's even more dramatically true for process-oriented measurements like eddy covariance. So this pattern reflects both our homebody behavior and the tendency of national funding agencies to invest most of their research money in their home territory, but it also dramatically reflects the challenge of gathering data at high and low latitudes.

That sounds like a terrible problem, unless you work for a space agency, in which case it's an opportunity, because, cloud cover and incident sunlight being equal, it's more or less equally easy to collect data anywhere in the world. High-latitude winter is challenging, and really cloudy places are a little bit harder, but if you're up there for long enough, you're going to be able to get data. So this data gap figure was viewed, in many ways, as a real challenge and a problem for the community. But from the NASA point of view, from the remote sensing point of view, and I would argue from the NEON airborne remote sensing point of view, this is a really good thing. If these hard-to-sample places had already been really well sampled, the situation would be completely different, and it would be much harder to justify the investments needed to collect data using remote sensing. The argument for using remote sensing in these areas is not so much that scientists are homebodies, but that these are very, very difficult places to work, and pretty easy places to just orbit over.

Now, this next slide shows the data gap for biodiversity in a different way. These points up here show the central locations of all of the data sets that allow us to get a regional richness for vascular plants, the roughly quarter million or so vascular plant species in the world. So if you want to think about each of those points, you can think of one of them as being the flora of Colorado. You can count up the number of species listed in the flora of Colorado, bearing in mind that they range from alpine cushion plants that grow at 14,000 feet, to lodgepole pines and ponderosa pines, out to Opuntia cactus that live in the shortgrass prairie.
All of those now get turned into one number: how many vascular plant species grow in Colorado? That's what one of those points up there is, the flora of Colorado. These are numbers derived from floras, from monographs, from museum collections, and those are the points available for characterizing the vascular plant species richness of the planet, to produce maps that look like this. This map is based on a model that relates biodiversity to environment through a fairly sophisticated regression done by Walter Jetz's group. It basically takes temperature, rainfall, an estimate of primary productivity, and a number of other predictor variables that are associated geographically with these points, bearing in mind that the flora of Colorado is somehow being correlated with the mean climate of the state of Colorado, which, if you were to drive up Trail Ridge Road today at 13,000 feet, would feel a little different than it does right down here. It'd still be pretty warm, though. When you build a model like that, you get something like this, and when you use kriging or other geostatistical techniques to smooth things out, you get something that looks like that, with a bit more spatial detail in it.

But this is the best we can do in terms of inferring global patterns of diversity, and the data on which it's based are problematic, for example treating the flora of Siberia as if it gave you one number, or the flora of Colorado as if it made sense to lump alpine with desert vegetation. This is the best we can do, and we have absolutely no ability to monitor change, because building this map requires using data some of which was collected by Linnaeus, and some of which was collected by Alexander von Humboldt in South America. So this is some sort of average picture from the beginnings of taxonomy to the present day, and that's the pace at which we collect this sort of data when we go to the field. We have no ability to monitor change except at very specific locations. We can monitor change in diversity at experimental sites, and we can monitor change in diversity in a subset of the U.S. national parks, but we have no idea what's happening, for example, in the Andean foothills, which are among the most botanically diverse regions of the world and where rates of climate change are actually very high.
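Just to make that modeling workflow concrete, here is a minimal sketch of the regression-plus-interpolation idea on synthetic data. The predictor set, the random-forest choice, and the Gaussian smoother (a crude stand-in for kriging) are my illustrations, not the actual Jetz-group model.

```python
# Sketch: predict regional vascular-plant richness from climate/productivity
# predictors, then smooth the gridded prediction. Illustration only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

# Training data: one row per flora or checklist (e.g., "flora of Colorado")
n_floras = 300
X_train = np.column_stack([
    rng.uniform(-10, 28, n_floras),    # mean annual temperature (deg C)
    rng.uniform(100, 3500, n_floras),  # annual precipitation (mm)
    rng.uniform(0, 1500, n_floras),    # net primary productivity (gC m-2 yr-1)
])
# Synthetic richness: warmer, wetter, more productive means more species
y_train = (50 + 40 * X_train[:, 0] + 0.5 * X_train[:, 1] + 2 * X_train[:, 2]
           + rng.normal(0, 300, n_floras)).clip(min=10)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Predict onto a one-degree global grid of the same predictors
n_lat, n_lon = 180, 360
grid_predictors = np.column_stack([
    rng.uniform(-10, 28, n_lat * n_lon),
    rng.uniform(100, 3500, n_lat * n_lon),
    rng.uniform(0, 1500, n_lat * n_lon),
])
richness_map = model.predict(grid_predictors).reshape(n_lat, n_lon)

# Smooth to add spatial coherence (a real analysis would use kriging instead)
richness_smoothed = gaussian_filter(richness_map, sigma=2)
print(richness_smoothed.shape, richness_smoothed.mean())
```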
So again, to the people involved in international treaty activities around biodiversity, that looks like a terrible problem, and they spend a lot of time worrying about it and trying to come up with indices and proxies and ways of extrapolating from very limited data to much larger areas. But to a space agency, to use our systems engineering language, this is the second requirement for our mission: the first is that it provide global coverage, and the second is that it be able to detect change in global vascular plant species diversity. It doesn't actually have to get either number right; it has to be able to map global patterns, something that's correlated with richness, and if richness changes in some significant way, we need to be able to detect it.

One way of thinking about this is that a lot of what we know about the functioning of a species doesn't come from laboratory physiology. Again, there are roughly a quarter million vascular plant species, and something like five percent of them have ever seen the inside of a laboratory. Typically, when we make inferences about a plant's adaptation to drought or high temperature or low temperature, we do it by correlating with its range, and we can do that because we assume the present-day distribution of that organism is at equilibrium with some sort of stable climate. That assumption underlies the whole business of species distribution modeling, and it underlies much of what's encoded into what are called dynamic global vegetation models, because for most species and functional types we don't have enough laboratory data, experimental data, or even in situ field observations of process to figure out how the plant actually responds to the environment. But if one of those climatic limiting factors, or other limiting factors, is moving faster than the species can migrate, then its observed relationship to climate no longer reflects its tolerance. Remember, if climate is moving at 10 kilometers a year but a species can only move at, say, one and a half kilometers a year, then that species will be out of equilibrium; it will no longer be correlated with a temperature that's meaningfully related to its function. And so the longer we wait to collect global data, the less information those data contain. Greg Asner, Paul Moorcroft, and I have written a somewhat controversial paper about this, controversial I think because it's been misunderstood, but I can give Leah the reference if anybody's interested; it's easy to find, Schimel, Moorcroft, and Asner, and this silly little cartoon, which was drawn here at NEON, is one of the figures in the paper. This notion of inferring function from distribution is really key.

Now, remote sensing can't solve this problem by itself, but it could be interwoven with phylogenetic data, which I haven't talked about, the genomic and phylogenetic data we also use in these studies; with plant trait data, as exemplified by the TRY database, the largest in the world; and with species distribution data, for which my research group primarily uses the Map of Life compilation. Remote sensing could contribute by mapping the functional diversity of vegetation in terms of variables that we can now quite confidently retrieve from imaging spectrometers and airborne lidar instruments. That includes functional diversity as the summation of a group of variables that includes leaf mass per unit area, which is related to growth strategy, fast versus slow; foliar nitrogen; non-structural carbohydrates; and a wide variety of other vegetation characteristics. We can sum the diversity of those in some multivariate fashion, and we can begin to see how functionally diverse any given region of the world is.

This is just a visual representation of what that looks like: an RGB image of chlorophyll, nitrogen, and leaf mass per unit area for a monospecific plantation and for a tropical forest that has approximately 400 species per square kilometer. When you take these functional retrievals, you can see that if you sum this one into a diversity index you're going to get a low number, whereas if you sum this one you're going to get a much higher number, and it will obviously have different spatial statistics as well. Greg Asner's group has shown that you can go from spectroscopy, through a statistical manipulation, without ever retrieving taxonomic identity, to an estimate of spectral alpha diversity, that is, the number of spectrally distinct types present within the scene, and you can also calculate the turnover, or beta diversity, of that scene. And field-based alpha diversity, measured on plots on the ground, and spectroscopic diversity, even though the spectroscopic method does not retrieve taxonomic identity, are very highly correlated.

So think about going into an area like the Amazon. Just to go back a few slides, I want to remind you of something: look at the number of points in the Amazon basin for which there are alpha diversity data. There are none. It's a little better than that, but basically the entire Amazon basin is represented by a single flora, so whatever we know about diversity there has no spatial information; we're treating a basin we know is heterogeneous in many ways as if it were a single green slime with high diversity. The role of remote sensing is potentially to change this. Now, many of you are probably more familiar with traditional means of remote sensing, such as the Landsat instrument, which has fewer than 10 independent bands, or the MODIS instrument, which for vegetation has a similar number of independent measurements, whereas imaging spectrometers have somewhere between two and four hundred spectral channels. So what's the difference in information? If we were using spectroscopy to try to map global patterns of biodiversity, we would need to be able to identify, allowing for redundancy and overlap so the number comes down a bit, something like the 250,000-ish vascular plant species. With five or six independent pieces of information, and most Landsat and MODIS research is done with two, how many entities, how many different types, can we resolve? The typical answer is that MODIS is able to resolve fewer than 20 land cover types worldwide, using the most common MODIS land cover classification, the IGBP classification. Within some of those types you can get a bit of variability by looking at the woody cover fraction, but you can't identify alternate types; you can say more or fewer trees, but you don't know if they're oaks or maples. So a type like northern hemisphere mid-latitude forest is just one thing, even though there are hundreds of species and they differ between North America, Europe, and Asia. So we need a technology that will allow us to resolve, let's say, thousands of entities.
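Before getting into data volumes, here is a minimal sketch of the spectral-diversity idea from a moment ago: cluster pixel spectra into spectrally distinct types, with no taxonomic identification, and compute a diversity index per spatial window. The clustering method, window size, and Shannon index are illustrative choices on synthetic data, not the Asner group's published algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n_rows, n_cols, n_bands = 120, 120, 200          # small synthetic image cube
cube = rng.normal(size=(n_rows, n_cols, n_bands))

# Cluster all pixel spectra into "spectral species"
pixels = cube.reshape(-1, n_bands)
labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(pixels)
label_img = labels.reshape(n_rows, n_cols)

def shannon_alpha(window_labels):
    """Shannon diversity of spectral types within one window."""
    _, counts = np.unique(window_labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log(p)).sum()

# Alpha diversity on a grid of 30 x 30 pixel windows
win = 30
alpha = np.array([
    [shannon_alpha(label_img[i:i + win, j:j + win])
     for j in range(0, n_cols, win)]
    for i in range(0, n_rows, win)
])
print(alpha)   # higher values mean more spectrally diverse windows
```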
And the measure we use for this, and this is me sliding back into talking about big data, is dimensionality. But before I get to dimensionality, I want to talk about what's happening to remote sensing data. My mission, the one JPL hired me to work on, OCO-2, retrieves about a terabyte of data every day, 365 days a year. SMAP, which was launched about a year after OCO-2, is up at about two and a half terabytes of data a year. The next NASA-India remote sensing satellite, which will be a radar, is going to be at around 100 terabytes a day, two orders of magnitude more than anything we're doing now. And if we were collecting imaging spectroscopy with 15 to 30 meter pixels and 200 to 400 channels, we would also be in the range of 100 terabytes of data every day. So these data are going to be something like four to eight orders of magnitude larger than any biological database except the genomic sequence databases.

First of all, this is going to put ecologists in the big data game just in terms of technology, and in an interesting way. OCO-2 can actually process all of its data on a cluster purchased specifically for the mission, although we go to the NASA Pleiades supercomputer when we have to reprocess the entire dataset, and we are experimenting with doing it in the cloud. SMAP has a mission cluster, but the operational processing of its bulk soil moisture data is actually done in the Amazon and Google clouds. NISAR will have no NASA hardware; the data will go directly from the downlink reception station into the cloud and will be processed on as many processors as are required to keep up with 100 terabytes a day. So NASA is being forced, kicking and screaming I would say, into a very new era of computing, where it's no longer practical or affordable to buy and own computing hardware on which to process these kinds of data, and where the cloud provides a very workable solution. I'll come back to another reason why it's so workable. For HyspIRI, which would be our next-generation global imaging spectrometer, we're proposing an all-cloud solution; the data would never go to a NASA installation, ever. The reason is that, when you look at NISAR and HyspIRI, the datasets are too big to move. No principal investigator can meaningfully, in finite time, download enough of the information to do anything with it. So you have to have the data someplace where the investigator can get access to computing co-located with the data, so that you're not trying to move these datasets over the internet unless you're taking a very tiny subset. The cloud, again, is perfect for that, because as part of, for example, a science team member's funding package, they can be given a certain number of cloud computing hours. When I do this sort of work, I frequently use the NCAR computing facility. I never move data from NCAR back to JPL; I use the on-site visualization facility, the only things I send to Cheyenne, Wyoming, are scripts, and the only things I bring back are images and spreadsheets. That's the model these very big datasets force you into: you never want to move them around. Even with the most optimistic view of how fast the internet may become, it's enormously inefficient, and economically inefficient, to ever think about moving these datasets off-site for analysis. So the reason for moving into the cloud is not only to avoid spending millions and millions of dollars on computing hardware; it's also to avoid spending millions and millions of dollars on data transmission. It's much cheaper to give investigators, or let investigators buy, time in the cloud.

But ecological data tend to be not just big but complicated. If you think about radar, you have basically one number per pixel, a backscatter; it's not really one number, it's more like 10 or 15 numbers. For imaging spectroscopy you have several hundred numbers per pixel, and those numbers don't reflect one thing, roughness or height or albedo; they reflect the underlying chemical and structural complexity of life. Again, think about the complexity, the dimensionality, of our current big data satellites, Landsat and MODIS. MODIS is able to resolve pretty effectively the 16 IGBP categories at one kilometer resolution. Landsat, which has 30 meter resolution, has a bit more information per unit area for a variety of reasons, including its excellent calibration; it's able to retrieve directly, without using any contextual information, the 40 top-level National Land Cover Dataset categories. There are hundreds of categories that are then inferred, but you get to those by knowing that you're in Idaho or Vermont; you can't do them purely from spectral information, they require additional contextual information. By contrast, if we want to do global functional diversity, we want to be able to resolve something that is proportional to, but certainly not identical to, the quarter million vascular plant species. Some of them are just too small to see; we're never going to see an orchid in the understory from space. Some of them look very much like each other; for example, even really good taxonomists can't tell willows apart, and I can tell you that with a field spectrometer you can't tell them apart either, at least I can't. But something like the 10 to 20 thousand overstory tree species in South America is close to being something we could actually resolve. So we need to go from dimensionality on the order of 10^1 to 10^2 up to order 10^3 to 10^5. How are we going to do that?

So I'm going to show you a dimensionality analysis of a recent AVIRIS field campaign. We corrected the data for topography, because a lot of these data are collected in mountainous landscapes; we corrected the data using a noise correction term, so that the raw noise is taken out with a really standard technique; and then we do a principal components analysis. The number of significant, that is, information-rich, principal components can be used to create a measure of dimensionality quite directly. By significant, in this case, we mean components that contain more variance than the noise correction: the signal characterized by the principal component has to exceed our estimate of total scene uncertainty, which is instrument noise plus uncorrected atmosphere plus uncorrected topography plus uncorrected albedo effects. And what we found, just to summarize, though I'll show you the whole analysis and then the somewhat surprising result, is that while it's been known for decades that Landsat- and MODIS-type data have approximately three significant principal components, the most recent Landsat being just a little bit better than that, we typically find 12 to 30 significant principal components in imaging spectroscopy data.
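As a minimal sketch of what that dimensionality calculation looks like, here is a PCA on a synthetic spectral cube where the noise level is assumed known; that assumption stands in for the real noise-estimation step (for example an MNF-style transform).

```python
import numpy as np

rng = np.random.default_rng(2)
n_pixels, n_bands, n_true, noise_sd = 20_000, 200, 12, 0.5

# Synthetic spectra: 12 underlying components of real variation plus noise
signal = rng.normal(size=(n_pixels, n_true)) @ rng.normal(size=(n_true, n_bands))
spectra = signal + noise_sd * rng.normal(size=(n_pixels, n_bands))

# PCA via eigen-decomposition of the band-by-band covariance matrix
cov = np.cov(spectra, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]            # component variances, descending

# "Significant" components carry more variance than the noise floor
n_significant = int(np.sum(eigvals > 3 * noise_sd**2))   # 3x margin, an arbitrary choice
frac = np.cumsum(eigvals) / eigvals.sum()
print("significant components:", n_significant)
print("components for 80% of variance:", int(np.searchsorted(frac, 0.8)) + 1)
```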
This is an analysis that Joe Boardman, who also works extensively with NEON, and I did for Jim Tucker, who was asking us about it. This is what an imaging spectrometer image cube looks like; I bet you've seen this, maybe the exact same image, before. The data we're talking about were collected with an instrument called AVIRIS Classic, which remains the gold standard for imaging spectroscopy even though it's 20 years old; it flies on the NASA research ER-2, often in tandem with other instruments. The data I'm going to show you come from the HyspIRI preparatory campaign, where flight lines covered about 40 percent of the state of California, and I think there are NEON flight lines on maybe this one as well. We used these data, which again were collected with an extremely well-calibrated instrument, in places where we typically had at least some sort of ground data and where we had a lot of experience, and we looked at a combination of the data all along a transect that spanned the state of California. The data I'm going to show you, and it turns out they give exactly the same results, come from just a small location within California.

If you look at the fraction of variance explained as a function of the number of principal components, a good rule of thumb is that 80 percent of variance explained is where you start losing signal. These aren't actually red and green here, but if you look at red and green, they correspond to either a 780 by 780 pixel subset or the full 160 kilometer transect, and they give about the same answer: at the 80 percent explained level we're at 10 to 12 significant principal components. Interestingly, that number is kind of fractal; you get about the same value when you're looking at a few thousand square kilometers or at about one square kilometer. It's also remarkably similar to values we've gotten looking at imaging spectroscopy over vegetated landscapes in the megadiverse tropics. So this 12 to 15 principal components explaining 80 percent of the variance is likely to be a rough estimator that's robust globally.

Just to give you an idea, this is what an RGB of principal components 13 through 15 looks like. I'll show you what noise looks like in a minute, but I can tell you that this is not noise; it's showing you variation associated with the topography and vegetation of the scene, and these are the mountains behind Santa Barbara. And this is what a sequence of images looks like. This is principal component 1, which is basically picking up the large-scale gradient in albedo associated with vegetation density. This is principal component 10, still clearly full of information. This is principal component 30; here you can start to see the speckle that's diagnostic of noise. Can you see, over the whole image, a kind of black-white alternation? That speckle is noise starting to creep in, and again it's a mix of instrument noise, uncorrected atmospheric effects, uncorrected topographic effects, and a variety of other things. And this is what noise looks like: principal component 150. By the time you get to 150, and remember we have about 200 channels, so we could go beyond 150, there's nothing of interest out here, and in fact there's nothing really very usable beyond 30.

But what if we degrade these data to Landsat? We take the 200 spectral channels and convolve them into the Landsat spectral channels.
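The spectral convolution step itself is simple. Here is a minimal sketch, with boxcar windows and approximate band edges standing in for the instrument's real spectral response functions.

```python
import numpy as np

rng = np.random.default_rng(3)
wavelengths = np.linspace(400, 2500, 200)           # nm, ~200 spectrometer channels
spectra = rng.uniform(0, 0.6, size=(10_000, 200))   # stand-in reflectance spectra

# Approximate Landsat OLI reflective band edges (nm), for illustration only
landsat_bands = {
    "blue":  (450, 515), "green": (525, 600), "red":   (630, 680),
    "nir":   (845, 885), "swir1": (1560, 1660), "swir2": (2100, 2300),
}

def convolve_to_band(spec, wl, lo, hi):
    """Boxcar average of all narrow channels that fall inside [lo, hi]."""
    mask = (wl >= lo) & (wl <= hi)
    return spec[:, mask].mean(axis=1)

broadband = np.column_stack([
    convolve_to_band(spectra, wavelengths, lo, hi)
    for lo, hi in landsat_bands.values()
])
print(broadband.shape)   # (10000, 6): six broad bands instead of 200 channels
```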
What we find is that at the 80 percent variance explained level we have two principal components. So the dimensionality here is going to be something like the ratio of 2 to 30, but I'll show you in a minute that the contrast is much more dramatic than that. And there might be signal out here; remember, we found with AVIRIS that there was signal beyond 80 percent of variance explained, and indeed there is. This is an RGB composite of principal components 1 to 3 for the data convolved to the Landsat bands, and this is a composite of 4 to 6, so we're not even close to the noise threshold. Landsat has not exploited all of the information, because it never gets down to the noise; you actually want the instrument to go all the way to noise to be sure you've captured all the information. Clearly 80 percent is too conservative, because really this appears to be about as signal rich as this.

So here's the point. Think about defining land cover types, or vegetation types, or plant functional types, in principal component space. Take the first principal component, which we can call albedo; in this case albedo is driven almost entirely by chlorophyll density. If you ratio that component to the noise, it can probably be broken up into a hundred chunks that are bigger than the noise threshold, so you could have a hundred different levels of albedo on that first principal component. The second principal component is a bit smaller, so maybe it only has 30 resolvable levels; now we have a hundred times 30 possible categories, 3,000 possible categories. If I add another axis with, let's say, 10 levels, now I have 30,000 possible categories. If I add another with 10 meaningful levels, again, each time I do this I'm adding roughly a power of 10. It turns out that this conversion from the number of principal components into the number of resolvable categories is approximated by the factorial of the number of principal components. So if you have 30 meaningful principal components, the theoretical maximum information content is approximated by 30 factorial, which is so big that you can't write it down on my computer screen, whereas if you're using Landsat the information content is approximated by 4 factorial. So the difference in information content, using only noise-corrected, meaningful information, signals larger than the noise threshold, is tens of orders of magnitude. And that is the theoretical basis on which imaging spectroscopy can approximate the diversity of life on Earth. Now, it turns out that most of that enormous space defined by 30 axes is empty: plants don't have high chlorophyll and low nitrogen, they don't have low albedo and high chlorophyll. Most of that space is empty, and most of that information can't be used. But still, the proportionality between the radiometric information content and the spectroscopic information content is orders of magnitude; it's not 2 or 4 relative to 30, it's more like 30 relative to a million. So the information content of spectroscopy can at least potentially meet some of our needs for looking at the diversity of life.
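A small worked version of that counting argument, using the illustrative level counts from above (roughly 100 levels of albedo, 30 on the second component, then around 10 each); these numbers describe the argument, not a measured property of any particular scene.

```python
from math import factorial
import numpy as np

levels = [100, 30, 10, 10]    # resolvable levels on the first few components
print(np.cumprod(levels))     # [   100   3000  30000 300000] categories so far

# Factorial approximation for the number of resolvable categories
print(factorial(4))           # Landsat-like, ~4 components: 24
print(factorial(30))          # spectroscopy, ~30 components: ~2.7e32
```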
So, as a summary: the high information content of spectroscopy, and we're talking now about high-performance, modern-day spectrometers that have very low noise, with signal-to-noise ratios of a thousand to one or so over vegetated surfaces, allows for discriminating a huge number of surface types, thousands to tens of thousands, with some degree of confidence. It can easily define hundreds of distinct plant functional types in terms of their chlorophyll, nitrogen, leaf mass per unit area, lignin, and non-structural carbohydrate concentrations. And our new algorithms can exploit this high dimensionality, but they do require big data approaches. If you think about doing one of these analyses, you're taking a matrix that is pixels in the north direction by pixels in the east direction by the spectral dimensionality, for the window size you wish to analyze, and you're inverting that matrix. Even though it's sparse, these are enormously time-consuming computations, so we really need big data approaches, and we also need big data management philosophies. AVIRIS vegetation data potentially have 30 factorial degrees of freedom, that is, 30 factorial resolvable categories; the actual number is something like a thousandth or a ten-thousandth of that, which is still enormously large. The degrees of freedom we're looking at are somewhere in the range of 10^12 to 10^18, compared to 10^2 or 10^3 for traditional remote sensing data.

We're able to develop standard products that take advantage of this high dimensionality. This is leaf nitrogen and leaf mass per unit area from two of these HyspIRI preparatory project flights, and these products use a substantial fraction of the information contained in the spectrum. We're also able to produce extremely valuable operational information. This, for the Sierra National Forest, is a map of mortality that occurred during the drought; it differences the fraction of live and dead vegetation in 2013 with the fraction of dead vegetation in 2015. We'll be doing this again with data that we're collecting this week, in fact maybe right now, I really hope right now. So this allows us to look at ecological change, and remember, being able to look at change is crucial. Here it's only a couple of tree species, but we can see them dying. I hate to call it a piece of good luck that we did this campaign over the course of the worst drought the continent has experienced in 50 years or so, but from a scientific point of view it's been a fantastic experiment.

We're doing the data analysis for this on NASA's Pleiades machine. Pleiades has at various times been in the top 10 supercomputers in the world; I think it has 250,000 active processors right now. To process the AVIRIS data set: a single snapshot is about a billion pixels. We fly California three times a year, and each time we capture about a billion pixels, each pixel with something like 220 channels, so 220 billion pieces of information per acquisition, three times a year, times four years. To process that in, again, finite time, and finite time looks like it's going to be about a year, we'll be running on between 10 and 30 thousand processors on Pleiades to do those matrix operations.
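For the record, the back-of-the-envelope version of that arithmetic looks like this; the two bytes per measurement (16-bit values) is my assumption for illustration.

```python
pixels_per_snapshot = 1e9        # roughly one billion pixels per California acquisition
channels = 220
acquisitions_per_year = 3
years = 4
bytes_per_value = 2              # assumed 16-bit radiance/reflectance values

n_values = pixels_per_snapshot * channels * acquisitions_per_year * years
print(f"{n_values:.2e} measurements")                        # ~2.6e12 values
print(f"{n_values * bytes_per_value / 1e12:.1f} TB, raw")    # ~5 TB before derived products
```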
So, just to summarize on big data: understanding the changing environment requires this kind of wall-to-wall information. You can gain process insights at individual field sites, but you don't know what's happening in the next county over, and we do know that it could be quite different, either because the rate of climate change is different, or the soils are different, or the forests are young instead of old. Wall-to-wall data can come from infrastructure; it can come from networks like NEON; it can come from remote sensing; it can come from one of the most innovative types of infrastructure, citizen science, which we know is a great way, in certain parts of the world, to get data over huge areas and over long periods of time. It can come from data sharing and mobilization, that is, acquiring data from museums, acquiring data from research collections held by individual investigators, and mobilizing data from field and process studies that are sufficiently similar to be integrated into common data sets. And wall-to-wall data can contribute to questions that simply cannot be addressed any other way. These questions are increasingly on people's minds, both for basic understanding, how does the planet work and how is it changing, and for management: where are the areas of highest risk, and where are the opportunities for intervention? Finally, big data may be high volume, or high dimensionality, or both, and imaging spectroscopy and genomic data are the two type examples of data that are both very large, in terms of terabytes or petabytes or exabytes, and very complex in terms of the structure of each individual record.

Thinking about open data: open data is crucial for the global science enterprise, because the sorts of analyses I've shown you, which initially were just of where we have data and where we don't, couldn't be done if people weren't mobilizing their data and metadata, and it's really only in the last one to three years that people have posted enough of that sort of metadata that we could even assess what we knew and what we didn't. Open data allows mobilization of data toward these big questions, and that's really crucial. Even though data sets like TRY and Map of Life are in some sense too sparse, they allow us at least a first look at patterns we really had no idea about before. If you think about those carbon storage figures, where I showed you results from the distribution of something like three million forest inventory plots: the first publication that tried to come up with a global terrestrial biomass inventory was done by Robert Whittaker, actually, and it was based on 17 sites. So since the 1960s we've gone from 17 sites, which I can now say confidently are highly non-representative, to something like three million sites; from three million sites to several billion satellite measurements, using radar and lidar, is the next step. Open data, and I really can't emphasize this too much, is also essential when science data are used for decision support, and increasingly we're all going to be dealing with that. There's a quote I found when I was doing research on this, from a troll on the internet, actually, but he wasn't wrong. He said, data that are not readily available should not be viewed as credible in decision making. That is the emerging view, and even though this guy was a troll, he wasn't wrong about it; that view is held by many. In any case, if you get into an adversarial environmental decision, there will be trolls involved, and the only way you can actually defeat that argument is for the data to be readily available. The greatest case study of this is when all of the global temperature data were turned over to the skeptic-led Berkeley Earth project, which intended to disprove the observed pattern of global warming that had been put together by NOAA, the Hadley Centre, and NASA. They found that they got exactly the same pattern, and they were unable to disprove the temperature trend over the past century; in fact, they were able to confirm it with greater statistical confidence, because they had access to all of the data.

Finally: so open data is crucial, but open data is expensive, and I want to make this point too, because it's expensive for a network or a principal investigator to make their data open. It's not that people used to keep data closed because they were grinches who wanted to maintain secrecy so they could write more papers. The big reason, when you do surveys of scientists, that they don't open their data up is that it's a significant additional task to publish high-quality metadata and to clean the data so that it's suitable for online publication or other types of publication. It's an extra task, but it adds value to the science enterprise, and it's extremely important for networks and funding agencies to understand that you can't just say, oh, and by the way, you need to make your data available. It is a significant cost, and particularly when that cost is spread over thousands of investigators, it's a meaningful cost. The question is how you account for the value added by that open data, which raises the question: do all data need to be openly available? Right now we don't have good decision rules for which data have lasting value and which are probably only of value to the publication they support. But given that something like five percent of publications accumulate 95 percent of citations, it may well be that there are algorithmic ways of figuring out which data sets are worth putting in the public domain and which are worth holding onto in some lightweight sort of way. So does all data need to be open, and how can we tell? We really don't know the answer. There are enough just-so stories where raw data were lost, or hidden, or poorly documented, and then became vital, that right now our default policy is to say we can't really predict in advance which data will have added value in the future and which won't, so we're going to require that they all be available. And that's leading to two things: one, rapid growth in storage charges, and two, almost certainly, the publication of many data sets that are very poorly documented and are not, in fact, actually reusable. So this is the next question, after you philosophically get your head around the idea that the default ought to be for data to be open: how open, and do they all really need to be open?

I want to wrap up there, going back to the blue marble, and remind you that the reason we need big, open data is that the phenomena we're studying are interconnected. They're linked by the physical circulation of the atmosphere and the oceans; these two continents are linked by the migration of countless organisms that fly or swim between them. We are now stepping back and beginning to look at scales of time and space where those connections can't be ignored, where they in fact become central. I think that NEON is a visionary experiment in doing that at an extremely large, although not global, scale, and it's widely viewed in the community, and this may be more visible from the outside than from the inside, as a really visionary precursor to an integrated global network of networks, tied together by remote sensing to measure in between the detailed measurements that we of necessity make on the ground. So thanks, y'all; that's where I want to finish up.