Well, thanks Kevin for inviting me here, and glad to see all of you here showing some interest in what I have done so far, although that doesn't mean much. What we have been trying to do over the last four or five years is create this platform here at NASA Ames called the NASA Earth Exchange (NEX), and the whole idea was to bring the research community together and do cool stuff, essentially. Coming from a research background, during my PhD I spent almost 80% of my time actually trying to download data, filter it, decompress files, and so on, and only 20% doing real research. My advisor was not happy with me; every ten minutes he would come by asking, "Anything to see?" I had to tell him it was very difficult: we were dealing with Landsat data, we had to do all the atmospheric correction, the code wasn't ready, and there was no open source code at the time. I'm talking about ten years back.

So today I want to talk to you about what we have been doing at the NASA Earth Exchange: what kinds of big data problems we're handling, what high-performance computing architectures we're using, and what we are doing in machine learning, because it's getting more popular nowadays. I was telling Kevin just now that for the last 30 years we both have actually been doing machine learning, although we didn't realize it; it started with unsupervised and supervised classification a long time back.

Here's something interesting I was thinking about yesterday: should I call it big data or fat data? I like the phrase fat data: it's lots of 3D stacks, which makes it look fat. Or intelligent data, or large data, large matrices, or massive data, 4D stacks; you can use all sorts of words, and they make a nice acronym, FILM. And yes, it's a lot of data too. Then I was thinking the other day, how about cloud computing, let's give it another adjective: thunder computing. I haven't heard it anywhere, so that's my copyright. And then we have high-performance computing, which I still like; I don't have another name for that.

I showed this last time I was here at the GIF, and it still holds. Earth science data is still big, and it is big. Models and analytics are core components of dealing with earth science data. Data center infrastructure we have been dealing with for a long time now: how you bring all the data from the satellites down to data centers and then move it between centers like NASA and USGS, which means a lot of bandwidth. High-performance computing solutions are necessary for actionable intelligence, as I write here, because slowly we're realizing we're in this deep data space where we need to implement algorithms in a much faster and more efficient manner, so you need a high-performance computing infrastructure, whether a public cloud or your own private cloud. And finally, live data visualization, which is very important nowadays to really make sense of the data.
So the challenge here is storing more data, accessing large quantities of data faster, understanding better what the data tells us, whether it's structured or unstructured, and integrating cloud computing solutions and data access methodologies efficiently, following industry standards. That's very high level, but can we really solve it? Yes. We have lots of tools and techniques now, and everyone more or less knows how to do these things if you're a programmer; you're already doing a lot of this here at BIDS, so I won't go into detail. What are some of the techniques? To access and sort data efficiently, you can deploy open source solutions like Apache Hadoop or MongoDB to build the data structures you need; improve legacy infrastructure to meet the demands of real-time analytics; scale-out solid-state storage is becoming big; and you can deploy cloud computing solutions. At NASA Ames, as you may know, we helped start OpenStack, and we still have it running there, and then there's AWS, Rackspace, Microsoft Azure, whatever. Then you need efficient learning algorithms, which is what we have been trying to do a lot over the last two years: implement scalable machine learning algorithms to deal with satellite data. And for visualizing the data you have a lot of solutions, Tableau, Highcharts, and so on.

What the NASA Earth Exchange essentially does is provide science as a service to the earth science community. We provide a lot of code and a lot of data, on the order of petabytes now: all the satellite data from the land imaging sensors, be it Landsat or MODIS, and now we're bringing in the Sentinel-2 data as well, a European Space Agency satellite that's getting pretty good imagery. We have all the climate model output data, with different models running on the NASA Earth Exchange. Beyond providing access to code and data, we're also creating workflows: if you want to recreate a specific workflow from a given article or paper, you can use that workflow to reproduce the whole process and generate results in no time. So NEX provides access to a wide variety of ready-to-use data, provides the ability to bring code to data, and offers capabilities for reproducing science through virtual machines and scientific workflows. We're creating a lot of these virtual machines using a Docker-style interface, where you can take one, transport it to a machine, and do your work. It also offers state-of-the-art compute capabilities through our Pleiades supercomputing facility. A small sketch of the reproducible-workflow idea follows.
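To make that reproducible-workflow idea concrete, here is a minimal sketch of re-running a published analysis inside a container against a shared data pool. The image name, mount path, and flags are hypothetical; only the docker CLI itself is real.

```python
# Hypothetical example: re-run a published NEX-style workflow in a container.
import subprocess

subprocess.run(
    [
        "docker", "run", "--rm",
        "-v", "/nex/data:/data:ro",            # shared data pool, read-only
        "nex/amazon-drought-workflow:1.0",     # hypothetical workflow image
        "--tiles", "h11v09,h12v09",            # hypothetical CLI flags
        "--years", "2000-2012",
        "--out", "/data/scratch/results",
    ],
    check=True,  # raise if the workflow exits non-zero
)
```

The point is that the container pins the code, its dependencies, and the parameters, so the same command reproduces the same results on any machine with access to the data.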
Beyond that, it's really about going from data to knowledge. We started as a small group at NASA ten years back with the Terrestrial Observation and Prediction System (TOPS), and it all boiled down to this model, which everyone follows nowadays, including commercial ventures: ground-based, airborne, and satellite climate data coming in from different sensors, and nowadays IoT data from agricultural fields. It's nothing new; it's been done for a long time. It's all about how you bring this data to a centralized platform, run models on it, and make sense of it. We had these ecosystem simulation models, old models of crop growth telling you how the crop is doing, leaf area index, things like that, and we have been running them for a long time. Back then we had a lot of problems handling different metadata types, different satellite datasets, different weather data. These are sparse datasets: a weather station can be switched off for a long stretch, so how do you extract meaningful information? What we did was build monthly composites and match them against monthly values from satellite data; then you have a single dimension, no sparsity, and you can build a model and get something out of it.

But now we're adapting to new realities: we have a lot more data, a lot of high-resolution datasets coming in, and people want insights at a much more granular level. We have been talking about agricultural yield prediction for a long time, and people say you can't really do yield from Landsat, that you can't estimate yield from NDVI, which is one of those satellite indices we calculate from the near-infrared and red bands. So what we have been thinking is: how about using very high-resolution data, taking ground data through a mobile interface where we photograph crops, and blending it with satellite data in a machine learning framework to predict yield? It sounds complicated, and you will never get the exact yield number you desire from satellite alone; that's not possible. A small sketch of the NDVI and monthly-compositing steps just mentioned follows.
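Here is a minimal sketch of those two steps: NDVI from the red and near-infrared bands, then a monthly maximum-value composite to beat sparsity and cloud contamination. The array names are illustrative: `red` and `nir` are stacks shaped (n_scenes, rows, cols), and `months` holds each scene's calendar month.

```python
import numpy as np

def ndvi(red, nir):
    """Normalized difference vegetation index; NaN where the bands sum to zero."""
    red, nir = red.astype(float), nir.astype(float)
    denom = nir + red
    with np.errstate(divide="ignore", invalid="ignore"):
        v = (nir - red) / denom
    return np.where(denom > 0, v, np.nan)

def monthly_max_composite(red, nir, months):
    """One NDVI layer per calendar month: per-pixel maximum across scenes."""
    v = ndvi(red, nir)
    return {m: np.nanmax(v[months == m], axis=0) for m in np.unique(months)}
```

A maximum-value composite is the classic choice for NDVI because clouds and shadows bias the index low, so the per-pixel maximum tends to keep the clearest observation.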
These were essentially the drivers for the NEX initiative. Researchers spend a major fraction of their time finding, ordering, waiting for, and downloading data, as I was telling you. Friends of mine in research organizations back in India still tell me, "How do you even download Landsat data? We manage maybe two scenes in a week." It's the same here in the US: if someone uses the bulk download tool, it takes forever. And moving data gets more expensive as the volumes grow each year; just this morning I had a telecon with USGS and ESA where we worried about what storage requirements we need and what bandwidth we need to open up to bring in all the Sentinel-2 data. It's a lot of data coming in. There are also no standard mechanisms for transparency and repeatability: if a paper is published in Science or Nature, nobody knows how to replicate the whole process, and that's what we're trying to fix. And, as I highlighted in yellow, culturally local access is how science is done: everyone has their own desktop machine, downloads the data, runs their own algorithms. We want to get away from that.

So this is the present state of the art for NEX. We have over 800 terabytes of data on near-line storage and almost 1.3 petabytes of total storage, and a lot of it is filling up now; we're getting new storage, and we actually have to move some data to tape because few people use it, especially the old Landsat data. We're expanding our sandbox environment, which is more of a prototyping environment: a 96-core server, which is pretty good for the prototyping phase. Once you're happy with your prototype, you can scale up to the high-performance supercomputer, where a job can access almost 100,000 cores. So we're processing 6,000 Landsat scenes in a day or so; again, whenever I say we can do all this great stuff, there are caveats. You're always waiting in a queue, and you may wait 24 hours for the allocation, but once you're in, the whole thing works out for you, and it still saves a lot of time.

We're bringing in a lot of these satellite datasets, as you can see here: model datasets, GLAS, PRISM, DayMet. We're getting all the high-resolution data from the National Agriculture Imagery Program (NAIP); we actually buy those datasets, and, oddly, we spend a lot of time transferring data from the small drives they send us, one-terabyte USB 2 drives, which takes forever. So we built, by hand, this nice stackable drive system: you slot in all the hard drives, wires come out, it syncs up the data and tells us which drive is failing and when. It still takes forever: for one year of one-meter data over the whole US, we spent probably three months just transferring data off those one-terabyte drives. If you want to order it through FTP, forget about it. We're getting all the NAIP data from 2010; it's really good, clean data, multi-spectral RGB plus IR at one meter. We're also getting a lot of the DigitalGlobe archive through our connection with NGA; NGA gives us that data for free, but we can't redistribute it outside. It's a whole constellation, WorldView-2, WorldView-1, RapidEye, everything is there, but there's no standard database structure you can use, so you have to create your own database first and then start working with the data, finding the projection files for each image and re-projecting everything. A sketch of that re-projection step follows.
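The re-projection step is standard GDAL territory; here is a sketch using rasterio, with illustrative paths and target CRS.

```python
import rasterio
from rasterio.warp import calculate_default_transform, reproject, Resampling

src_path, dst_path, dst_crs = "scene_wv2.tif", "scene_utm.tif", "EPSG:32610"

with rasterio.open(src_path) as src:
    # Compute the output grid in the target coordinate reference system.
    transform, width, height = calculate_default_transform(
        src.crs, dst_crs, src.width, src.height, *src.bounds)
    meta = src.meta.copy()
    meta.update(crs=dst_crs, transform=transform, width=width, height=height)
    with rasterio.open(dst_path, "w", **meta) as dst:
        for i in range(1, src.count + 1):      # re-project band by band
            reproject(
                source=rasterio.band(src, i),
                destination=rasterio.band(dst, i),
                src_transform=src.transform, src_crs=src.crs,
                dst_transform=transform, dst_crs=dst_crs,
                resampling=Resampling.nearest)
```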
And as you know, NEX has been releasing downscaled climate projection datasets at high resolution; we're making all of them available on the Amazon public cloud as well, which I'll talk about later.

I wanted to give some examples of the classes of NEX big data projects, and this is how we categorize them: fully distributed data processing with no inter-process data dependencies, at data sizes from hundreds of terabytes to five petabytes; data mining with some inter-process data dependencies, at 300 terabytes to two petabytes; analytics and science applications, where we have a lot of applications folks running on NEX, from irrigation to mapping fallowed land to the Forest Service, who mostly need database query systems over one to ten terabytes, everyone using Landsat; provenance and knowledge-graph queries, on the order of 100 million to 1 billion triples in 2015; and climate and ecosystem modeling, which a lot of research folks are doing. These are essentially NASA-funded projects, mostly through the NASA ROSES program, and some of the processes are computationally intensive; we're talking about 2 to 20 terabytes, not much, but we still have to support them.

Now I want to show some of the projects I have been involved with, helping them scale up. The first is machine learning for anomaly detection; I wouldn't call it much machine learning, but I put it in this context because I'm slowly moving toward that. This has been a long-running, controversial study of how the vegetation in the Amazon responded to the Amazon droughts, and there have been a lot of papers. What we wanted to do was capture the workflows in those papers and all the datasets they use, be it TRMM or MODIS, and MODIS alone has different datasets from radiation to temperature to NDVI to EVI; we capture the whole process and provide the workflow to every researcher doing the Amazon study. Everyone tried the workflow and came up with different results, which was very interesting to watch. I have seen it over time: if you take the MODIS reflectance product and MODIS NBAR, which is also a reflectance product, just BRDF-adjusted, and run the same analysis, you end up with two different spatial maps of anomaly, and then there are other complexities, so what's going on? So we made this available to all the researchers: we have the code, you can scale it to all the MODIS tiles and run it globally, and it now probably takes about half an hour to do this global analysis of MODIS; we're talking about 10 terabytes of data and a million MODIS scenes over 15 years. These are some of the drought-monitoring images. A sketch of the per-pixel anomaly computation behind this kind of analysis follows.
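For intuition, here is a sketch of the per-pixel calculation such a drought-anomaly workflow boils down to: the deviation of each month from its long-term monthly mean, in standard deviations. `stack` is an (n_months, rows, cols) array assumed to start in January; names are illustrative.

```python
import numpy as np

def standardized_anomaly(stack):
    """Z-score each pixel-month against its own calendar-month climatology."""
    out = np.empty(stack.shape, dtype=float)
    for m in range(12):
        month = stack[m::12]          # all years for this calendar month
        mean = np.nanmean(month, axis=0)
        std = np.nanstd(month, axis=0)
        out[m::12] = (month - mean) / np.where(std > 0, std, np.nan)
    return out                        # negative values = below normal
```

Run against an NDVI or EVI stack, strongly negative values in the drought years are the anomalies the papers argue about.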
Another project, funded for five years under the NASA MEaSUREs program, is creating monthly mosaics of global Landsat data, and people have been using this data for different purposes. Just creating monthly mosaics of Landsat is a big deal for me: looking back five years, I couldn't have done this without access to this kind of facility. And it doesn't require much beyond the tools and techniques you use in your day-to-day work; only the scaling part takes a bit of coding. We're actually prototyping a lot of land products from these monthly mosaics, especially LAI, FPAR, and albedo.

Going on from there, we have been involved in the North American forest disturbance project, using Landsat tiles for the whole time stack from the 1970s until now. This has been one of our biggest projects so far and involved a lot of computation: processing 96,000 scenes from 1985 to 2010. We created a disturbance product as part of this project, and it's available on NEX; if you want to get hold of this dataset, please contact Sam Goward, who can give you access. We're still sorting out how to distribute it, though, because we either have to split it by UTM zones or into MODIS-style tiles, the way WELD does. This was about detecting forest disturbance, and this slide just shows how the scenes come in. To put it in context: 4.3 billion pixels classified as forest in one hour; if this were a commercial company, that would have been a great pitch. 434 path-rows, 29 years, several scenes per year, 16 hours of wall time on 12 cores across 1,736 nodes, so almost 20,832 cores. All of it was done using our packages and custom parallel wrappers, some of them written in Python; a reduced sketch of that wrapper pattern follows.
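Reduced to its core, the wrapper pattern looks something like this: fan a per-scene function out over the available cores. On the supercomputer this sat behind one batch job per node; `classify_scene` is a stand-in for the actual per-scene classifier.

```python
from multiprocessing import Pool

def classify_scene(scene_path):
    # ...read the scene, classify its pixels, write the result to disk...
    return scene_path  # placeholder so we can report progress

if __name__ == "__main__":
    scenes = ["p044r034_1985.tif", "p044r034_1986.tif"]  # illustrative list
    with Pool() as pool:                    # one worker per core by default
        for done in pool.imap_unordered(classify_scene, scenes):
            print("finished", done)
```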
Some of my own work from two years back was creating 30-meter-resolution Landsat leaf area index maps; the dataset is still there for people to download. It was done for one or two time snapshots, just to show the potential of Landsat-based LAI. If you look at this whole right-hand region, you can see a lot of heterogeneity captured at the Landsat scale; you won't see it in the MODIS LAI imagery. That's what I wanted to showcase: unless you use high-resolution information, you cannot get anything this useful out of it.

I've also been involved with the NASA Carbon Monitoring System for the last three years. We have been working with tons of sensors to come up with the best possible estimate of biomass at the highest possible resolution, again 30 meters, which is good enough for policy makers and stakeholders, but they want to understand the uncertainty associated with each pixel estimate, so we created uncertainty maps for them. The Carbon Monitoring System is still running; it's a multi-year funded project with a lot of teams involved. My latest project, funded last year, is about doing biomass at one meter: creating a whole one-meter continental mosaic of tree cover for the US. Once you have the tree cover estimate, you can do biomass, because you have all the other metrics coming from LiDAR and so on; the biggest uncertainty is really the forest cover. A global one-meter tree cover map, I have not seen that anywhere; it's a big undertaking, and I'll come back to the complexities when we get to the machine learning applications.

Then we have a funded project from the National Climate Assessment (NCA), which we have been supporting for a long time, creating high-resolution climate model projection datasets. This is where we first collaborated with Jeff and Kevin; Maggie is here, and Nancy is now actually working with us too. We wanted to visualize these datasets in a way that makes sense: when I click a pixel on the map, it should automatically query all the time series behind it. It's a big dataset we're talking about; we have multiple projections, multiple models, multiple scenarios, but with a nice visualization platform, people can come in and make sense of the data. On top of that you can have an API-level interface to download the data: given a lat-long coordinate, it pulls the time series, and you can take that to your own system and do some analysis. These are some of the results that come out of this; this shows what temperatures look like in 2100 versus 2000.

We have also been working with the California Department of Water Resources. What they wanted was: can you map the fallowed areas for the whole of California over the last two drought years? So given all the Landsat scenes we have, we mapped for them which fields were cultivated, which were fallowed or emergent, and, since there are two seasons, which regions were summer-idle. This has been very useful information for them; they didn't want real-time or near-daily maps, just snapshots.

Going forward, we have OpenNEX, the open NASA Earth Exchange initiative. We partnered with Amazon under a no-cost Space Act Agreement where Amazon provides free storage for all of our datasets, so we have been putting a lot of the NASA Earth Exchange datasets onto Amazon S3 as part of their public data program, and all of you can download them; the DCP30 and the global downscaled climate projection data are also available on Amazon now. The core features: we had the concept of virtual labs (I think Maggie was there when we gave a lecture last year as part of our lab series), all the data coming onto S3, and on-demand computing from EC2. What we do, essentially, is create virtual machines you can spin up and run against our data pool in the public dataset program. You can do a lot of stuff; the only problem is that at some point, when you're doing a lot of processing, you have to pay your own money, though it's not much compared to what private storage and compute would cost you. A short sketch of pulling that public data from S3 follows.
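Here is a sketch of pulling that data from the AWS public data program with boto3; the bucket and prefix names are from memory and may have changed, so treat them as assumptions.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Public-data buckets allow anonymous (unsigned) access.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

resp = s3.list_objects_v2(Bucket="nasanex", Prefix="NEX-DCP30/", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# s3.download_file("nasanex", obj["Key"], "local.nc")  # then analyze locally
```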
We ran an OpenNEX challenge last year, and almost 600 people participated. It was an open challenge: build tools using the AWS stack, anything you want, their databases, their compute services, spot instances, whatever, but use our datasets to do something useful. The problem we set was to create climate resiliency tools, because it had to be aligned with the White House initiative. The whole idea was to build web apps, mobile apps, or web dashboards relating to climate resiliency; the example we gave was, what would New York look like in 2100? Will it be like Arizona? It's a twin-cities concept. We had some extremely innovative submissions. There were two phases: the first was ideation, the second was implementation. In the implementation challenge we received almost 30 entries, because you had to implement the whole thing, so fewer people participated, but all the entries were very interesting and we got some very good ideas. Then we had to justify to both Amazon and NASA why we should go forward with this model, and what we came out with is: you have all this NEX data, essentially NASA open data, some of it produced by researchers and still open; you have this compute platform; and you have big science being done. So scaling on the cloud is a must; it's really the only option right now. What are the components? Reduced cost, open science, data discovery and exploration, on-demand scalable computing (we have been using the spot instance model a lot; it's extremely interesting), and open resources that we provide as labs and tutorials, which we're trying to expand now. There's also the low barrier to entry: whenever we talk about access to high-performance computing inside NASA Ames, you hit all the security restrictions, and international people cannot participate unless they're associated with NASA projects in some form.

So the new OpenNEX features are data discovery, data visualization (a lot of that done by the GIF), and virtual labs, tutorials, and lectures. This slide shows what I call our OpenNEX stack: we have a front end, a back end, and the AWS stack underneath. Time is kind of tight, so let me fast-forward a little. This is our typical web platform for OpenNEX, and these are some of the new features; I don't know how many of you have been following the OpenNEX site, but we have been revamping the whole thing, which involves knowledge capture, usage analytics, recommendations, standard stuff nowadays.

And this is what I wanted to show: the Geospatial Innovation Facility, Maggie, Nancy, and Kevin, who initially started this, have been building this interface, a visualization platform for our OpenNEX climate projection datasets. It's pretty cool; I love the interface. You have a lot of options to play with, and because the data is so big and the query so hard, even getting under 20 seconds to query the data for each pixel is a lot for us; there's some cool stuff going on in the back end. It basically spits out the full time series for a given area of interest, and you can move back and forth from the present until 2100. At some point GIF is going to release it, probably on their website, and we're also going to integrate it into our OpenNEX site; that's going to happen very soon. A hypothetical sketch of that per-pixel time-series query follows.
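A per-pixel query like the viewer's might look like the following; the endpoint, parameter names, and response shape are entirely invented for illustration.

```python
import requests

resp = requests.get(
    "https://opennex.example.org/api/v1/timeseries",  # hypothetical endpoint
    params={
        "lat": 37.87, "lon": -122.27,   # Berkeley
        "variable": "tasmax",            # daily max temperature
        "scenario": "rcp85",
        "start": 2000, "end": 2100,
    },
    timeout=30,
)
resp.raise_for_status()
series = resp.json()  # e.g. {"dates": [...], "values": [...]}
```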
And this is our other partner platform, Planet OS, where we're doing a lot of search and discovery, which is what we had been lacking for the last four years. We have all these datasets, but people come in and say: given a region of interest and a time span, I want to just use this dataset and feed it to my code, and that should happen automatically every time. So we created this interface: given a variable of choice, say temperature, it spits out all the matching datasets; given a region of interest, it spits out whatever datasets are available on NEX, creates a nice JSON file, and feeds it to the program running behind the scenes, so you don't have to do it manually anymore.

Discovery also needs to be fast. Filtering the MODIS datasets is an interesting case we handle right now in our discovery platform: a MODIS dataset carries QA flags for aerosols, clouds, cloud shadows, and so on, and you want to filter out all the pixels with high aerosols, high clouds, or high cloud shadows. How do you do that? We actually pre-compute all of it up front, so for each pixel it's sitting in a SciDB-style array database; then you run the query and it builds the filtered dataset for you to use. A sketch of that QA bit-flag filtering follows.
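Here is a sketch of that bit-flag filtering. The bit positions follow the MOD09 state-QA convention as I recall it (bits 0-1 cloud state, bit 2 cloud shadow, bits 6-7 aerosol quantity), but check the product user guide before trusting them.

```python
import numpy as np

def usable_mask(state_qa):
    """True where a pixel is clear of cloud, shadow, and high aerosol."""
    cloud   = state_qa & 0b11          # bits 0-1: 0 = clear
    shadow  = (state_qa >> 2) & 0b1    # bit 2: 1 = cloud shadow
    aerosol = (state_qa >> 6) & 0b11   # bits 6-7: 3 = high aerosol
    return (cloud == 0) & (shadow == 0) & (aerosol < 0b11)

qa = np.array([0b00000000, 0b11000101], dtype=np.uint16)
print(usable_mask(qa))  # [ True False ]
```

Pre-computing this mask once per tile and storing it next to the data is what makes the interactive query fast.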
Now I would like to show some of the machine learning work we have been doing. I have probably 15 minutes, so I'll skim the basics of machine learning, which every one of you knows. What we have been trying to do is derive tree cover from the one-meter multi-spectral NAIP data. As I was telling you, it's a beautiful dataset, but you still have to address the scalability part: we're talking about 60 terabytes of imagery. Nothing huge, but the granularity is high: about 330,000 scenes, each around 7,000 by 7,000 in matrix dimension. It is big data. We have now scaled the approach to three other states; we don't have those results yet, but for California it takes us 48 hours to run 11,000 scenes, mosaic them, and create the tree cover. And tree cover delineation is a hard problem: you're dealing with a lot of heterogeneity, mixed-pixel problems, questions about which multi-spectral bands and which features to use, whether those features actually help, whether they're complementary, or whether there's a better way altogether. So we created a deep-learning-based image classification technique for this. The processing architecture on the supercomputer side feeds the images in parallel to all the cores, split into small tiles, with three parallel routines running: segmentation, classification, and feature extraction. The classification is the deep learning component, and it is split into an unsupervised and a supervised part.

We also ran this on Amazon; this is roughly what it looks like on the Amazon side. It's complicated, and we can discuss it offline, but essentially we leverage a lot of spot instances, which means almost 70 percent savings for us. I won't go into the system requirements; let's go to the algorithm. We need segmentation in addition to the classification, that's for sure. These are some of the features we have been using from the NAIP data; we can discuss why, but we have a full feature-ranking process, and many of them are features remote sensing has relied on for a long time, so why not use them?

Overall, we use a restricted Boltzmann machine (RBM) for the unsupervised part of the deep learning component, and then a supervised component that takes its initial weights from the RBM network; it's a stacked RBM model. Why is this important? We need this unsupervised architecture to create those weights because we have limited training data, which has been a problem for a long time, and it gets at the problem of classifying trees globally: for the whole US you can't just go on manually collecting training samples, so you need representative samples, and if you have too many samples you end up over- or under-estimating, so you need a trade-off. This is where we found the algorithm outscores traditional classification algorithms: the unsupervised stage creates high-level features out of the input features we feed the network, and those features converge much faster. Again, I can go offline on these topics. Exact training of an RBM is maximum likelihood learning; we use contrastive divergence, a faster alternative, and I won't go into the details. So in the training phase, the learning module, each layer of the deep belief network (DBN) is trained unsupervised using contrastive divergence with repeated Gibbs sampling (we can go through all the terminology after the lecture); that initializes the weights of the network from the RBMs, and then supervised learning fine-tunes on our own set of representative training samples, which we found performs really well when we scale to other states. A minimal sketch of one contrastive-divergence update follows.
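For the terminology-averse, here is a minimal numpy sketch of one CD-1 update for a binary RBM, the building block described above; real training adds minibatching, momentum, and many epochs.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.01):
    """One contrastive-divergence step. v0: (batch, n_vis); W: (n_vis, n_hid)."""
    # Positive phase: hidden activations driven by the data.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample hidden states
    # One Gibbs step: reconstruct visibles, re-infer hiddens.
    pv1 = sigmoid(h0 @ W.T + b)
    ph1 = sigmoid(pv1 @ W + c)
    # Gradient approximation: data statistics minus reconstruction statistics.
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b += lr * (v0 - pv1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)

v = (rng.random((32, 6)) < 0.5).astype(float)   # toy binary "pixels"
W = 0.01 * rng.standard_normal((6, 4))
b, c = np.zeros(6), np.zeros(4)
cd1_update(v, W, b, c)
```

Stacking these layers, each trained on the previous layer's hidden activations, gives the DBN whose weights seed the supervised network.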
When we talk about accuracies, even reaching 80% accuracy is a lot, and although people claim higher accuracies, I don't believe them. Take the NLCD, which has done a really good job classifying land cover classes for the US at 30 meters: if you just take the urban landscape, you'll find the trees are missing. You cannot classify them at 30 meters; look at any urban landscape and there are essentially no trees at the NLCD scale. Now aggregate all the Landsat pixels for the whole US with all the urban trees missing from that land cover, and imagine the uncertainty that comes out of it, especially for biomass and carbon mapping.

Let me show some of the results. We also used structured prediction with conditional random fields; again, we can go into details offline. Experimental results: what matters most for us is that for the whole of California, the overall true positive rate is 82.59% and the false positive rate is 1.73%, and the false positive rate is the more interesting number. Compared with the NLCD, which ideally we shouldn't compare against because it's a different scale, but it is the state of the art: true positive rates of 72% versus 87%, and look at the false positives, 50% for NLCD versus 1.9% for us. We're also using other models; we recently came out with a paper called DeepSat, which involves a feature-enhanced DBN model compared against traditional DBNs and convolutional neural networks. I won't go into details, but on the datasets we constructed from the NAIP data, our results are almost 97% accurate for these two data types, versus 79% for stacked autoencoders and 86% for convolutional neural networks, so that's a big shift, a 10 to 11% change. Here are some results over the San Francisco Bay Area, and this is exactly the example I wanted to show: this area here is highly granular, and in NLCD it comes up completely white, no trees. That is exactly what we're interested in: if you're doing carbon monitoring and you don't have the estimate from these regions, you're missing a lot, and the same holds for urban heat island applications. That was a smaller example; let's go to the big one: the tree cover mosaic we did for the whole of California. It looks pretty good. We had some misclassifications over water bodies, but we figured out why that was happening, and it was interesting; you don't want to distribute a dataset and have people finding trees in the ocean, so we said, okay, it's chlorophyll.

We're also doing yield prediction, a more recent exercise using a deep belief network, again a hybrid model. I'm running short of time, but we can discuss it offline; I just wanted to show some of the yield predictions we have been doing using Landsat and ground-based datasets, and this one uses very high-resolution data. Among other applications, we have done some rooftop detection with our algorithms, just to see what it takes. And this is the most interesting part: we're trying to segment fields from these datasets automatically. Given the segmented fields, you could hand them to the farmer and ask, does this look right? And he can refine the boundaries. This has actually been the hardest problem we've dealt with so far, just segmenting field boundaries; I won't go into it further.

To summarize: NEX lowers the barrier to entry and allows knowledge sharing; earth science is big data; we have a lot of machine learning applications, intelligent information, prediction, estimation; machine learning is great, but understanding the data is critical, and that's where the domain expertise comes in; and we need to scale and generalize learning models so that they're applicable to different domains. That's about it for my talk, and I would welcome any questions. Thank you.