Live from the Fairmont Hotel in San Jose, California, it's theCUBE at Big Data SV 2015.
Welcome back everybody, this is theCUBE. We are live at Big Data SV 2015 in San Jose. I'm Jeff Kelly with Wikibon. I'm here with my co-host, Jeff Frick. We're joined in this segment by John Cardente, who's a distinguished engineer with the corporate CTO office at EMC. John, welcome to theCUBE.
Thank you, thank you for having me.
So we've been covering Big Data Week here on Big Data SV. Obviously we've got Strata Hadoop World going on. What are your impressions of the week?
It's great, Strata is always a great conference. It's great to see sort of the mix of some of the academic stuff, the research stuff and some of the commercial stuff. You get to see how the two worlds are really coming together and how things are starting to transition from one to the other.
Yeah, absolutely. And one of the themes we've been covering this week, and we've been hearing over and over, is that the conversation is definitely shifting from the infrastructure tools and technology to some of the use cases, some of the applications that are going to be able to be built on top of the great technology that's been innovated. And that's what we're going to talk to you about today: a really interesting project you're working on around essentially a climate data lake. Why don't you tell us about the project?
So our corporate sustainability office has been interested for a long time in finding a project where we can bring our technologies to bear on something meaningful for the climate. And so earlier in the year, they started talking to a nonprofit called the Earthwatch Institute that's based out of Cambridge, Massachusetts that was on to this idea of investigating this particular area of science. And so, you know, climate change is happening, right?
Regardless of what you think the reasons are, there's a lot of evidence to say that climate change is happening. But what most people don't realize is that climate and weather are nature's clock. A lot of things in nature happen based on weather patterns. And so what's starting to happen as the world warms and the seasons shrink is that you have things like trees budding and fruits flowering and insects emerging earlier, whereas migratory birds are still on the same schedule. And so what we're finding, what the scientists are finding, is that these birds don't have a food supply to power them through their migration. So it's starting to affect the populations. And so there's this hypothesis and some evidence that it's happening, but the data sets required to really dig into this, to really understand how big of a problem this is, are all separated. They're all being collected and maintained by different nonprofits. And so with Earthwatch, Acadia National Park, the National Phenology Network, and the Schoodic Institute, we're bringing all this stuff together in one data lake to start to enable scientists to collaborate more meaningfully.
So we'll talk a little bit about the concept of the data lake. We hear that a lot on theCUBE and elsewhere, but this is really going to be an example of applying that concept to a real world problem.
So I like to think of a data lake as more of an analytics environment. And I think simply focusing on a particular layer in the stack doesn't solve the full problem. And so at EMC and the corporate CTO office, we take the view of what are people trying to do with data? What's the environment that they need to fulfill their goal? And let's start from there.
And so in this case, we're not only bringing in some of our storage technologies, but also all of our Pivotal technologies and our Cloud Foundry technologies and stuff to really build, in the long term, a platform that enables this thing called citizen science, which I refer to as the Hadoop for phenology. And phenology is the science of timings in nature, because it requires people to go out and observe in the field. And the only way this scales is to engage citizens to go out and do it for us. And so that's really what we're trying to do: not only enable scientists to settle in on particular data sets, to set aside any debate on the data and not the science, but also to engage the citizen scientists so that we get more data, yeah.
Kind of bringing together the crowdsourcing concept with big data.
Exactly. Exactly.
And kind of the data lake as almost a data OS, right? Because of what you're going to do, it's really not just this static thing that's sitting over there. It's actually kind of a part of the whole process.
Exactly. So I almost shy away from the term now because I just like thinking about the whole picture, right? I don't want to get distracted with any preconceived notions of what data lakes are or those kinds of things. So thinking about the whole environment really, and what we're trying to do there is a full stack.
Right. And then it just seems also interesting too as you're collecting all this data. You said a lot of it's human generated.
Absolutely. It's the citizens that are distributed.
But you know, we're all packing these things too. And as the sensor data gets better and better and better in these, you know, in turn, including climate data, it seems like the opportunity to get kind of pinpoint data points.
Absolutely.
And so at a high level of detail on a number of factors, it's just going to go up exponentially.
So we did a one-week expedition in Acadia National Park in October where we brought, you know, folks from EMC, folks from Schoodic and some of the research scientists all together, not only to get a better understanding and sync up on what the science was and what the problem was; we also did our own citizen science exercises with Earthwatch, birds and stuff. And so those of us from EMC, you know, we were like, well, this is great. We'll just make a mobile app. You'll be able to plug it in and we're good to go. And what we're finding is that a big proportion of the citizen scientist population is either elderly or people who don't have access to technology. And so they're still very paper-driven. And so, you know, we're actively trying to work with all the organizations to figure out how we can accommodate those folks but also start using some technologies to make this a little bit more efficient, but also more engaging in the sense that we can give people alerts to say, hey, that data that you contributed, it just showed up in this research over here. So, you know, here's a thank you. Thank you for that. Right, an appreciation, but also some alerts to say, we think that something interesting is going to happen. So if you have the time, can you go out to this place and look for some migratory birds and stuff? So we think there's a lot of value in that, but we have to get over the, you know, the lack of technology in some areas.
And are there some real specific applications that are driving this forward? And then there'll be other applications that come out of it. You talked about the specific bird example and migratory birds and their food sources and things blooming. I mean, are there a few of those where some really interested party said, these are the things we really want to dive into, or did that kind of come out of a more general purpose approach and was something that surfaced?
There's so many, it seems like the interconnections are so vast.
Well, in a way it's a win-win, I think, for everyone, in that here's an area of very important science that needed some help to really get a handle on the problem. But I look at it as a great use case for at-scale collaborative analytics, right? So if we really focus on engaging the wider citizen science population, both giving them tools to visualize and explore the data sets, but also eventually maybe do their own modeling and engage with the scientists, that model of collaborative analytics is broadly applicable, right? So we can bring that back to the enterprise and we can start to enable enterprise customers to do some very sophisticated things. And so, not only are we motivated to help out with the environment, and 100% of our customers live on this planet, so it's good to take care of it.
I appreciate it. Until we colonize the moon.
Yeah, until we colonize the moon, that's right. But it also gives us a great proving ground to develop some technologies that we can then bring back to product.
I was just gonna say, and then what about historical data? Historical weather data, historical public or government data sets. Are you incorporating that?
Absolutely.
What's the scale of this type of data set?
So right now we're taking basically one-time dumps from a lot of different data repositories. Some of it is from the NCDC, the National Center for Climate Stuff, and then we're getting dumps from some of the nonprofits that have been collecting this observational data. But over time, what we're working with all these partners on is a way to make connectors and feeds so that we can get incremental updates and start to act as a clearinghouse for some of this stuff. And then maybe also a hub of sorts for researchers to share data through the singular repository. And so over time, the data's gonna grow. Hopefully it gets more dynamic and more real time.
Right, right.
Do you see this as a potential model for addressing other big societal challenges? Generally, and specifically for EMC, is this something that's a priority for EMC, to donate your time, your expertise, your technology to tackle some of these issues?
Yeah, I think you can. And that's one of the motivations: if we're successful, to broaden this out to support other initiatives like this that are for societal good. And so we're really excited about this project and we're really eager to build something and do something meaningful, and we're hopeful that we can expand it to cover more things. Because I think it's, again, a win-win of sorts for everybody, right? It gives us a good demonstration vehicle for our technologies and what we can do. It's a proving ground for us to develop some new innovative things, but also, we're giving back to the community. So it is very much a strong interest of many of the leaders in EMC.
Yeah, well, congratulations on the project. I applaud EMC for taking these steps. I think it's really important. And it really shows what big data can do. It's not just about the technology, it's not just about some of the more commercial aspects, which are very important, but it really can tackle some of these bigger challenges.
Absolutely.
John Cardente from EMC. Thanks so much for joining us on theCUBE. We appreciate it. Thanks for watching. We'll be right back with our next segment after this.