Live from San Jose, California, in the heart of Silicon Valley, it's theCUBE. Covering Hadoop Summit 2016, brought to you by Hortonworks. Now, here are your hosts, John Furrier and George Gilbert.

Okay, welcome back everyone. We are live here in Silicon Valley in San Jose for Hadoop Summit 2016. This is SiliconANGLE Media's flagship program, theCUBE, where we go out to the events and extract the signal from the noise. I'm John Furrier with my co-host, George Gilbert, SiliconANGLE Wikibon analyst on big data. Our next guests are Kenneth Buetow and Jay Etchings, Director of Computational Science and Director of Research Computing at Arizona State. You guys are doing some great stuff here. First of all, let's talk about your talks here at Hadoop Summit, because this is the value of the data. You've got the perfect storm of compute in the cloud, lower cost; Moore's Law has kicked in for years. Where's it going next? What are you guys talking about here?

To me, it's all about how this platform is transforming the work we do. It's actually allowing us to ask and answer questions that we simply wouldn't be able to approach in the absence of these kinds of capabilities. I've been in the field for a little while, as you might suspect from my literally gray beard, but we're actually at an inflection point where, by leveraging these computational tools, we really can do just that.

What examples did you use on stage? Share with the audience, take a minute to share some of the examples of the work you're doing that wasn't possible before.

The key thing that we're able to do is fully explore this multi-dimensional space: all of the different components of the molecular characteristics, the wet sequencing of individuals, and connect them in ways that haven't been done before.
So we were talking on stage about how we were doing this in breast cancer, trying to better understand how the cancer is fundamentally different from normal tissue, and being able to find patterns. Again, similar to what people are doing in e-commerce and in all sorts of other dimensions, we're actually trying to bring that down to biomedicine.

And Jay, the machine learning involved in all this stuff, what's the technology? Is it just blocking and tackling, just raw compute meets data?

So I'll say, typically in research institutions like Arizona State University, there are a lot of disparate silos of compute that perform very unique and specialized tasks. Bringing in the data-intensive environment and bringing in the Hadoop ecosystem has allowed us to bring those elements together and optimize them. So there are three pieces there: the ability to ask new questions and do new research; the ability to reduce time to research by taking legacy jobs that used to run in silos and optimizing them; and then, as Ken said, the ability to mine non-obvious relationships, through machine learning, from the large silos of data that we've now aggregated.

So how did you guys get here? I mean, this is obviously the work you're doing, but from a technology standpoint, did the truck just back up with boatloads of Hadoop clusters? Share with us inside the ropes of what you guys are doing.

So Ken and I started working together, it's got to be three, four years ago now.
We were working on a specific project around big data precision medicine in the cloud and how we would extend it out, and shortly after, I accepted an appointment as the director of research computing. We went on to take a plan that had been theorized back in 2012 called the NGCC, the Next Generation Cyber Capabilities: a single unified fabric that has the components of computational compute within it, Hadoop and some big data capacity as well, and also things like GPU compute, putting them all on a singular fabric. So having started four years ago and built this at the university, it's only been over the past year and a half, two years that we've really started to mine some gold from that infrastructure.

We're here quite intentionally. This isn't something that just fell off the truck and we put it in place. Credit where credit is due: ASU leadership recognized that we want to be the New American University, and that to support our design aspirations of transdisciplinary research, the capacity to look at problems in ways that no other groups do, we were going to have to have leading-edge technology. We were going to have to have the right tool set to do it. So we were charged with saying: if you wanted to approach these complex problems in the 21st century, what would be the infrastructure that you would need?

And you went about creating it. And your guiding principle is that obviously the compute's key, and having the data. Did you feel from the beginning that the answer was in the data, that there could be gold in those hills? Was that the original idea?

Certainly. There are many large-scale studies and many large-scale data repositories that have a potential goldmine of information hidden within.
However, none of those interoperate with one another, and they exist as silos. So the approach of what we're doing at Arizona State University is to bring them together and essentially shine a light on all this data, and not only reduce time to research but find interesting facts that would be exciting.

So two questions, somewhat related. Are you able to bring those silos of data together after the fact? And would you compare this ability to generate data on your observations to a new version of a microscope or a telescope? What are the new things you can see?

Funny you should say that. One of the metaphors we use for the NGCC is that it is the next-generation telescope, the next-generation observatory, and like many other fields that are transformed by having these new instruments, we think that having these data examination capabilities is just as powerful. So to the original question: we for sure think the answer is in the data, but number two, we think that figuring out novel ways to parse this stuff apart is going to be what's critical to actually answering these questions.

I love that analogy, because the observation space has to be set up in order to observe. So if you have the silo problem, and healthcare has been living this problem for a freaking decade, I mean it's a nightmare, right? So how do you open it up? Have you had resistance? Or was it like, yeah, you guys are heroes, is there a parade for you every day?

So the challenge is still very real in biomedicine, and I want to be absolutely crisp that there is, sadly, no magic in how stuff comes together. The big data frameworks give us a platform that reduces friction as much as is practical, but assembling heterogeneous data still has all of the complexity and hard work of figuring out how to interconnect things.

So there's brute force involved.
There still is real human effort in decoding and recoding, but again, with new machine learning and the analytics for discovering latent semantic patterns and all sorts of other things, we can bring powerful tools to find things that might be able to be connected. But I don't want to discount the sweat equity.

To follow on to that, what I would say is that with the innovation the Hadoop ecosystem has brought to the university, it has reduced the barrier to entry to ingest these diverse data types, and it has become less of a technical challenge than it was before and much more of a collective action problem, because there are still institutions out there developing their one-off proprietary databases that would love you to send them all your metadata, all your genomic data, and I just don't think that's the way of the future. So we maintain a copy of 1000 Genomes and The Cancer Genome Atlas, plus we have a wide variety of research data within our institution that's now available to all the researchers at the university.

And you need access to more diverse data types. It could be everything from weather data to the geography of the patient, all kinds of weird observational data points, to have visibility, because you can't look at something with no data. So I've got to ask you the question. There is some brute force, and yes, some of the technical stuff is getting better. What would you guys say to the audience watching? Because this is so fascinating, and it's a real-life example of the value to society of this kind of platform and the way you guys are thinking about it. What's exciting you right now about some of the advances going on in the technology space? Because it's only going to get better, and as you see the acceleration of new things, what gets you excited right now, and where are the problem areas where it's, come on, go faster?
We actually might have a little bit of a different answer here, but I would say the most exciting component I've seen, even over the past year, is the emphasis both at the federal level and at the research institution level on open source. In the past, if we were to develop something on a particular platform and I wanted to share it with colleagues or peers at another institution, if they didn't have whatever that infrastructure component was, they were unable to participate without paying some sort of entry fee. Now we can develop these things and turn them back over to the community. A great example is the genomic pipelines we're working on that are open source, because the community can do it far better than just our individual group.

And far be it from me to ever speak against open source, so I couldn't be in more violent agreement.

Plus one, retweet.

Yeah, exactly. But to me, technologically, and it actually connects to the question that was asked just a minute ago, the thing I'm seeing developed here that I think is still really exciting and needs to go faster is this whole framework around metadata. I mean, the good news is that we can expeditiously bring huge volumes of data into common electronic frameworks where we can get access to it. But especially in our universe, where the data is so heterogeneous and represented in multiple forms, it's knowing data about the data. Some of it is healthcare data, so you need to know data provenance, you need to know who has access controls. So data about data, and I know it's not a particularly sexy thing because it doesn't show up on cute curves and other things like that.

Yeah, no, you've got to have addressability of data. You've got to move data around, have it be addressable. You want to tune the telescope in, you've got to maybe pull in some contextual data that focuses in on the problem, right?
I mean, I know we're up against the clock here, but I'll ask one final question. For the folks watching who might be in the enterprise space: you guys are doing some cutting-edge research, you've got access to resources, the big brains at Arizona State University, so there's a lot of great stuff happening. What one thing would you say to them that's been flipped upside down? In any new paradigm, there's always one thing; it could be architectural, it could be business. What architectural thing would you say has been flipped upside down, and what kind of business logic has been flipped?

I'll try to say this quickly, because there are granular pieces that you see continue to change in the ecosystem that are exciting, but I think the one most exciting piece is the redefinition of the scientific model. We used to come up with a hypothesis, go out and collect data, and then run tests against it. Now all the data exists, and as Ken said, we need to figure out what questions we can ask and what questions are even askable, right?

That's exciting stuff.

Yeah, I would actually say the advent of data science as a full partner in the scientific enterprise. I mean, we talk about the development of computational fields within each discipline, but what I think is actually interesting is that data science as a mature domain in and of itself now has almost, not quite yet, but almost an equal seat at the table. When we talk about the great fourth paradigm of science, we're now literally doing work around data, as opposed to generating it; the data actually is the science itself. Data is the code, data is the development environment.

Guys, thanks so much for sharing your story, one, at Hadoop Summit for the whole crowd here, but also coming on theCUBE and sharing it with us. We're like a telescope, getting the data here from the experts.
Thanks so much for sharing your story. It's a great story, a real-life example of how work is being done, certainly with breast cancer and other cancer research, just an amazing application that changes the game. Again, thank you. You're watching theCUBE. I'm John Furrier with George Gilbert. We'll be right back with more CUBE coverage from SiliconANGLE Media after this short break.