theCUBE at Hadoop Summit 2014 is brought to you by anchor sponsor Hortonworks. We do Hadoop. And headline sponsor WANdisco. We make Hadoop invincible.

Hey, welcome back, everyone. We're here live in Silicon Valley, in San Jose, for Hadoop Summit 2014. I'm John Furrier, the founder of SiliconANGLE.com, here with leading industry analyst Jeff Kelly from Wikibon.org. This is theCUBE, our flagship program. We go out to the events and extract the signal from the noise. Our next guest is Mark Lowerson, Director of Research and Academics at the University of Calgary. Welcome to theCUBE.

Thank you.

So you're a show horse here at the event. We call someone a show horse when they get trotted out to the panels and around to the analysts, really as an example of some of the use cases. Jeff and I were talking prior to coming on. Share with the folks why you're here, and we'll talk about some of the cool things you're doing.

Like I said, we're here because Hortonworks, in this particular scenario, likes our story. They want to have us come down and tell people how we're using their tools to support research at the University of Calgary. I guess that's the reason we're here, to help them out. And also, this is the place to come. I drag a whole bunch of students and allies with me to see what's happening in other industries. We work in a very siloed place in health research, but the reality is that the problems we need to solve are being solved by other members of the community.

One of the things I really love about this Hadoop ecosystem is that it's a blend of computer science and data geeks, which essentially means you're either in operating systems or databases or somewhere in between. But more importantly, you're seeing the spectrum from academia to industry really integrating, right? And so you've got a lot of new students. So I'd ask you the question: the young guns coming up through the system, what are they like? Are they deer in the headlights?
Do they just have blind ambition? Do they jump before they ask? What are some of the characteristics of the talent coming up?

Honestly, I would argue that there's not a lot of talent coming up that knows what they're getting into. It's not that there aren't lots of keen and talented people out there, for sure. But universities are big, slow-moving beasts, and they don't necessarily have data science schools in place, you know? They have molecular biology schools and genetics schools and statistics schools and computer science programs, and it's the people who accidentally traverse those schools that become the good candidates for working in our industry anyway. Deer in the headlights? No. You can't be a deer in the headlights if you don't know what's coming, right? You don't see the headlights yet.

So you have some talent coming in, and they have no idea what they're coming into, but young kids tend to think like that. What about the skill sets? Are you seeing any patterns around certain skill sets?

Yeah, we struggle to identify people who can get comfortable with the idea that you can't look at your data. The data is so voluminous that you can't just scroll through it, as it were. There's a big pool of computer science and statistical training that produces what we call traditional analysts, traditional statistical analysts. They're smart people, good at math, good at whatever statistical suite they operate on, but they have never dealt with data so large that the only way to understand its features is through analysis, as opposed to simply looking at the data. So that's an interesting challenge for people coming right out of school. The big thing we're seeing a lot of is that we're in possession of real data that's complex and big enough to be called big.
And our computer science colleagues are finally turning around and going, oh, wait a minute, we've got these people we've been trying for years and years to give real-world use cases. We can actually hook up with you guys and do something meaningful for these students. So the generation of students being trained right now will probably be the first generation, in our region at the very least, who have computer science, have some health background, and have had the opportunity to work on the two things simultaneously, which I don't think has happened a lot yet.

So let's level-set a little bit on the academic research field. Explain, if you could, what some of the old approaches to data analysis were, and, now that you're here at the summit, how new technologies such as Hadoop have changed the way you go about collecting and analyzing data. Is there a mindset shift that's required as well? How has it shifted with this new paradigm?

Yeah, absolutely, absolutely it's shifted. Things that simply weren't feasible before are now feasible. We had a good example: somebody shows up in our office with a big electronic medical record data set with billions of observations. They've gone and paid a whole bunch of money in licensing and things like that to bring that data in, and they show up with the expectation that they can throw it into SAS on a good workstation and proceed to produce meaningful research results. The reality is that didn't work. Nothing against that particular platform; it's just not designed to deal with that scale of data on somebody's desktop. As a result of that kind of failure, we went out and procured a Hadoop cluster, and the day we were up and running, those people's work went right back to the way it was before the data outgrew their tools.
They now use that cluster as their source of analytic computational horsepower, and there are decent tools in the ecosystem now. There weren't a year and a half ago, but a good pool of decent tools is emerging, and those tools are letting us let those analysts be analysts again, as opposed to having to struggle with access paradigms different from what they had before.

So it really gives them benefits similar to what we see in the enterprise: it allows you to analyze all the data, versus being restricted to what you can fit on your laptop.

Yeah, absolutely, and I think the power that comes with the ability to look at all the data at once is the new generation of analytics we're talking about. We're able to write research proposals that talk about problems at a population scale, as opposed to problems in a subset of the population that we carefully selected and curated to explore the problem.

So previously you'd have to pull out a subset and then try to extrapolate, based on the results of the analysis, how that might apply to a larger population.

Yeah, or even bigger: we simply didn't have data on entire populations in the beginning. Over time we've assembled, through tertiary access and external access, data sets that let us discuss things at population scale, whereas previously, without the technical capacity to do so, there was no impetus to go and assemble a big data set like that.

We were talking briefly before we went on about how big data, and not just big data technologies like Hadoop, but this whole cultural shift where we're all wearing Fitbits and there's so much data collection going on in the real world, has impacted the research process. Talk a little bit about that; that was very interesting.

Yeah, absolutely.
The reality is we used to sit down and dream up hypotheses, write them into grants, and fire those grants off to committees, and the committees would say, this idea has no merit, bugger off, come back when you've got a better idea. That process worked reasonably well, and it worked for a long time. But nowadays we have the ability, with Fitbit data, for example, or tweets or what have you, to go out and explore the characteristics of the populations we're thinking of studying before we ever go and do it. A good example: can we start emailing patients surveys to fill out? Do we have information about that population that says 80% of those people will respond in a sensible fashion if you just send the survey over Facebook or Twitter, as opposed to the old, more traditional model: you come in, you get a clipboard, you sit down, you answer your survey, that survey gets put in a pile, and somebody enters it into a database?

Patient-reported outcomes in healthcare, that's a buzzword, but it actually provides us with a whole other avenue to explore the conditions we're interested in. Without inconveniencing patients, or subjects, or whatever the right term is, you can just say, here, wear this bracelet. And if you wear that bracelet, I'm going to know how active you are, and I'm not going to have to ask you, when you show up at my office for your diabetes checkup, have you been getting enough exercise? Because guess what, I know about your exercise. And there's an economy of little applications to help people quantify their own health and their own activities, and with the right plumbing, those applications can feed the doctors. So we're working on a project right now with an epilepsy research group.
They want to give a smartphone app to their epileptic patients where they can just click a button that says, I had a seizure; it asks them three or four questions, they hit save, and they carry on. That button gives us the ability to proactively book appointments for people based on changes in the data. It gives us the ability to reach out to patients when scenarios come up that we wouldn't have seen before. Historically, your epilepsy doctor would get your seizure calendar from you when you arrived, which was a piece of paper where you put a little note on every day you had a seizure, and he'd put it on the table beside the one from last month and go, oh, there's less blue ink, so things are going well. Nowadays, you can say everything about the frequency, everything about the rates of change, and actually use math to predict outcomes. This is a new era, for sure.

Right, so it sounds like it's really lowering the barriers to collecting data, whether by making it easier to collect or making it less expensive, so you don't necessarily have to go through that grant process, at least in the beginning.

Yeah, so at the moment, we're not quite there. We're not quite at the point where we can just do research off our own backs, because the reality is that technology, while it's helping and it's reducing manpower, has costs. But what we're able to do that I think is the biggest value-add of the whole thing, the fact that we have big repositories of health data on people, is that before we submit an idea for a scrutinized, regulatory-compliant research undertaking, we can go out to population-scale datasets that may already exist and investigate that idea.
And then perhaps either dismiss it as not relevant (we don't have enough statistical power, there's no way we're going to pull it off with the study parameters we might design), or gain validation that the idea is sound and perhaps change the model of what you're going to do. When you approach somebody for support of that idea, you have facts in hand, as opposed to well-informed hypotheses based on clinical expertise rather than data, right? So data-driven research program design is a real thing nowadays. We're actually able to determine what the next step in the research process is based on the resources we have available.

So you talked about that hypothesis-driven approach to problem solving, but one thing we hear a lot about in the world of Hadoop is that it allows you to do more exploratory analytics, where maybe you don't even know the question you're trying to answer; you're just looking through data to see if there's something interesting you can find. It sounds like that's a little bit not the way research generally works. Are you able to adopt some of that type of analytic work in addition to the more rigorous types?

Absolutely. I raise that point as a concern only because it is a concern. We have the computational horsepower nowadays to just say, give me every possible correlation, every possible test, every possible combination of whatever data set is in hand, and you can do that in an afternoon. Give me another massive data set and I'll tell you everything about it two days from now. You can drag and drop; you can hand this work off to people who are not methodologically trained or prepared to do anything in a repeatable or reliable fashion. Nobody in the room is saying that couldn't lead to some big discovery; that opportunity is really legitimate. Nobody's challenging that fact.
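As a sketch of the risk being described here: with enough variables, a blanket "test every correlation" pass will flag apparent relationships even in pure noise. This toy illustration (random data, not any real health data set; the significance threshold is the standard two-tailed cutoff for this sample size) shows roughly 5% of the tests coming back "significant" by chance alone:

```python
import numpy as np

rng = np.random.default_rng(42)

# Purely random data: 1,000 "patients", 50 completely unrelated variables.
data = rng.normal(size=(1000, 50))
n_vars = data.shape[1]

# Test every pairwise correlation: 50 * 49 / 2 = 1,225 tests.
spurious = 0
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        r = np.corrcoef(data[:, i], data[:, j])[0, 1]
        # For n = 1000, |r| > 0.062 corresponds to p < 0.05 (two-tailed).
        if abs(r) > 0.062:
            spurious += 1

# Typically around 5% of the 1,225 tests look "significant" in pure noise.
print(spurious)
```

This is the multiple-comparisons problem he is gesturing at: mass exploratory testing manufactures correlations, which is why the group keeps a segregated data set for hypothesis-driven work.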
But what comes along with it is that if you spend a whole bunch of time looking for correlations, you're going to find correlations, and the ability to come back to a biological explanation of why those correlations are there is compromised when you already know they're there, right? Whereas if you sit back and think of scenarios for why a correlation might be there before you test for it, you're not introducing bias into your likelihood of discovering it. Now, with that in mind, we do have mechanisms around that. We do our exploratory work on a segregated data set that is not part of the data set you would use if you were going to do a hypothesis-driven undertaking. The community of health researchers is, by design and by training, a very methodologically rigorous group of people, for good reason. We don't want drugs that come back and are not what we thought they ought to be. The rise of the printing press did not cause the bubonic plague, although there's a correlation that says it did, right? But the idea that we should have some methodological rigor in what we're doing isn't negated by the fact that we can point-and-click our way to correlations across big fields of variables. We just need to be careful about how we do it.

Right, so you've got to apply that rigorous approach even as you bring in new styles of analytics. So I'm curious about the actual hardware. You're using Hadoop, and I imagine you probably have some pretty big clusters. How do you go about provisioning all that? You were in the analyst event yesterday, I was there, and you mentioned a little bit about using some shared resources among organizations. Talk a little bit about that.
Yeah, so again, as a university, we tend to be partners in pan-academic undertakings, and there's an organization in Canada called Compute Canada, which is a conglomeration of research universities, each of which has a high-performance computing department. The University of Calgary, by all standards, would probably be a moderate-sized high-performance computing undertaking, but a fairly proficient one. They do a lot of astronomy, a lot of genomics, a lot of proteomics on traditional MPI, 10,000-node compute stuff. What happened is that our group, through a proactive IT business partner, went out and stood up a little test Hadoop cluster on some decommissioned hardware. We played around with it and solved a problem we were having with an overstretched traditional relational database implementation by moving that data to Hadoop, and it gave those analysts the power they needed to actually use the data that was there. That test implementation was so successful, and we have so many projects as a result of it, that we went out and bought a new, dedicated Hadoop cluster for my small team of analysts. Again, a little bit of success, a little bit of positive work, and now Compute Canada is turning to us and saying, how can we work on this? How can we support you guys? Because currently we support researchers across the country in the traditional MPI parallel computing world; how can we implement something that allows us to elastically provision Hadoop on demand for research needs? In the fullness of time it'll be the whole country, but at the moment it's going to be Calgary. So what's happening in those traditional MPI clusters, of which there are lots, is that they're taking chunks of them and making them elastic. They're putting OpenStack on them and saying, we're going to still use them for MPI, and the VMs are going to be spun up on OpenStack, and it's going to work that way.
Well, we now have the ability to write Puppet scripts to spin up Hadoop clusters on that new elastic architecture, so the regular MPI consumer now has another option. Right now in Calgary we've made that migration: our primary big cluster is fully Puppetized, which is a terrible phrase, but it's fully Puppetized. So now we can actually say, let's hit restart and spin the whole thing up as a Hadoop cluster, or let's hit restart and give it a 60/40 split between Hadoop and traditional workloads. We have that ability to elastically provision. We're very fortunate right now that demand for Hadoop is just hitting the threshold where we're the ones in town who know about it and are using it, people are excited about what's coming of it, and students are starting to slip over from their traditional MPI world to Hadoop-driven distributed computing. But it's only just beginning, right? The university has been very good to us in proactively adjusting its traditional services to suit the demands of our researchers.

How do you see Hadoop impacting some of the more traditional approaches? Do you see it eclipsing them at some point? Is it just too early to tell? What's the impact going to be long-term?

I think at the moment, performance-wise, we're not quite there. Workflow-wise, if you think about the simplest possible Hadoop undertaking, it's the reducer-less map job: you just put your math in a mapper and throw it across the cluster, and out come all these files that are the answers. That's fairly in line with the MPI ideal, other than some frameworks that do aggregation after the fact. Our performance on Hadoop is nowhere near what you can do in C on MPI at the moment. But in one particular genomic scenario, we're trying to build a giant relationship matrix.
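The reducer-less pattern he describes is the simplest Hadoop Streaming job: a mapper script reads records on stdin, does the math on each one, and writes answers to stdout, with the number of reduces set to zero so the mapper output files are the final result. A minimal sketch, where the per-record computation is a stand-in, not their actual genomics code:

```python
#!/usr/bin/env python3
"""Minimal reducer-less Hadoop Streaming mapper (illustrative sketch).

Launched roughly as:
  hadoop jar hadoop-streaming.jar -D mapreduce.job.reduces=0 \
    -mapper mapper.py -input <in> -output <out>
With zero reducers, each mapper's output files are the final answers.
"""
import math
import sys


def compute(values):
    # Stand-in for whatever heavy per-record math the job distributes;
    # here, just the Euclidean norm of the record's values.
    return math.fsum(v * v for v in values) ** 0.5


def run_mapper(lines, out=sys.stdout):
    # Each input line: record_id <TAB> value1 <TAB> value2 ...
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        record_id, values = fields[0], [float(v) for v in fields[1:]]
        out.write(f"{record_id}\t{compute(values):.6f}\n")


if __name__ == "__main__":
    run_mapper(sys.stdin)
```

Because the mapper is just a stdin-to-stdout filter, it can be tested locally with `cat sample.tsv | ./mapper.py` before ever touching the cluster, which is part of what makes the pattern approachable compared with MPI code.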
We were able to reduce compute time on Hadoop, through iteratively figuring things out, from 12 hours to seven hours to five hours to three hours, on the same data and the same problem. So it's coming, it's getting there. I think we'll only see the MPI guys start to trickle across when we offer an easy enough framework to work in and speed that's not completely different.

Hey, well, thanks for coming on theCUBE, really appreciate it. Day one of two and a half days of coverage here on theCUBE; this is Hadoop Summit 2014. I'm John Furrier with Jeff Kelly. We'll be right back with our next guest.