Live from the Fairmont Hotel in San Jose, California, it's theCUBE at Big Data SV 2015.

Okay, welcome back everyone. We are here live in Silicon Valley for Big Data SV, part of the Strata + Hadoop World conference, Big Data Week here in Silicon Valley, and all the action is happening inside theCUBE and out around the facilities here at the convention center. This is theCUBE, our flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier. My co-host this week is Jeff Kelly, Big Data analyst at Wikibon. Our next guest is Chris Poulin, principal partner at Patterns and Predictions. Welcome to theCUBE.

Big Data Week: predictions, patterns. What are you seeing? Analytics are huge; the killer app again this year is analytics. But interesting dynamics are changing, and obviously the market's evolving. We've been talking about this on theCUBE for a while. Even three years ago, Ping Li, who was on our panel last night, and Mike Olson were saying this is the year of apps, and we've got $100 million funds, which had just never happened before. Analytics was really the killer app for the first few years of Big Data because it was easy and the market was growing underneath it. Now integrating into applications is a big topic, and machine learning is front and center. It's native; people are moving it out of academia into the mainstream. So what's your take on all this? The computer science and the algorithms are all here. What's the state of the union you're seeing, some of the patterns this week?

Well, I think one of the topics that's been brought up in earlier segments is solving a problem for the business, right?
So that sounds like trivial marketing speak, but ultimately it's a really practical problem: for all of our successful projects, whether it's military mental health, financial prediction, or any other application of Big Data, we've ultimately solved the problem by defining what the problem is, labeling it, and then optimizing the solution toward solving that problem. I say all that in abstract terms, but the practical meaning is that you need to solve a specific problem and then find the big data tools that get you there.

The final point, to answer your question directly: the reason we haven't seen the explosion of analytics is that people have been more focused on the tools, the cool whiz-bang things you can do, how fast the runtime is, how fast you can solve the problem, maybe what your accuracy rates are. In fact, last year you saw a consolidation of a lot of machine learning shops, like DeepMind and others that compete with us, and the benchmark for their success revolved around stellar accuracy levels and the unstructured and unsupervised learning they were able to demonstrate that maybe even Google wasn't demonstrating by itself. So the answer is that people are not solving big enough problems; they're just adding some runtime optimization or "data lake-ification" to standard customer service.

Data lake-ification, love that term. Data lake, man, you've got to get "data ocean" in there. That's right. I'm going to make that a standard just by saying it.

No, but let's get to data science. President Obama is up there, and one of our industry's own is now Chief Data Scientist in the government, huge new changes in government that raise the visibility of data science. This is your world, this is your community.
Is data science at rock-star status in your mind right now? Are you seeing that, or is it just a promotional thing? Do you see that swell really coming mainstream, or is it a gimmick by the Obama administration trying to be cool?

Well, no, I think we're at an inflection point. I have my own mental benchmark: from an HR perspective, it's very hard to hire data scientists. It's not hard to train entry-level data scientists, especially with all the analytical tools I was just talking about and all the new whiz-bang things that have come out in the past two years. What's hard to find is people who have seen that tried-and-true workflow, who know the pitfalls, who know how to deploy statistical analysis at scale, and not just at scale but in a way that dashboards it so the executives, the stakeholders, the people who write the checks can actually buy in. That's where I see data science from a leadership perspective.

From a purely scientific perspective, I do think we're at an inflection point, and so does Google with its DeepMind acquisition. You're seeing human-like performance in a lot of these automated systems, in modalities that until now humans have dominated, so you're seeing a true emergence of automation on top of classical big data.

Classical big data, like from four years ago. Yeah, old school, classic rock, come on, the 60s. Yeah, the 2000s. Well, machine learning has been around since the 60s, so that's classic, I guess.

Why don't we take a step back? Tell us a little bit about Patterns and Predictions, because it's an interesting story, how you got your start and some of the early work you've done with the Durkheim Project.
Yeah, so Patterns and Predictions was actually started in collaboration with a Danish analytics firm, a Bayesian mathematics firm, about 12 years ago. Our idea was to create a "SAS-lite": an open, reduced-complexity library that people could use to do predictive analytics. Pretty quickly we realized that one of our niches was unstructured data and processing different data types. Again, this was about 10 years ago, so back then I sounded very crazy.

Yeah, I bet you got a lot of strange looks 10 years ago, talking about this stuff.

Right, I was pitching, "It's unstructured data, don't you get it? And Sergey Brin just said this," and so on. People just thought I was strange, which may be true. You've got to be crazy to be in big data. So we had this SAS-lite, as I think Josh Wills from Cloudera called it, with an unstructured workflow on top of it. And pretty quickly we realized that with all of this data, we couldn't just run it on a bigger quad-Xeon box with more and more RAM. This was the early 2000s, and we realized we had a problem. So we went to the supercomputing community and made inroads with a lot of academic institutions to build a custom MPI layer that we could manage our services on, because ultimately for us it's complexity management in the stack. So we built this really ugly MPI architecture that worked, but it would drop things.
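The workload described here, fanning event-driven analytics across machines without hand-managing message passing, is what MapReduce-style frameworks such as Hadoop later made routine: the framework owns distribution and fault tolerance, and the application supplies only a map step and a reduce step. A minimal single-process sketch of that model (illustrative only, not Patterns and Predictions code):

```python
from collections import defaultdict

def map_phase(records):
    # Emit (key, value) pairs; here, count tokens across
    # unstructured log lines, the kind of workload described above.
    for line in records:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # The framework groups pairs by key; the reducer just sums.
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return dict(grouped)

records = ["ERROR disk full", "WARN retry", "ERROR disk full"]
counts = reduce_phase(map_phase(records))
print(counts["error"])  # 2
```

On a real cluster, Hadoop runs these same two phases across many nodes and reschedules failed tasks automatically, which is exactly the "it would drop things" problem a hand-rolled MPI layer has to solve itself.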
And life was hair-pulling, but we had this cool, fully distributed, unstructured, event-driven platform that would do what we wanted in terms of predictive analytics. Then a lot of things heated up in parallel, and right around the late 2000s I found out about Hadoop; I don't think Cloudera had been started yet, or it was in the process of getting started. So I reached out, indirectly, to Doug and Amr at Cloudera, and through another guy who's now at MapR, Ted Dunning, we were working on distributed systems, trying to meld machine learning and Hadoop.

Right. And again, this is now 2008, and it's still crazy, right? Yeah, this is still pretty early. Yeah, four or five years ago.

So the idea, going back to how I rambled in my opening, was not just that it was a cool project, though it was; we literally had to answer how fast we could compute on a reduced-complexity infrastructure and at what cost, and how accurately, and we had very specific algorithm implementations we wanted to do. That's how I got involved with all of the Hadoop guys. I was actually MapR's first customer; before they were in beta, I got their first license.

Did Srivas get involved? He was involved. Yeah, he was involved. And I don't know if John has that check, but... He mounted it on his wall, yeah. So we were just trying to build up a capability.

What were the magnified learnings you got out of that?
Was there anything you could take from it? Because you were at a point in time with a cutting-edge vision of what people want to do now, doing it early, while the infrastructure around you was still developing, so you were dealing with that in real time. What were the key learnings from that stage that could be applied today?

Well, back to these high-level goalposts for a lot of this infrastructure, and I'm sure I'm not the first person to say this: the stack is complicated. Jeff talked about it last night in his speech; ultimately it's a complex stack with a lot of moving parts. You could say, well, throw more bodies at it, which is a lot of what large companies have been doing, and you've heard from your other interviewees that that's not always going well, because propagating institutional knowledge is a problem. It's the classic thing where the one guy who doesn't bathe knows how the stack works, but the knowledge doesn't propagate through the institution. With Cloudera, and I'm excited that they're doing great, they kept building more and more tools. They weren't the only ones, so obviously they were responding to pressure in the market.

But they were first movers, so they had a green field. Exactly. And now competition comes in. And the competition was building newer and newer tools: Storm came out, and obviously Spark, and so on.

Do you think there's an us-against-them mentality with Cloudera? I mean, Cloudera was first, and they were the pioneers, right?
But now there's obviously some tension between this new Open Data Platform and what Mike Olson's blog post talked about, which was essentially saying, hey, it's never going to work, it's ridiculous. I'm paraphrasing, he didn't use those words, but that's basically what he was saying: stay with us, Cloudera has the right model. Cost and confusion. What's your take on that?

Yeah, I think it's just a logical strategic move for those other guys to form the consortium. Cloudera was the first mover and just raised a lot of money, so they're in a good position. The other guys are in good positions as well, but separately, in a separate capacity, they're not quite as strong.

So they were loosely coupled, and if they come together there's gravity and they can accelerate. I'm sure the Cloudera guys are not happy about the news for multiple reasons. More muscle. They're getting ganged up on, almost. Exactly, yeah.

So talk about... you mentioned solving problems. Ultimately, that's what this is about, right? It's not about how fast your platform runs or what cool algorithms you can run; it's about how you solve a problem. I know from having chatted with you a little bit about the Durkheim Project that it's solving a pretty compelling problem. Talk a little bit about that project, what you learned there, and how you can apply some of those lessons to solving other business problems.

Absolutely. In answering the earlier questions, I brought you up to about 2009 or 2010. Another hour, let's go.
Yeah, and at that point the Joint Chiefs of Staff sent a representative to Dartmouth-Hitchcock Medical Center looking for solutions for real-time prediction of suicide risk among military veterans, along with PTSD and other mental health diagnoses. Through my affiliation with Dartmouth, I was contacted to build a consortium to come up with a solution to this particular problem. Now, we responded back...

That was a pretty big problem. This was happening; there were some serious numbers of suicides among veterans.

Right, and unfortunately they've only abated slightly, if at all. So I put together a team, led by my small group, Patterns and Predictions. The founding research members were Cloudera, for distributed infrastructure, and Dartmouth-Hitchcock Medical Center and Dartmouth, for medical curation and the physicians. We actually proposed to DARPA more of a social-network and mobile-media-driven architecture, and with their funding we further developed the platform I've been referring to. The Durkheim Project is our most visible use case in terms of solving a real problem: specifically, labeling in near real time, within a 30-second window to be exact, the suicide risk for any number of individuals, based on machine learning training data we built with the Veterans Administration, which joined the team later. The marquee team member that actually did the press and distribution for us was Facebook; we actually got up there and pitched with them. So we solved a real problem with the Durkheim Project using big data and all of the bells and whistles.

Right, and now the job is, how do we apply that? Back to the big data question: what's it going to take to move this beyond?
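The near-real-time labeling described here can be caricatured as supervised text classification: train on labeled examples, then score incoming text within the latency budget. Everything below, the training sentences, the word weights, and the zero threshold, is fabricated for illustration and bears no relation to the Durkheim Project's actual models, features, or data:

```python
from collections import Counter
import math

# Toy labeled training data: (text, label) where 1 = high risk.
# Entirely made up for illustration.
TRAIN = [
    ("feeling hopeless and alone", 1),
    ("great day with friends", 0),
    ("cannot cope anymore", 1),
    ("excited about the new job", 0),
]

def tokens(text):
    return text.lower().split()

def train(data):
    # Crude Naive Bayes-style model: per-word log-likelihood
    # ratio between the two classes, with add-one smoothing.
    pos, neg = Counter(), Counter()
    for text, label in data:
        (pos if label else neg).update(tokens(text))
    vocab = set(pos) | set(neg)
    return {w: math.log((pos[w] + 1) / (neg[w] + 1)) for w in vocab}

def risk_score(model, text):
    # Sum of word weights; unseen words contribute nothing.
    return sum(model.get(w, 0.0) for w in tokens(text))

model = train(TRAIN)
high = risk_score(model, "so hopeless")
low = risk_score(model, "great new job")
print(high > 0, low > 0)  # a positive score flags the text for human review
```

In a real deployment the model, features, and data pipeline would be vastly richer; and as Poulin stresses later in the interview, the hard part is less the classifier than the governance around what happens once a score crosses the threshold.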
Beyond talking about infrastructure and, to some extent, analytics, to actually building applications and solving a problem, as you did in that case. What do you think it's going to take to do that at scale for the industry?

Well, this goes back to your point last night about data governance: the key challenge now is social engineering, not technology. We have the technology now. And it was $2 million, not $6 million, but the point is we now have this technology. It is effective, it works, it impresses the heck out of the psychologists at the Veterans Administration, and that gets the renewal. But the key thing is, what does that intervention look like? Your big data needs to be grounded in some sort of social impact, some proscriptive or prescriptive scenario that makes sense from a data governance perspective, from a social engineering perspective, and from a compliance perspective.

Chris, we've got to wrap up, we're getting the hook here, but I want to give you the final word. What is your take on where we are now? And quickly, share with the folks out there what's going on right now. What do you see happening, and what should people be looking at that isn't being talked about in this industry but is important?

So I think you're going to see some reasonable upheaval in the next couple of years as people start to automate more and more systems. An interesting news item was the Uber CEO talking about how he wants to get into self-driving cars too, because the new hot thing is to throw billions of dollars at self-driving cars. And drones are now permitted everywhere; the U.S. Department of Defense has authorized armed drones to be sold to other governments.
So you are seeing a tremendous amount of automation of daily life, and these things may or may not be scary.

The word "geospatial" means something now, doesn't it?

Absolutely. You're seeing a lot of what I like to call existential risks to individuals: their privacy, their freedom, and potentially their lives. So that's why I try to build white-hat big data systems, I guess.

Well, this is great stuff. I love this stuff; I could go on another hour, but these guys want to wrap up and we're getting kicked out. I really appreciate it. Thanks for sharing the perspective; great to see you again. Website, blog, for folks who want to get involved in some of these awesome projects? Because it's real life. Changing life and society is part of the big data thing; we saw that with Obama coming on and talking about it. So, any URL?

Yeah, patternsandpredictions.com, or just Google "Patterns and Predictions" or "Chris Poulin and Cloudera".

Chris, thanks for coming. I really appreciate it. This is theCUBE; we'll be right back with a wrap-up of day two after this short break. Thanks for watching, stay tuned. I'm John Furrier, with Jeff Kelly. We'll be right back.