Hi everyone, and welcome to the Big Data Deep Dive with the Cube on EMC TV. I'm Richard Schlesinger, and I'm here with tech industry entrepreneur and Wikibon analyst Dave Vellante and SiliconANGLE CEO and editor-in-chief John Furrier. For this last segment in our show we're talking about the future of big data, and there aren't two better guys to talk about that than you, so I'm glad that you guys are here. Let me tee up this conversation a little bit with a video that we did, because the results of leveraging big data are only as good as the data itself. There has to be trust that the data is true and accurate and as unbiased as possible. So EMC TV addressed that issue, and we're just trying to keep the dialogue going with this spot.
We live in a world that is in a constant state of transformation: political, natural. Transformation that has many faces, many consequences. A world overflowing with information, with the potential to improve the lives of millions, the prospects of nations, with generations in the balance. We are awakening to the power of big data. We trust, and together transform our future.
So gentlemen, trust. Without that, where are we, and how big of an issue is that in the world of big data?
Well, you know the old saying: garbage in, garbage out. In the old days, the single version of the truth was what you were after with data warehousing, and people say that we're further away from a single version of the truth now with all this data. But the reality is that with big data and these new algorithms you can algorithmically weed out the false positives, get rid of the bad data, and mathematically get to the good data a lot faster than you could before. Without a lot of processes around it, the machines can do it for you.
So John, while we were watching that video you murmured something about how this is the biggest issue, this is cutting-edge stuff, this is what's important.
I mean, trust issues, and the trust equation right now is still unknown. It's evolving fast. You see it with social networks, you see things go viral on the internet, and we live in a system now, with mobility and cloud, where things are scaling infinitely these days. So good data scales big and bad data scales big. Whether it's a rumor that you hear that goes viral, or other data, data trust is the most important issue. And sometimes big data can be creepy. So this is a really, really important area. People are watching it, and trust is the most important thing.
But you know, you have to earn trust, and we're still sort of at the beginning of this thing. So what has to happen to make sure that you don't get the garbage in, so you don't get the garbage out?
It's iterative, and we're seeing a lot of pilot projects, and then those pilot projects get reworked, and then they spawn into new projects. So it's an evolution, and as I've said many, many times, it's very early. We're just barely scratching the surface here.
It's evolving too, and the nature of the data needs to be questioned as well. So what kind of data? For instance, if you don't authorize your data to be viewed, there are all kinds of technical issues around that.
That's one side of it, but on the other side of it, I mean, there are bad people out there who would try to influence whatever conclusions were being drawn by big data programs.
Especially when you think about big data sources. Companies start with their internal data, and they know that pretty well. They know what the warts are, they know how to manipulate it. It's when they start bringing in outside data that this gets a lot fuzzier.
Yeah, it's a problem. And security.
I talked to a guy not long ago who thought that big data could be used to protect big data, that you could use big data techniques to detect anomalies in data that's coming into the system, which is poetic if nothing else.
That's totally happening, by the way. It's a good solution.
I want to move on, because we really want to talk about how this stuff is going to be used, assuming that these trust issues can be solved. The best minds in the world are working on this issue, trying to figure out how best to leverage the data we all produce, which has been measured at five exabytes every two days. Somebody made an analogy: if a byte was a paper clip and you stretched five exabytes' worth of paper clips, they would go to the moon or whatever. Anyway, it's a lot of bytes.
It's a lot of stuff. Actually, I think it's a lot of paper clips.
Well, many, many times. I don't know. I lost track of my paper clips. But anyway, the best minds are trying to figure out how to maximize the value of that data, and they're doing that not far from where we sit, at MIT, at a place called CSAIL, which was just recently set up. CSAIL stands for the Computer Science and Artificial Intelligence Lab. So we went there not long ago. It's just down the Mass Pike. It was an easy trip, and this is what we found. It's fascinating.
Everybody is obviously talking about big data all the time, and you hear it used to mean all different types of things. So one of the things we're trying to do in the Big Data at CSAIL program is to understand what different types of big data exist in the world, and how we help people understand which problems fall under the overall umbrella of big data. CSAIL is the largest interdepartmental laboratory at MIT. So there are about a hundred principal investigators, that's faculty and senior research scientists, and about 900 students who are involved.
Basically, with big data, almost anything you do has to be at a much larger scale than we're used to.
And the way it changes that equation is that you have to have the hardware and the software to do the things you're used to doing. You have to make them accommodate a larger size. A much larger size.
A lot of times when people talk about big data, they mean not so much the volume of the data but that the data, for example, is too complex for their existing data processing system to be able to deal with. So I've got information from a social network, from Twitter. I've got information from a person's mobile phone. Maybe I've got information about retail records, transactions, a whole very diverse set of things that need to be combined together.
What this says is that if you added this predicate to your query, it would remove the dots that you selected.
That's part of what we're trying to do here in Big Data at CSAIL, and in our big data effort in general at MIT: to build a set of software tools that allow people to take all these different data sets, combine them together, ask questions, and run algorithms on top of them that allow them to extract insight.
I'm working with a data set that was derived by NASA, but the purpose of my work right now is to take data sets within databases and, instead of querying them for table results, query them to get visualizations. So instead of looking at large sets of numbers and text and whatnot, you get a picture. The motivation behind that is that humans are really good at interpreting pictures; they're not so good at interpreting huge tables, and with big data that's a really big issue. So this will allow scientists to visualize their data sets more quickly so they can start exploring and looking at them faster, because with big data it's a challenge to be able to visualize and explore your data.
I'm here just to proclaim what you already know, which is that the hour of big data has arrived in Massachusetts, and it's a very, very exciting time.
So Governor Patrick was here just a few weeks ago to announce the Mass Big Data Initiative. I think what he recognizes, and it's partly what we recognize here, is that there's expertise in the state of Massachusetts in areas that are related to big data, partly because of companies like EMC as well as a number of other companies in the database and analytics space. EMC is a partner in our Big Data at CSAIL initiative. Big Data at CSAIL is an industry-focused initiative that brings companies together to work with MIT to think about big data problems, to help understand what big data means for the companies, and also to allow the companies to give feedback to us about what the most important problems are for them to be working on, and potentially to give these companies access to our students.
I think the future will tell us, and it's hard to say right now, because we haven't done a lot of analyzing and interpreting of big data. We haven't reached our potential yet, and there are just so many things that we can't see right now.
So one of the things that people who are involved in big data tell us is that they have trouble finding the skill sets, the data science capability and capacity. And so, seeing videos like this one at MIT, there's a new breed of students coming out of there. They're growing up in this big data world, and that's critical to keep the big data pipeline flowing. John, you and I have spent a lot of time on the East Coast looking at some of the big data companies. It's almost a renaissance for Massachusetts and Cambridge, and it's very exciting to see. Obviously there's a lot going on on the West Coast as well.
Yeah, I mean, obviously I'm impressed with MIT, and the area around MIT in Cambridge is exploding with young new guns coming out of there, the new rock stars if you will. But in California, where we're headquartered in Palo Alto, we had a chance to get up close to Google, Facebook, and Jeff Hammerbacher. We'll show a video in a second of my interview with him at Hadoop Summit. He was the first guy at Facebook to build a data platform, which has now completely changed Facebook and made it what it is. He's also the co-founder of Cloudera, the leader in Hadoop, which we've talked about, and he's the poster child, in my opinion, of a data scientist. He's a math geek, but he understands the world's problems. It's not just a tech thing, it's a bigger picture. I think that's key. He knows that you have to apply this stuff, and there's the passion that he has. So this video is from Jeff Hammerbacher, co-founder of Cloudera. Watch this video, and the thing to walk away with is that big data is for everyone and it's about having the passion.
Jeff Hammerbacher, data scientist from Cloudera, co-founder, Twitter handle @hackingdata, welcome to the Cube.
Thank you.
So you're known in the industry. Everyone knows you on Twitter, you're on Quora heavily, I follow you there. At Facebook you built the data platform, you were one of the main guys there hacking the data, and look what happened, right? I mean, the tsunami that Facebook has is amazing.
Co-founder of Cloudera, you saw the vision. Amr Awadallah always says on the Cube, "We've seen the future; no one knows it yet." That was a year and a half ago. Now everyone knows it. So how do you feel about that as the co-founder of Cloudera? $40 million in funding, validation, again more validation. How do you feel?
Yeah, no, sure, it's exciting. I think as data volumes have grown, and as the complexity of the data that is collected and analyzed has increased, novel software architectures have emerged. What I'm most excited about is the fact that that software is open source and we're playing a key role in driving where that software is going. And I think what I'm most excited about on top of that is the commodification of that software. I'm tired of talking about the container in which you put your data. I think a lot of the creativity is happening in the data collection, integration, and preparation stage. There was a tremendous focus over the past several decades on the modeling aspect of data, so we really increased the sophistication of our understanding of classification and regression and optimization and all of the hardcore modeling that gets done. And now we're seeing, okay, we've got these great tools to use at the end of the pipe, so how do we get more data pushed through those modeling algorithms? So there's a lot of innovative work.
So were you thinking at the time about how you'd make money at this, or did you say, well, let's just go solve the problem and good things will happen?
It was a lot more of the latter. I didn't leave Facebook to start a company. I just left Facebook because I was ready to do something new, and I knew this was a huge movement, and I felt that it was very nascent and unfinished as a software infrastructure. So when the opportunity with Cloudera came along I really jumped on it, and I've been absolutely blown away by the commercial success we've
had. So I certainly didn't set out with a master plan about how to extract value from this. My master plan has always been to really drive Hadoop into the background of enterprise infrastructure. I really want it to be as obvious of a choice as Linux.
We've talked a lot at this conference and others about Hadoop moving from the fringe to the mainstream, commercial enterprises, and all those guys are looking at it. I heard JP Morgan today: "We're building competitive advantage, we're saving money." Those guys do have a master plan to make money. Does that change the dynamic of what you do on a day-to-day basis, or is that really exciting to you as an entrepreneur?
Oh yeah, for sure, it's exciting, and what we're trying to do is facilitate their master plan, right? We want to identify the commonalities in everyone's master plan and then commoditize them so that they can avoid the undifferentiated heavy lifting that Jeff Bezos points out. No one should be required to invest tremendous amounts of money in their container anymore, right? They should really be identifying novel data sources, new algorithms to manipulate that data, and the smartest people for using that data, and that's where they should be building their competitive advantage. We really feel that we know where the market's going, and we're very confident in our product strategy. I think over the next few years you guys are going to be pretty excited about the stuff we're building, because I know that I'm personally very excited. And yeah, we're very excited about the competition, because, number one, more people building open-source software has never made me angry.
Yeah. So that's kind of the marketplace. So we're talking about data science, and you're building a data science team. So first, let's drill into data science. Talk about what you're doing at Cloudera around data science, your team and
your goals. And what is a data scientist? I mean, this is now a new thing. Is it the DBA for Hadoop, or what?
Sure, sure.
So what's going on?
Yeah, so to reflect on the genesis of the term: when we were building out the data team at Facebook, we kind of had two classes of analysts. We had data analysts, who were more traditional business intelligence: building canned reports, performing data retrieval queries, doing lightweight analytics. And then we had research scientists, who were often PhDs in things like sociology or economics or psychology, and they were doing much more of the deep-dive, longitudinal, complex modeling exercises. I really wanted to combine those two things. I didn't want to have those two groups be separate, in the same way that we combined engineering and operations in our data infrastructure group. So I literally just took "data analyst" and "research scientist," put them together, and called it "data scientist." So that's kind of the origin of the title. As for how that's translated into what we do at Cloudera, I've recently hired two folks into a burgeoning data science group at Cloudera. The way we see the market evolving is that the infrastructure is going to be commoditized.
So what's the mindset to really be a data scientist, and what should we be thinking about? I mean, there's no real manual. Most people are bored with math skills, economics, and these kinds of disciplines you mentioned. How should someone prepare themselves? How do they approach it? How does someone say, "Hey, I want to hire a data scientist, how do I fill out the req form?" These kinds of things.
Well, I played a lot of sports growing up, and there's this phrase, being a gym rat, which is someone who's always in the gym just practicing whatever sport it is that they love. I find that most data scientists are sort of data rats. They're always going out grabbing new data. So there's a genuine curiosity about seeing what's happening in data that you really can't teach. But in terms of the skills that are required, I didn't really find any one background to be perfect. So I actually put together a course at the University of California, Berkeley, and taught it this spring, called Introduction to Data Science. I'm teaching it again this coming spring, and they're actually going to put it into the core curriculum for computer science in the fall of next year.
All right, Jeff Hammerbacher, thanks so much for that insight. Great, epic talk here on the Cube, another epic conversation shared with the world live. Congratulations on the funding, another $40 million, great validation, and congratulations for essentially being part of data science and founding that whole movement at Facebook, and now with Amr Awadallah and the team at Cloudera. You did a great job. So congratulations. And on all the competition: that's capitalism, right?
Okay, it's great, isn't it, that with all these great minds working in this industry, we're so early in this that they still can't really define what a data scientist is? I mean, talk about an industry in its infancy.
That's what's so exciting. Everyone has a different definition of what it is, and what that means is that it's everyone. I think data science represents the new everybody. It could be a housewife, it could be a homemaker, to an eighth grader. It doesn't matter. If you see an insight and you see something that can be solved, data is out there, and I think that's the future.
And Jeff Hammerbacher talked about spending all this time in technology with undifferentiated heavy lifting, and I'm excited that we are moving beyond that into essentially the human part of big data. It's going to have a huge impact, as we talked about before, on the productivity of organizations and potentially the productivity of lives.
I mean, look at what we've talked about this afternoon. We've talked about predicting volcanoes, we've talked about medical issues, we've talked about pretty much every aspect of life. I guess that's really the message of this industry now: the folks who are managing big data are looking to change pretty much every aspect of life.
This is the biggest inflection point in the history of technology that I've ever seen, in the sense that it truly affects everything. The data that machines generate, the data that humans generate, the data that forests generate: everything is generating data. So this is a time where we can actually instrument it, and this is why there's massive disruption in this area.
And disruption, we should say for the uninitiated, is a good thing in this business.
Wealth creation, entrepreneurship, companies being founded. It's a great opportunity.
Well, I appreciate your time. Unfortunately, I think that's going to wrap it up for our Big Data Deep Dive. John and Dave, the Cube guys, have been great. I really appreciate you showing up here and lending your insights and expertise and all that. And I want to thank you, the audience, for joining us. Stay tuned for the ongoing conversation on the Cube and to EMC TV to be informed, inspired, and hopefully engaged. I'm Richard Schlesinger. Thank you very much for joining us.