Jeff Hammerbacher, data scientist from Cloudera, co-founder, hackingdata is his Twitter handle, welcome to theCUBE. Thank you. So, you're known in the industry, everyone knows you on Twitter, you're on Quora heavily, I follow you there. At Facebook, you built the data platform, one of the main guys there hacking the data over at Facebook. Look what happened, right? I mean, the data tsunami that Facebook has is amazing. Co-founder of Cloudera, you saw the vision. Amr Awadallah is always quoted on theCUBE: "We've seen the future, no one knows it yet." That was a year and a half ago, now everyone knows it. So, how do you feel about that? As a co-founder of Cloudera, $40 million in funding, validation again, more validation? How do you feel? Yeah, no, for sure it's exciting. I think, as data volumes have grown, and as the complexity of data that is collected and analyzed has increased, novel software architectures have emerged. And I think what I'm most excited about is the fact that that software is open source. And we're playing a key role in driving where that software is going. And I think what I'm most excited about on top of that is the commodification of that software. I'm tired of talking about the container in which you put your data. Spending the majority of your budget for analytics on the container itself, rather than the people and processes to derive business value from that data, was always very frustrating to me. So I'm very excited that everyone's gonna have a level playing field in the same way that the operating system is a level playing field today with open source Linux. Hadoop is becoming the Linux of large-scale data management, and companies are going to begin to differentiate themselves on what they do with that container. We're going to talk about data science in particular. Dave and I have been talking about that since EMC World earlier in the year, when their whole campaign was around data science.
But I want to ask you on more of an entrepreneurial note, because I know you're heavily involved in Quora, you're a big participant in that community, involved in startups, obviously with Cloudera and in the ecosystem. What kind of creativity are you seeing being unleashed with the data? I mean, obviously data is liberating. I wrote a post two and a half years ago, "data is the new development kit," where developers would essentially hack data and that's the development environment. What kinds of creativity? We were talking with Mike Olson about the Big Data Fund and the old VC adage, "it's a feature, not a company." But that's not the case anymore. A feature absolutely can be a company now because one creative idea could explode. So what kind of creativity are you seeing out in the marketplace from entrepreneurs and companies? I think a lot of the creativity is happening in the data collection, integration, and preparation stage. So I think there was a tremendous focus over the past several decades on the modeling aspect of data. So we really increased the sophistication of our understanding of classification and regression and optimization and all of the hardcore modeling that gets done. And now we're seeing, okay, we've got these great tools to use at the end of the pipe. So now how do we get more data pushed through those modeling algorithms? So there's a lot of innovative work in the R community. The way I use R has really changed a lot since I first started using it five, 10 years ago, with a lot of the tools that Hadley Wickham has built for data manipulation. So plyr and reshape have been very influential in terms of how I navigate data in R, and also in Python with the emergence of things like NumPy and pandas providing similar capabilities for data manipulation. You know, that's very exciting. And I think you're starting to see some emerging tools for data integration.
So tools like Google Refine and the Wrangler project from Berkeley, as well as Cloudera's RecordBreaker for schema inference, I think are gonna facilitate the transition from a novel data set for which you don't know the structure into a well-structured data set which is amenable to modeling. So how did you approach it? You mentioned before, Jeff, that you were frustrated that all the money was going into the container and not the people and processes around the container. You were speaking as a practitioner, I presume. Correct, yeah, yeah. And so you had a passion, I guess, to solve that problem. Were you thinking at the time, I mean, because it's kind of non-intuitive how you make money if you commoditize the container. So were you thinking at the time how you make money at this? Or did you just say, well, let's just go solve the problem and good things will happen? It was a lot more of the latter. I didn't leave Facebook to start a company. I just left Facebook because I was ready to do something new. And I knew this was a huge movement and I felt that it was very nascent and unfinished as a software infrastructure. So when the opportunity with Cloudera came along, I really jumped on it and I've been absolutely blown away by the commercial success we've had. So I certainly didn't set out with a master plan about how to extract value from this. My master plan has always been to really drive Hadoop into the background of enterprise infrastructure. I really want it to be as obvious of a choice as Linux. And then, yeah, I kind of assumed that once you have that kind of ubiquity, the importance of data management and analysis is so clear that the money follows. We had Mårten Mickos as an investor in Cloudera and he came in and talked early on to the company, and he had a really great line where he said, at MySQL we were trying to take a $9 billion industry and turn it into a $3 billion industry but come out on top.
And that's not dissimilar from our goals at Cloudera. And you see, we've talked a lot at this conference and others about Hadoop moving from the fringe to the mainstream; commercial enterprises and all those guys are looking at it. We heard JPMorgan Chase today say, we're building competitive advantage, we're saving money. Those guys do have a master plan to make money. Does that change the dynamic of what you do on a day-to-day basis, or is that really exciting to you as an entrepreneur? Oh yeah, for sure it's exciting. I mean, what we're trying to do is facilitate their master plan, right? We want to identify the commonalities in everyone's master plan and then commoditize them so that they can avoid the undifferentiated heavy lifting that Jeff Bezos points out. No one should be required to invest tremendous amounts of money in their container anymore, right? They should really be identifying novel data sources, new algorithms to manipulate that data, the smartest people for using that data, and that's where they should be building their competitive advantage. So what are those commonalities that you see? Presumably that's where Ping Li is going to be putting a lot of his money. Talk about that a little bit. Yeah, obviously a lot of the commonalities that we see are expressed in the software we build. So we've seen people not just deploy HDFS as a container and MapReduce as a processing infrastructure, but they also deploy facilities for both immutable table-structured data storage, so things like Hive, as well as mutable table-structured data storage with HBase, which has really been driving a lot of our largest engagements, I will say. But in addition, it's not just the container and structured data storage, it's also the higher-level tools for authoring processing tasks. So we recently open sourced a system called Crunch, which enables complex workflows in Java and is modeled after Google's very successful FlumeJava project.
And we've also brought on board the creator of Oozie for workflow description and management, which is heavily used in a lot of our customer base. So we've identified those tools on top and we've also built the tools for integration. So tools like Flume and Sqoop, I think, were fairly unique bets by Cloudera in terms of saying, we don't want to just build you a container, we also want to build you the tools to populate that container. And we've also invested a lot in ODBC drivers and FUSE modules, so we allow people both to perform data ingest reliably with open source tools and to consume data stored in the container with their existing tools, with which they're comfortable. Yeah, so, you know, we love, we had- Unleashing data. Unleashing data. Just listening to you talk, we love to talk about competition, you know, it's interesting. Sure. And earlier, you guys seemed so happy about it. I mean, so sanguine about it, not freaking out, but when I hear you describe the innovations that you're developing, I mean, it underscores the effort that you put into it. Yeah. And it's a long way to go. It's not trivial, is it? Sure, I mean, one of the reasons we went out and raised $40 million is we genuinely believe that the market is going to support multiple billion-dollar companies, and we want to make sure that we're there first and we're leading. We really feel that we have a pretty insurmountable advantage in this market. We've been doing it for three years. We've got well over a hundred customers. Amr and myself were end users of the technology before we even started the company. So we have a distinct empathy for our customers. And we really feel that, you know, we know where the market's going. And we're very confident in our product strategy. And I think over the next few years, you know, you guys are gonna be pretty excited about the stuff we're building because I know that I'm personally very excited.
And yeah, we're very excited about the competition. Because number one, more people building open source software has never made me angry. You know, I fully support it; there are a lot of very great, very solid engineers working at places like EMC, IBM, and Hortonworks that are contributing in a really significant fashion to the platform. So, you know, those guys are not the enemy. But more importantly, we want to know that the market will sustain more than one player, right? So the biggest risk you take as an entrepreneur is market risk: understanding that there are going to be consumers to support the thing that you want to build. And I, you know, I think it's fabulous that not only- It's complete validation. I mean, literally complete validation. You've got this Hadoop World, which really is the cherry on top of the ice cream because it just crosses over with the business awareness of the problems, right? Like when you've got business practitioners in the audience of a geek talk. Yeah, yeah. That's pretty cool. I mean, that's happening. Yeah, no. So you've got a market. Yeah, precisely. And you know, shame on us if we don't leverage this opportunity to produce, you know, some of the greatest open source. So, you know, I'll tell our guys that I really perceive this: you know, over the next two years, Cloudera has really got a chance to become the second most significant commercial open source software vendor that's ever been built, right? Yeah. And, you know, we'll see if we take it. Yeah. I think it's already the fastest growing project in Apache history. So that's- Yeah, exactly. And obviously Red Hat's done an incredible job with Linux and they're, you know, they're number one and they're going to be for a while. But, you know, I reflect on the historical significance pretty regularly and I'm very excited to be part of it. And I know our guys are as well.
Let's talk about, we had the guy from Yale on, the Hadapt founder, total geek, total geek, and now he's done three companies; this guy's just like playing in a candy store, building stuff. As a teenager. Yeah. So, you know, that's kind of the marketplace. So, you know, we were talking about data science. You're building a data science team. So, first tell us, before we drill into data science, talk about what you're doing at Cloudera around data science, your team and your goals, and what is a data scientist? I mean, you know, is it the DBA for Hadoop? Or, you know, what, you know. Sure, sure. What's going on? Yeah. So, you know, to kind of reflect on the genesis of the term, you know, when we were building out the data team at Facebook, we kind of had two classes of analysts. We had data analysts who were more traditional business intelligence, you know, building canned reports, performing data retrieval queries, doing, you know, lightweight analytics. And then we had research scientists who were often PhDs in things like sociology or economics or psychology. And they were doing much more of the deep-dive, longitudinal, complex modeling exercises. And I really wanted to combine those two things. I didn't want to have those two folks be separate, in the same way that we combined engineering and operations in our data infrastructure group. So, I literally just took "data analyst" and "research scientist" and put them together and called it "data scientist." So, that's kind of the origin of the title. And then how that's translated into what we do at Cloudera: I've recently hired two folks into a burgeoning data science group at Cloudera. So, the way we see the market evolving is that, you know, the infrastructure is going to be commoditized. You know, you're going to get open source distributed data management and analysis tools from the vendor who provides the best documentation, support, training, certification, services, et cetera.
And we firmly believe Cloudera is that vendor, but that's not going to be your comparative advantage. We're going to have thousands of customers soon, right? So, your comparative advantage is going to come from how you use those tools to derive business value. And so, we've gone out and we've hired folks from Google, you know, from Yahoo Research, plus my background at Facebook. And these are the people who are the best in the world at using these tools. They've been doing it for, you know, up to a decade when you look at the Google tools, which came before Hadoop was open source. So, what we're trying to do is, you know, really go in, understand our customer base, extract their use cases, identify commonalities, and then build example applications which demonstrate, both to people who are not yet Cloudera customers and to existing Cloudera customers, how they can derive business value from our software infrastructure. So, you know, we've got two folks now, we're growing to six in the first half of next year, and we spend a lot of our time just talking to customers and writing code. Same kind of profile you were talking about, the mix? Is there a certain profile in the hiring? Yeah, yeah, mostly I'm looking for people who've done it before, who've taken, you know, a large-scale distributed file system and MapReduce infrastructure, as well as the attendant tools for data collection, data structuring, and analysis, and who've built complex solutions which drive, you know, millions of dollars in revenue generation. And so, I'm taking people who've done that before and having them ingest other businesses' requirements and generate code which will do that for them.
So, yeah, we look for a mix of, you know, engineering excellence as well as a background in statistics and machine learning to be able to understand the modeling algorithms, but most importantly, you know, the communications capacity to interact with a business user, ingest their requirements, and translate them into software which will meet those requirements. You're really acting as a business, I mean, an accelerant for your customers, right? Oh, yeah, for sure. That's exactly what we're trying to do, to close the gap, because, you know, it's a lot to comprehend. Like, even the distributed file system and MapReduce, which are two of, like, you know, the 16 open source projects that we distribute to them, even those are hard to grok for our customers. So, to expect them to not only understand how the technology works, but then to think through how to apply that technology to their business is a bit too much early on. So, we've really invested in performing that function for our customers and then training them to do that themselves. Let's talk about that, because that's, you know, priming the pump, obviously, because the demand is so high for the solutions and they don't have the personnel in place. What's the mindset for the folks, whether your customers, or other potential customers, or engineers and developers for that matter, who are new to the game, who might not have done that before? Because, you know, if that's the criteria, you're going to hire the best of the best. But soon, you know, you've got everybody and it's like, well, I've never done it. You got to do it. Someone's got to start, right? So, what's the mindset to really be a data scientist? And, you know, what should we be thinking about? I mean, there's no real manual. Most people come with math skills, economics, and these kinds of disciplines you mentioned. How should someone prepare themselves? How do they approach it?
How does someone say, hey, I want to hire a data scientist? How do I fill out the req form? Yeah, yeah. These kinds of things. Well, I tend to, you know, I played a lot of sports growing up and there's this phrase, you know, of being a gym rat, which is someone who's always in the gym just practicing whatever sport it is that they love. And I find that most data scientists are sort of data rats. They're always going out, grabbing a new data set. You know, they hear like, oh, Yandex has a new data set and they've got a competition for it. So, I immediately go and download that data set, you know, pull it into pandas in Python and then just, you know, manipulate it. Just check out what's going on. And so there's a genuine curiosity about seeing what's happening in data that you really can't teach. But in terms of the skills that are required, I didn't really find any one background to be perfect. So I actually put together a course at the University of California, Berkeley and taught it this spring, called Introduction to Data Science. And I'm teaching it again this coming spring and they're actually gonna put it into the core curriculum in the fall of next year for computer science. And so what I really try to do is break down what are the things which I see people use frequently in practice, which are not taught well in the undergraduate curriculum. So the five components of that Introduction to Data Science course are, number one, data collection and integration. So, you know, oftentimes in a machine learning or statistics class, you're handed a perfectly cleansed data set. You're not actually asked to go out and acquire that data set and integrate it with your existing data. And so I find that's a core skill that isn't taught well. The second component is visualization design, particularly dashboard design. So once you've kind of collected or integrated a data set, the first thing you wanna do is see what's going on in there.
And so visualization is still, remarkably, not taught at a lot of universities, but even when it is taught, it's often focused on just the chart design. So we're trying to go beyond chart design and also go into dashboard design. So guys like Stephen Few make a lot of money by teaching people how to do this in seminars. And I think integrating it into the undergraduate curriculum makes sense. The third component of the course is on large-scale experimentation. So most large web properties have a sophisticated A/B testing infrastructure. And they're able to rapidly design new features and then deploy them to be tested. And so they define certain objective functions, they see how feature A performs against feature B on them, and they make a decision based on the data about what should apply. So we talk about what that looks like in practice. The fourth- Or simulations and stuff? What not? You know, standard hypothesis testing, which is often not taught as much in the undergraduate statistics curriculum as sort of distribution design. You know, things like t-tests are taught. So putting a t-test in context, what does it look like to actually deploy that? The fourth component of the course is on causal inference and observational studies. So the majority of data, you know, I had a non-linear dynamics professor who started off a course in college once by saying dividing the world into linear and non-linear dynamics is like dividing the world into bananas and not bananas. And that's kinda how I feel about experimentation versus observational studies. Everything is an observational study. It's very rare that you get to control the assignment of treatments to subjects. You're often essentially handed those assignments and forced to do as much causal inference as possible. And I've often found when people say "we found nuggets in this data," what they actually mean is they've performed some form of causal inference and they're able to say that if we do X, then Y will happen.
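The A/B testing step described above (define an objective function, compare feature A against feature B, decide from the data) can be sketched roughly as follows. This is only an illustrative sketch using Welch's t-test from the Python standard library; the function names, the sample numbers, and the cutoff of 2.0 are all assumptions made for the example, not anything stated in the interview, and a real system would use a calibrated significance level.

```python
import math
import statistics


def welch_t(a, b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    # Squared standard error of each sample mean.
    se_a = statistics.variance(a) / len(a)
    se_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, df


def choose_variant(a, b, threshold=2.0):
    """Pick a variant only when |t| exceeds a (hypothetical) cutoff."""
    t, _ = welch_t(a, b)
    if t > threshold:
        return "A"
    if t < -threshold:
        return "B"
    return "no clear winner"


# Hypothetical objective-function values (e.g. minutes on site per user).
feature_a = [10.1, 9.8, 10.4, 10.0, 9.9, 10.3]
feature_b = [11.2, 11.0, 10.9, 11.4, 11.1, 10.8]

print(choose_variant(feature_a, feature_b))
```

Welch's variant is used here because it does not assume the two groups have equal variance, which is rarely known in advance for a new feature.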
So I try and teach sort of emerging techniques. You know, guys like Judea Pearl in the late 80s and early 90s really made huge strides in how to do causal inference in observational studies. And those methods are just now finding their way into everyday social sciences. So I try and teach that to the folks. And then the fifth and last component of the course is on data products. So it's about, you know, oftentimes people know how to fit machine-learned models. But once you have that model, how do you deploy that into production, and then how do you set up a regular refresh cycle, and how do you evaluate the performance of that model once it's in production? So things like People You May Know. This is the classic cross-disciplinary trend that's required now. I mean, it used to be you're great at math, sitting in a chair, performing these functions, great. Now you really need to have this cross-discipline, especially in CS too, you know? Yeah, there's really a focus, I think, on staying, you know, hewing close to reality, staying close to the data. You know, when I first went down to Wall Street and worked as a quant, not that far from here, back in 2005, you know, my boss sat in a room with a whiteboard and a drawer full of papers. And that's how he did his job. Whereas today, I think that people who are really driving innovation on Wall Street are doing their job by, you know, gathering data sets and interacting with them in an iterative fashion using tools like R. So I think that we really over-rotated on complex modeling and under-rotated on, you know, data munging and data analysis. Yeah, and you added, I guess it's part of five, but you and your team are looking for people who've actually done it before and generated what you said is millions of dollars in value from this, which I guess is a component of data products. But what sports did you play growing up? I played baseball in college. I played a lot of baseball, too.
I never met anybody who actually spoke like you did on the baseball team. Yeah. My final question, and it was great to have you on theCUBE. This is fantastic. You can do a whole hour segment on this stuff, it's great. You know, obviously, SiliconANGLE's motto is "where computer science meets social science." So we love this stuff and we totally agree with you 100%. My final question is more around the corner. What do you see happening around the corner, okay? So let's just kind of walk through Cloudera's position: Hadoop's growing, data's now all the rage. Everyone's rocking the data and they're playing with it. Customers are growing, cloud's growing, mobile's growing, social's growing. So society's changing. What's going to happen with the data? How do you envision the future? Yeah, I think, other than closing the obvious enterprise gaps in terms of business continuity, high availability, security and encryption, those are the obvious things which are going to come down the pipe, I think that we're really going to add the facility to perform interactive data analysis and not just batch data analysis to the platform. And then I think there's going to be tremendous innovation in the data integration and preparation side. So I sat down with the largest customer of SAS, and they have well over 1,000 seats. And SAS, of course, being one of the most powerful analytical tools in the market, I was hoping to learn how they performed analysis. And what they told me was very surprising. They said, I would estimate that greater than 90% of our usage of SAS licenses is for ETL and data manipulation. So that, to me, is the huge problem to be solved in this space. I think that things like model selection are going to go the same route as access path selection in databases. So if you go back to the 60s and 70s, people were writing their own algorithms for joining data sets and then choosing between algorithms to perform those joins.
And then eventually, that just gets put into, that's decided by the computer today. And so I think the same thing is going to happen in terms of model selection with data analysis, where we're not going to be debating the merits of gradient-boosted regression trees versus my favorite, lasso, or something like that. All of those parameters of the variety of modeling algorithms will be handed to the computer, and the computer will make that decision. So the really interesting piece for human beings is going to be identifying new data sets, integrating those data sets with your existing data, and then performing the feature engineering task of what do we feed into this modeling algorithm and which problems do we point it at. And that then leads you upstream to the problem of, well, to generate more data, we need to instrument more of the world. So I think there's going to be a huge boom in instrumentation. So you're seeing this in things like the Quantified Self movement. So the more that we measure, the more that we can analyze. The future is really moving towards ubiquitous measurement plus data integration and preparation tools. All right, Jeff Hammerbacher, thanks so much for that insight, great epic talk here on theCUBE. Another epic conversation shared with the world live. Congratulations on the funding, another 40 million, it's great validation. And congratulations for essentially being part of data science and founding that whole movement at Facebook, and now with Amr Awadallah and the team at Cloudera, you've done a great job. So congratulations. Thank you very much. All the competition. Keeping you up, keeping you open faster. It's capitalism, right? All right, keep it in your arms. Okay, we'll be right back.
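The idea in the closing answer, that the choice between modeling algorithms gets handed to the computer, can be illustrated with a small sketch of automated model selection by k-fold cross-validation. Everything here is an assumption made for illustration: the two toy candidates (a constant-mean predictor and a least-squares line), the helper names, and the synthetic data. It is not a description of any tool mentioned in the interview.

```python
import statistics


def fit_mean(xs, ys):
    """Candidate 1: always predict the training mean."""
    m = statistics.fmean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Candidate 2: ordinary least-squares straight line."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def cv_error(fit, xs, ys, k=5):
    """Mean squared prediction error over k interleaved folds."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        model = fit(train_x, train_y)
        total += sum((ys[i] - model(xs[i])) ** 2 for i in test_idx)
    return total / n


def select_model(candidates, xs, ys, k=5):
    """Let the computer pick the candidate with the lowest CV error."""
    scores = {name: cv_error(fit, xs, ys, k) for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores


# Synthetic data: a linear trend plus a small deterministic wobble.
wobble = [0.3, -0.2, 0.1, -0.4, 0.2]
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + wobble[i % 5] for i, x in enumerate(xs)]

best, scores = select_model({"mean": fit_mean, "line": fit_line}, xs, ys)
print(best)
```

The human work in this sketch is exactly what the answer points to: deciding which data goes into `xs` and `ys` and which question to ask, while the comparison between candidates is mechanical.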