 Hi everyone and welcome to the Big Data Deep Dive with The Cube here on EMC TV. I'm Richard Schlesinger and I'm here with Tech Industry Entrepreneur and Wikibon Analyst Dave Vellante and SiliconANGLE CEO and Editor-in-Chief John Furrier. And we're talking about those folks who really break ground in the big data area, the big data pioneers. So welcome to you both, thanks for coming by. It's always a pleasure to see you. There are a lot of pioneers out there because this is a big developing industry with a lot of money to be made. And I guess we're just beginning to imagine the potential of this space. And you know, I mean, you guys have been around talking to pretty much everybody. What are you hearing about the real pioneering efforts going on? Well, I think the pioneers are driven out of the open-source software movement, and that's really where I'll start on this big data space, because in the technical world the real action in databases has been the ability to store data, and that's the core issue. And from there the software developers, and now the PhDs and the math geeks with big data, have really taken it to another level. So you're seeing at a technical level the programmers, and then on the business side the practitioners who are putting it to work. So it's really kind of a combination of two sets of folks. So Dave, where are we in this pioneering era? I mean, are we at the beginning, do you think? Do people understand the limits, or is the sky the limit? I think the sky is the limit. It's the first inning, Richard, if I can use the baseball analogy, and I think John's right. 
I mean, you've got the alpha geek world and the business world colliding in a really interesting way, and what the alpha geeks are doing is solving the problem of how do I take all this unstructured data, all this data exhaust, and actually analyze it and bring structure to it. And then the second dimension that we hear from practitioners in the Wikibon community is how do we then create value out of that data? How do we package it? How do we monetize it? How do we distribute it and put it out there? You know, the numbers of terabytes and petabytes and other kinds of bytes that they're just making up names for, it's astonishing. I mean, we're talking about numbers that people haven't really experienced before, encountered before. So at EMC TV we produced a video to give a sense of the scope of the challenges and the opportunities of big data. I'd like to show that to you. Take a look. Overall the data market just continues to explode. In the past people might have thought about, you know, five or ten terabyte environments as big. When we talk about big data now, we're having customers that are putting up multiple petabytes of data that they're trying to bring together to analyze, and increasingly it's not just structured data but also unstructured data as well, with multiple data sources and multiple data types. If you look at any company, probably the preponderance of data that's managed within any organization sits today in siloed data warehouses. Service has their own, support has their own, sales has their own. You have repetitive, redundant amounts of data scattered around, inaccessible to anybody except for those who know where to look and know what to ask and know how to go about it. So the genesis of Greenplum was tackling that problem, really hiring the world's best database experts here in the Bay Area and getting really excited about going after the innovation in this space, leveraging commodity hardware. 
Greenplum's technology was at a point where we could leverage a lot of the emerging trends that were happening in computing, so that we literally could make the process of storing and analyzing data ten times faster and more cost effective. And anytime you can do something that's not incremental, 20 or 30% better, but you can go to a customer and say, look, this is going to run maybe 10 to 50 times faster than you can do it today, it's a complete mind shift in how they think about storing and using data. Greenplum allows you to not only virtualize the data warehouse instance, for example, or federate all of the data warehouses, but it provides a common interface or common language for humans, by and large, to be able to ask their question in real time, and essentially to make sure that that question is addressed to every piece of data within the organization that's within the Greenplum domain. And so it's really exciting for us to be here. If you think about how business is going to evolve, business is going to evolve from being data-centric inside the enterprise, solving the problem that you have today, how do I get access to all this information, to realizing that the data I have is actually valuable outside of this organization. Businesses are spending a huge amount of money on their infrastructure to get and store this data. That's already a spent investment. Now just being able to analyze it allows them to get business value from the data they're already creating at a tremendous rate. There is this idea that knowledge workers are generating information that can be captured on a real-time basis. And you could capture all of that that's coming in and then use that potentially to serve up another data mart, or parse through that as well. So we see across sectors that information is becoming sort of the new business battleground or the new currency to sort of drive how effective a business will be. 
I love this notion of information being the new battleground in business. I wonder, you know, how many CEOs understand this idea or understand the explosion that's either happening or about to happen in big data? I think increasingly they're starting to understand, and they're putting forth mandates to their organizations to try to identify ways in which they can use data as a source of new value. I think people have looked at it for a long time as a lot of wasteful effort around making stuff work that didn't give differentiable advantage. And I think that's now changing with big data. People are looking at how you can write algorithms and get new data sources. And as I was saying before, package things in a way that does bring competitive advantage. Yeah, CEOs have had a real big moment with big data. And that was the iPad. When the iPad came out, executives could actually hold an iPad and see what real-time instrumentation or analytics can do for their business. And they said, I want this product. So it forced IT to design for the iPad. And so big data really helped that. And on the consumer side, the iPhone really was the beginning of what has become obviously a consumer market with big data, whether it's social media or something else. But it's almost like a light went on, you know, when they realized we have all this information anyway, if we could figure out a way to analyze it, apply analytics, as you guys like to say, to the data, we could, you know, we could be sitting on a goldmine here. Well, not only the goldmine. As Dave mentioned about top-line revenue versus the cost-cutting angle, it's really become a competitive advantage. So the smart CEOs out there are looking at the big data opportunity as a competitive advantage and a way to re-architect their businesses using big data, to actually have an advantage across the board in every part of their business: hiring, supplying customers with products, collecting money, all kinds of things. 
So it affects all aspects of business. Absolutely. One of the things we've been tracking at Wikibon is that it's not trivial from an organizational standpoint. You have to organize to create business value out of this. So we live in a world of silos. We heard that on the video from EMC TV. And I think they're going to have big data silos. I mean, that's just the world we live in. But the key is, how do you organize data so that it's a resource that all these silos can take advantage of? And everybody's talking about big data now, too. So there'll be more. I guess you can't really talk about big data for too long without talking about Hadoop. And I know that you guys have been around that a lot. Talk to me about that. Well, Hadoop is exploding. And the amazing thing to me, John, and you've interviewed Doug Cutting a number of times, is that it came out of a research paper that was written by Google. And Doug Cutting read it and said, wow, I could apply that to a commercial open source enterprise type of operation. Okay, so let me do this. Hadoop for idiots. Just one sentence. What does Hadoop allow you to do? So back in the day, if you had a big data problem, what you would do is you'd buy a big box, you know, a UNIX box. And you would put as much data into that box as possible. And that box was your data temple. Well, when Google had to index the web, it realized, well, we can't put this into a box. We can't put it into a structured database. So we have to leave the data where it is. Petabytes and petabytes of data, leave it where it is instead of trying to force it through a big pipe, and just bring five megabytes or gigabytes of code to the data. And that really is the concept that set Hadoop off, right? Yeah, I mean, the bottom line on Hadoop, if you want a bumper sticker, is this. 
It's the ability to store massive, massive amounts of information that you don't know what you're going to do with yet, but store it in a way that you can get at it really fast in a very low-cost way. And that fundamentally requires a different technology. But that's the game changer of Hadoop. The ability to store at a very, very low cost and get at it very, very fast. And so you are a big fan of Hadoop and the man who created it. You recently talked to him; why don't you tee that up? This is an interview that we did with Doug Cutting, the founder of Hadoop, which is now open source with Apache. I was with Cloudera for a year and a half in their offices in Palo Alto. And it's my interview with Doug Cutting. Hope you enjoy it. The father of Hadoop. And a rock star. He's a rock star. Yeah, he's a rock star. Okay, we're back here at Hadoop Summit 2012. I'm John Furrier, the founder of SiliconANGLE.com. I'm joined by my co-host for this segment, Jeff Kelly, the lead analyst at Wikibon.org on big data, the best big data analyst on the planet. Obviously Dave Vellante can't be here, Jeff. So you're going to sub in for his spot. I'm playing Dave today. I'm super excited to be here because, one, I love this ecosystem of Hadoop. And it's just a lot of the friendly faces that we've seen over the years. And our next guest, Doug Cutting, is one of those friendly faces from my time when I was sitting in the Cloudera office, where Doug would come in a couple times a week. Doug, welcome back to the Cube. John, good to be here. You've been on many times. You're the founder, co-founder, or inventor of Hadoop, as you're now being known, as a celebrity. I knew you when you were just a cloud, you know. I wanted to get your perspective on the future of Hadoop. You've been involved from the beginning. You're in the community. You're at Cloudera. What's going on? 
I mean, what's your view right now of Hadoop as it is and where is it going? We're seeing tremendous growth. We're seeing industry after industry start to realize that this is a way that they can improve their businesses. They have data that's passing through their hands that they can benefit from if they could get a handle on it, if they could save it and analyze it effectively, and Hadoop can help them with that, can provide them the tools. So it's pretty exciting to see that, and, you know, the projections are huge. Can you talk about some of the dynamics going on right now, because obviously the environment's changed? I think there's so much to be done that the only way to really do it effectively is to collaborate. We listen to what our customers want and try to make sure that we're making them happy, and not look to competitors. I mean, in the basic platform, we're adding new, really needed features. The high availability stuff makes the real-time nature of HBase as an online store useful, if you can rely on it being up 24/7, and now you can with the current releases. So beyond those real fundamental core layers, there's still a lot of fit and finish work at the outside: making it really easy to incorporate new data sets, to visualize results, to deploy and monitor these clusters. All these things need a lot of work. It's a young technology still and it's getting more mature. There's a number of distinctions between HBase and other quote-unquote NoSQL data stores, but I think the key advantage of HBase is just this degree of integration with the rest of the Hadoop ecosystem. So I think that's something I look for when we're sort of trying to figure out what are the next major components to join the ecosystem: how well can they integrate with what's there already? Because you want to make things seamless, you want to make moving from one tool to another as easy as possible. 
You don't want to have to be importing and exporting your data, you want to be able to access it natively from one tool to another. So that's the direction I think we really ought to be pushing the ecosystem. Tell the folks out there, Doug, now that you're a big-time celebrity and getting bigger every day, and you're tall too, what you're working on right now, I mean primarily in terms of your focus, and what you're excited about right now. You know, I've got three things that I tend to spend my time on. I'm the chairman of the Apache Software Foundation, so Cloudera donates my time, you know, roughly a third of my time, to volunteering at Apache and trying to keep things running smoothly there as best I can. I do a lot of work as a spokesman for Cloudera and for Apache, so I spend time out on the road talking with folks, you know, and if you spend a day on the road, there's days on either side preparing and recovering from that, so that's a big time sink to do that work. And then I'm still working on code, still, you know, still hacking. So I wonder, you know, Wikibon just put out a report around kind of the enterprise readiness of Hadoop. So if you could, what are maybe the one or two key areas of improvement that you've seen over the last, I don't know, six months to a year around making the system, you know, ready: uptime, security, ease of use? What are kind of the key barriers? Well, I mean, Cloudera's been working in lots of areas, contributing to lots of projects, building commercial products to help folks run Hadoop in production and make that really seamless and smooth and easy. In the community at large, I think probably the largest single thing is the high availability in HDFS. We'll probably see you at Hadoop World in New York, but between now and that event, what's your key goal and how do you see the Hadoop ecosystem in your preferred future? What is the Hadoop ecosystem going to look like? 
I mean, I just see it's really trying to fulfill this promise that's out there. People have these great expectations, and so we need to meet them. We need to meet the customers, the users, find out what their problems are, how this isn't working for them, and make that happen. You know, we've got Hadoop 2.0, CDH4, it's out in the field, Cloudera released that last week, and I think over the next six months we'll see widespread adoption of that in production, and that's very exciting. See HBase exploding? HBase is going to continue to explode. I think the 2.0 stuff really helps HBase a lot. There's a whole lot of performance work that went into HDFS and MapReduce that we'll see the benefit of. Yeah, HBase is just incredible. It's taken off and we'll see what we've got. It's fun to watch. Well, thanks for all your help on the Cube. You've been a great citizen. You've been great to come on. We love having you on. We knew you way back in the day, and also Cloudera has been a great supporter of my mission at SiliconANGLE, and Mike Olson and Amr have enabled that, and you guys have been very good on that, so I want to thank you for that. Doug Cutting on the Cube, we'll be right back with more news after this break. So Doug Cutting, one pioneer, one rock star, if you will, John, but there are others out there who you've talked to. Who else do you have? Yes, on the pioneers: at EMC World, we interviewed Alpine Data Labs. Now, here's a company that's basically taking all this unstructured data and bringing structure to it and allowing you to basically work on a large corpus of data very quickly and easily. So let's take a look at my interview with Steve Hillion of Alpine Data Labs. Okay, we're back. This is the spotlight on data science and big data. This is Dave Vellante of Wikibon.org, and I'm here with my co-host Jeff Kelly, and we're live at EMC World 2012, and the theme of the show is transformation. 
We talked a lot yesterday about cloud being the IT transformative piece, but really it's data that is transforming business, and it's all about packaging information, monetizing information, getting value out of information. That's the business transformation. Of course, there's also a big transformation of skill sets, and EMC is talking a lot about that, but today we're talking about really the business impact, the data, the value of the transformation. We're here with Steve Hillion, who's the Chief Product Officer of Alpine Data Labs. Steve, welcome to the Cube. Dave, nice to meet you. Great to have you on. And so, Chief Product Officer, I asked you off camera, is the product data, and you said yes. Talk about that a little bit. The product really, as you were saying, is sort of the insights coming out of that data. I think increasingly organizations are either turning their data into value or worrying that they may not be doing that enough. They've got these mountains of data that are piling up, all the traditional data that they've been getting out of their transactional systems, but now increasingly machine-generated data, web behavior, just piling up, and the sense of how do we make the most value out of that? How can we use that to really understand our customers better? And that's really what we're about, so turning that into deeper analytics than may have traditionally been done in the past, getting real value. So what does Alpine actually sell? So we sell an application that allows you to do predictive analytics on really large quantities of data without having to set up a massive new infrastructure. In fact, what you can do is you can download our product literally off the web, point it at the source of your data, typically going to be sitting in a relational database like Greenplum for example, and then just start doing predictive analytics. Okay, but so do you have the capability to essentially build that model on the fly within the database? 
Yeah, that's right. That's exactly what we do. Sounds like magic. Well, it certainly took a lot of hard work, and in fact I can't claim all the credit ourselves. This actually came out of a lot of early work that was being done at companies like MySpace and Amazon, and Greenplum itself actually, and also some academic work that was happening at Berkeley, under the leadership of Professor Hellerstein there, Joe Hellerstein, who's sort of the expert. So talk a little bit more about why in-database predictive analytics. I mean, what's the real appeal from a value prop standpoint? Well, I think a big thing for me that sort of really inspired me when I first heard about Alpine, got involved with them at the founding and decided eventually to join the company, was that they just made the whole thing so much easier. So I had been involved in many analytics projects, really for the last decade, where the process of getting the data and refining the data, building the models and iterating and so on, it wasn't even iterative, right? I mean, it's just like highly waterfall, highly static, sort of one-shot model development. It's like, I hope this works, I hope this is the right data set, and if we need to go back to the well it's just too painful. And what Alpine was doing, working with Greenplum, because it was actually an early spin-off from Greenplum, was going into customer sites and just saying, point us at your data and we'll find something interesting, like this afternoon. And so, instant ROI. Instant ROI. I mean, I remember the first time we used it, so this is when I was working with my data scientist team, actually using Alpine. I loved it so much, I joined the company. I went into a telco which had no data scientists, right? 
They'd never done churn models or advanced analytics before, and literally by the end of that day we just took the source of their data, did the analytics directly where it sat, and we had churn models, pretty decent, maybe not production level, but pretty decent churn models that they could actually use. You can see things immediately, trends that you could act on, you're saying. When you go into customer situations, what's the mindset that you typically find? Do you have to do a lot of evangelizing and education to get people to understand this? You do. I mean, that's a real challenge for us because there is a certain mindset around it, but on the other hand I think people get tired of the existing infrastructures and the existing paradigms, and one trick that we've learned is, if you can go into an organization and demonstrate value quickly, just go after a small problem. We've been doing a lot of work closely with the data scientist team at EMC Greenplum, and one of the things they've gotten really good at is just going into a customer, as part of a pre-sales activity or an early engagement, maybe they've just gotten the infrastructure, you know, Hadoop or the Greenplum database, up and running, and going in and saying, okay, let's do something, right? Let's prove value quickly. A quick win, and then the light bulb kind of goes off and they see the power of what predictive analytics can do. Steve, that was great, I really appreciate the insights. You're going to the Data Science Summit, are you going to be there? Yeah, I'm super excited about it. I'm going to be speaking tonight at the opening session and I'm going to be interviewing a really cool team of really prominent data scientists and people involved in this community about, like, how do you build that dream team of data scientists, so that's going to be a really fun discussion. 
We've had several on here today, Jeff's interviewed someone at SFI, and so, well, congratulations on the company and where you're at. Very exciting space, you're at the heart of it. Steve, thanks very much for coming on the Cube. I love this idea of instant ROI, I can just hear the CEOs' ears perk up at that one. There are a lot of pioneers out there and you've been talking to a lot of them. I really appreciate you sharing your conversations with us, and it's always great to hear you and your insights and everything, and we'll be back chatting with you some more, but we'd like to thank all of the audience for joining us at this point. There's more to come on big data deep dives, so be sure to stay tuned to the conversation with my new best friends at the Cube right here on EMC TV.