Okay, we're back live here in Silicon Valley, in San Jose, California, for Hortonworks' Hadoop Summit 2012. This is ground zero for big data innovation. All the alpha geeks and tech athletes are here. It's a developer-oriented event where Hortonworks is putting on tons of tracks around data science, business intelligence, analytics, the future of Hadoop. And I'm joined by my co-host Jeff Kelly and the CEO of Hortonworks, Rob Bearden. Thanks for coming on theCUBE. And this is your show. Welcome to theCUBE.

Thank you, great to be here.

Tell us about why you guys are running the show, and what's different about this versus, I mean, Hadoop World, which is now run by O'Reilly and Strata.

Sure. Well, I think actually a lot of the objectives are the same between the shows, and they're separated by six months roughly. And a lot happens in the big data world; it almost redefines itself every 90 days. The objective of this show is multifaceted, but our key intent is that this is a show about big data and Hadoop for the community, by the community. What was very, very important to us during this process was that it not be a Hortonworks infomercial, but a strong value and knowledge transfer within the community. To do that, we actually designed the whole summit process by bringing in key people from the community, high-influence people who have different perspectives and sit at different layers in the data stack: some at the OS layer, some at the DW layer, some at the integration layer, some at the apps layer. And we formed a committee that was chartered and tasked with defining the objectives of the summit and establishing the agenda.

So I did a tweet yesterday on the news I saw on the BBC around Linus getting an award. In the quote on the BBC, talking about Linux, the point was that trust is everything, and in open source that is a key mandate. You have a background, obviously; you have some history in open source. Your strategy here at Hortonworks is awesome. So talk about trust in this new environment where it's so rapid-pace. Unlike Linux, which kind of hobbled along but had a clear leader, this marketplace is moving at Mach 100. Trust is a key factor, and you guys have a specific strategy around kind of that Red Hat-like view. So talk about that.

Very much so. To recap the strategy, what's very, very important is that we take Apache Hadoop, which is a series of modules and technologies, and bring those together and productize and package them, so that we can enable Apache Hadoop to be an enterprise-viable data platform at scale. And what's vitally important is that we're very, very open and transparent about how we do that: from a roadmap standpoint, in how we construct the packaging, productization and delivery, and in how we enable the ecosystem to adopt it. As part of our core strategy, all the work we do to make Apache Hadoop easy to use and consume, and an enterprise-viable data platform at scale, we believe it is vitally important to do 100% as open source in Apache. And then HDP is our distribution, which is, again, 100% open source, all Apache. The reason that's important is that we must make the market function. Today there are multiple distributions with varying degrees of support of, and compliance toward, Apache and open source.
In some cases they are, but there are critical holdbacks; without that functionality, it's really not an enterprise-viable platform that they can run at scale.

So I want to drill down on that, because there's some nuance in the open source. For the folks not in the open source community, talk about that holdback. I want to ask you to explain what that means: Apache, the donation, and then the holdbacks. And two, how does your Hortonworks Data Platform, HDP, differ from CDH from Cloudera? Right, so talk about the holdback and then-

CDH and HDP are both working within the Apache Software Foundation. What we believe is very important for Apache Hadoop to become an enterprise-viable data platform is that it must be easy to use and consume. So all the productization and packaging functions that have to happen for any enterprise software, those same principles must exist within our distribution, and again, all open source. Things like monitoring, management and provisioning are some of the critical aspects the operations organizations in an enterprise need to interoperate, and to put Apache Hadoop into production and maintain it at scale. Things like disaster recovery, data recovery, high availability. All of those core enterprise functions are the things that we're very, very focused on within the community, and on getting back into our distribution through Apache. With that, we believe we can then evolve Apache Hadoop into that enterprise platform that can be adopted by the masses at scale.

So what's the risk, I guess, of holding back some of those components, different proprietary components in different distributions? Is it simply that it's just not going to be able to reach that scale of adoption? And why specifically do you think that?

Well, it's risk-reward. In some cases, it may or may not be a risk. What we think is important is making Apache Hadoop an enterprise-viable platform standalone. That de-risks adoption for both the ecosystem and the enterprise, so they don't find themselves in a position where they only get the enterprise functionality if they're paying for it, or paying for proprietary functionality, and then end up in an awkward position, maybe as they are with Oracle or some of the proprietary platforms. Again, it goes back to a philosophical view from our standpoint: if we make it an enterprise-viable platform through open source and Apache, we de-risk it for both the ecosystem and the enterprise, and the adoption and the market can function much faster and at scale.

You guys have been pretty transparent about the shortcomings of what Hadoop has, and you're racing to fill those enterprise-ready needs: high availability, et cetera. We talk to a lot of customers who don't know the whole Hortonworks, Cloudera, open source story, but they recognize that it's happening; they just don't know what's going on inside the community. Their version of scale is completely different than what the capabilities are. Obviously, batch right now is looking good, but near real time is kind of the buzzword everyone's talking about; on OLTP and other environments, for big banks and big data centers, it's not ready for prime time. That's pretty well documented.
So what are you guys doing? Obviously you're partnering, you're racing to fill the holes. Talk about that, and then talk about the needs of the big players like EMC, like HP, like IBM. They're dealing with their customer base, who are dipping their toes in Hadoop, doing some stuff, but not moving it into production and mission critical. So you've got that balancing act: you're running like the wind to mature that product, you've got to fill those holes, and at the same time the market wants scale solutions. So talk about that.

And scale can come in different flavors, whether it be batch analytics or, to your earlier point, real time. What's important is, first of all, we must stabilize core Apache Hadoop and ensure it can be viewed as an enterprise-viable platform. From there we can begin to build out for the other use cases, and address those use cases through the ecosystem, initially by making it very, very clean and easy for those ecosystem providers to integrate and optimize their functionality or service. And you do that by creating reference architectures that are deeply engineered together, so it's transparent to a customer who's trying to enable, for example, real-time analytics, to do that with a caching layer and/or HBase deeply integrated within Apache Hadoop. So you get a closed-loop effect of a data flow. That creates real-world value within the use-case objective that's there.

Can you talk about your experiences in your past life with Linux, for example? There was the same thing going on, where it was evolving really rapidly and then the enterprises wanted it at scale. What was the process, and how does that compare and contrast with Hadoop? How did Linux cross the chasm, to use a-

There's a dynamic here between open source, community-oriented software development, and commercial-grade deployments. What's vitally important, as a principle of open source, is that it reach critical mass from an adoption standpoint. To do that, the technology must be relatively at parity with other closed-source options, or it must be innovative in its own right and solve very fundamental core technical issues. But it also must be viewed as having the ability to go to production, in terms of stability and viability as a standalone technology, and as having a company behind it with the ability to support it at the same levels of SLAs that proprietary enterprise software vendors do. And that's what's important. In order for Apache Hadoop to reach that critical mass of adoption, it requires both the hardening of the technology and ecosystem adoption, pulling it into the reference use cases where the enterprise can see how they gain value back from managing this whole net new data set that they've not had the ability to manage or derive value from to date.

So let me ask another question. At the end of the post I wrote this morning: let's be clear, infrastructure and platforms are really important right now in big data, because it's hard to do. So you have two dynamics going on in this marketplace. One, at the infrastructure level, there's a lot of action going on: you've got virtualization, you've got all kinds of changes, all the OS guys, HA at the infrastructure level. Agreed, it's insanely great, in a good way.

And complex.

And complex, but opportunity-wise it's a whole converged infrastructure; we've been covering that like a blanket.
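To picture the closed-loop pattern Bearden describes above, where batch Hadoop jobs land derived results in HBase and serving applications read them back with low latency, here is a minimal, illustrative Java sketch using the HBase client API of that era. The table name, column family and row key are hypothetical, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ClosedLoopSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical table, assumed created with column family "d".
        HTable table = new HTable(conf, "user_segments");

        // Write side: a nightly batch job over raw logs would emit rows like this.
        Put put = new Put(Bytes.toBytes("user#42"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("segment"), Bytes.toBytes("high_value"));
        table.put(put);

        // Read side: a serving application fetches the same row in milliseconds,
        // closing the loop between batch analytics and real-time access.
        Result row = table.get(new Get(Bytes.toBytes("user#42")));
        System.out.println(Bytes.toString(
            row.getValue(Bytes.toBytes("d"), Bytes.toBytes("segment"))));

        table.close();
    }
}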
So that's hard, and there's a lot of action, and that's not off-the-shelf development; it's some real tech involvement. HP and the big guys are cracking that code. So you've got to play in that world. Then there's the other side, the end game. As I said, the end game in the big data market is about analytics and applications.

Well, that's how value is derived from it, right?

Yeah, so you've got the hard stuff that all the plumbers and the systems guys are working on, and you've got the real value, which is going to be the disruptive enabler: analytics and apps. Mike Olson talked about this extensively at the Hadoop summits, the application market. Even Accel has a big data fund around that. So the question is: what tools do you guys have in HDP around slicing and dicing of data, and how are you going to enable the entrepreneurs out there to get behind you guys?

Right. So a couple of things. What we don't want to do, and it's part of the core strategy, is go beyond the core platform. Our mission, role and value to the community is to make Apache Hadoop an enterprise-viable data platform. And as part of that, we have to enable the ecosystem to adopt it en masse, at scale. The first step is that we de-risk it along two dimensions. Number one, we harden it and make it enterprise viable. Number two, we make it incredibly open and, within our distro, 100% compliant with Apache, so adopters are de-risked from being painted into a corner with proprietary functionality. But then the third dimension is that it's vitally important we have very, very clean, open interfaces they can use to integrate and optimize their functionality or service, whether it's at the OS layer, the integration layer, the DW layer, the tools, the apps. It would not be smart for Hortonworks, from our view anyway, to try to go have that analytics or apps functionality and isolate or disintermediate core parts of the ecosystem. We would much rather make Apache Hadoop a clean platform, so they can take the existing platforms and tools where they create value for their customers today, probably on structured data sets, integrate and optimize, and easily extend the power of Hadoop to bring in these unstructured, multi-structured or semi-structured data sets, extending the value of their existing platforms to include Hadoop and all these new data sources while keeping it transparent to their users. It gives them the opportunity to, one, create value and, candidly, re-monetize their existing platforms and tool sets. And that will help accelerate critical-mass adoption but, more importantly, increase the value creation Hadoop is doing as an extension of the existing platforms and services already in place.

We see the application as the complete disruption, and in fact the quote was, you're putting the big guys on notice with that approach, because the application is where the value is created, as you said. But the question Jeff and I always talk about is, okay, the business model. Is it just services? I mean, come on, tell us; it can't just be that. What's on your mind on the business model? Just 100% open source, and you're going to extract rents from that how?

Support services.

To the big guys?

Support and training. Both to the ecosystem and to the enterprise.
So as the ecosystem providers integrate and optimize their platforms and redistribute HDP, they're going to want to ensure they have level-two, level-three support, so they can meet all the enterprise SLAs they're committed to within this new data set they have to manage for their customer base. And we're seeing great evidence of that already. And then there are the enterprises, which in many cases are using Hadoop to transform their business models.

I see; that makes a lot of sense, and you guys are well funded. I see Benchmark, and you have experience with those guys. Great firm, tier one. Who else has invested in you guys?

Index.

Index. So you've got a lot of cash; you don't have to worry about money in the short term. And you can make money off the service. And you guys are going to be transparent about any business model changes going forward?

Absolutely. We are absolutely committed to that, and to be very clear, it is not smart for us to try to go up the stack and isolate or disintermediate that ecosystem. In fact, to the contrary, we want to make it incredibly easy for them to integrate, optimize and leverage Hadoop across their platforms.

I want to talk about, you mentioned, helping customers leverage existing investments in database technology and applications, integrating Hadoop into that environment. So what is your relationship, and how do you balance working with, some of the more traditional database vendors, the Teradatas, IBMs, others? On the one hand, you're helping customers derive more value from those systems they've invested a lot of money in over the years. On the other hand, Hadoop, if it reaches its full potential, could potentially be a disruptor to IBM's business, to Teradata's whole business, to Oracle especially. So how do you balance that? How do you balance helping customers today use your technology to drive more value from existing investments with, hey, someday we want Hadoop to be the de facto platform, and potentially we'll see the relational database go the way of the dodo?

Yeah, well, first of all, let me sort of hit that in reverse order if I can, and your points are well made. I do not ever see Hadoop as a platform disintermediating relational technology. In fact, to the contrary, I just, I don't-

HCatalog actually is filling that gap right now.

Exactly. But with that, let me start at the beginning and go back to the beginning of your question. Different platform providers view Hadoop in different lights. But in any case, what we're not here to do is to try to disintermediate Teradata's platform, or the IBM platforms, or the HP platforms. In fact, those platforms are today managing a much different data source than the data Hadoop is managing. Hadoop tends to be managing the unstructured and the multi-structured: machine-generated data, social data, geospatial data. And what typically happens is that for the traditional platforms, it's not pragmatic, either architecturally or financially, to store that volume and/or type and structure of data.
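The HCatalog point is worth a quick illustration. HCatalog exposes Hive-style table definitions to MapReduce and Pig jobs, so a Hadoop job can read a warehouse-like table by name rather than by file path, which is the gap-filling role mentioned above. A hedged Java sketch follows, using the org.apache.hcatalog package names from the HCatalog 0.4 era (they later moved under org.apache.hive.hcatalog); the database and table names are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;

public class HCatalogReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "read-page-views");
        // Point the job at an HCatalog-managed table; mappers then receive
        // schema-aware HCatRecord values, with no HDFS path handling required.
        HCatInputFormat.setInput(job,
            InputJobInfo.create("default", "page_views", null)); // db, table, partition filter
        job.setInputFormatClass(HCatInputFormat.class);
        // ... configure mapper, reducer and output as in any MapReduce job,
        // then call job.waitForCompletion(true).
    }
}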
And so our goal is actually to work with those platforms so they have the ability to leverage Hadoop transparently for this net new data structure, and extend the investment and value they're creating for their customers to also include these new data sets, leveraging Hadoop and the benefit of the bargain of Hadoop: commodity-scale processing and storage capability. So now they can go to market at a much more aggressive price point per terabyte of data, manage and combine sort of apples-and-oranges data structures, and create transparent value very, very rapidly for their customer base. And when the value they create for their customer increases, their opportunity to monetize also increases. So if we do our job correctly, we give them the ability to bring Hadoop into their customer base. It actually makes them more defensible with their customer.

Well, not that I'm going to point out anything that's not obvious to everyone in the analyst community, but you guys really are disrupting existing tools. What I found interesting was the VMware relationship: high availability on pre-existing virtualization. And VMware did this, by the way, with the server vendors, if you look back at what they did. So that's really good for you guys, because you're taking a market that already exists; you're not trying to reinvent. So you're making the market on the Hadoop side, which is great, I get that. Now you've got to come in and deal with pre-existing technologies. How do you walk that line? Because you're disrupting that market.

Well, actually, what we want to do is leverage the stability of that market. Why go create brand-new HA services within the stack? It's better and cleaner to leverage the HA services at the VM layer as well as at the OS layer, and what you'll see is us very, very quickly come back and announce deep reference architectures and integrations leveraging those services for Windows and Linux environments at the OS layer.

The other thing is that markets are made by lower-cost solutions that deliver value. So we've been following the data warehouse business, the Teradatas, and there's an old school there, right? The old-school guys, long in the tooth. And EMC and these guys, they're making a ton of cash on these solutions, and the customers have huge investments. In comes Hadoop, in comes HBase, you've got Mongo. You can actually do filers and stuff, very low cost. So that's super disruptive. Again, that's another disruption. How are your conversations going in the customer base and with the vendors you're trying to do partnerships with? Talk about the partnership vendors first, like the big data warehouses. "Hi, I'm Rob. I'm here to disrupt your business, work with me or you'll die"? I mean, tell us, how does that go?

Well, differently; in some cases really well, and in some cases horribly. Let me give you a couple of examples. Teradata, they really get it. They understand the importance the enterprise places on bringing in all these other data sources and types. And they realize that it is architecturally and financially not pragmatic to do it within their existing platforms. It's a natural extension of the Aster platform to leverage Hadoop, as well as a natural value creation to take Teradata classic and extend it to support Hadoop, bringing both data types together transparently.
They absolutely get that. They are aggressively engineering at the lowest levels of the integration points between HDP and both Teradata and Aster, with some very, very high-value solutions.

Do you see these guys thinking along those lines? The old expression goes: cannibalize your own before someone else does, or eat your own before someone else does. Obviously, with lower-cost value, the value shifts somewhere else. So if you're a big IT player like Microsoft, obviously partnering with you guys and eating their own dog food with Azure, you're going down that path because I think they see the IT wave a little bit differently. So what's your view on that? Do you agree? Do you think the big guys start shifting their value?

Well, let me give you what I believe they see; I can't speak for any of them, but our observation is that what you really have to make sure you philosophically view and believe is that this is a whole net new data set and data type that today is predominantly not managed by any of the platforms. And according to IDC numbers and Forrester numbers, in the next two to three years anywhere from 50 to 80% of the total addressable data in the enterprise will be in these currently unmanaged multi-structured or unstructured data sets. So they see that as an opportunity, in some cases a threat, but more an opportunity: if they manage this well and create value around it, not only is it a chance to extend their value to their customers in a way that's not monetized today; more importantly, it can add great value to the existing portfolio of products they're already leveraging against structured data sets. So they see it as a very, very big net new opportunity.

And it's a perfect storm if you think about it. We were just in New York City with Intel around the Open Data Center Alliance, where essentially they're trying to understand the disruption of the data center. What I walked away from that show with was that cloud really isn't happening at the scale people thought, because the big investment in data centers, server consolidation, all that stuff, happened over the past 10 years. They're kind of already cloud-ready, and they're not going to move everything into the cloud. But big data is a little bit different; it's totally disruptive. So the question is: will the cloud really enable more big data? And that brings up the skill-gap problem. Right now, you've got data science here, a whole track dedicated to data scientists. It's just that the tools are difficult. So-

And they're very immature.

Absolutely. So how are you guys looking at that? Also, you want to pump up the ecosystem and invest in that, on both the code side and the skill side. So talk about your strategy and your view of how you get more people involved.

So, let me make sure I address something head on. What's not smart for us, in our view, nor could we do it even if we wanted to, is to hire enough people fast enough. There's a huge gap in skill set and domain out there. But we cannot hire enough people fast enough to create a services business that actually moves the needle for enough enterprises to help in the adoption of Hadoop. The services business has the same challenge. There aren't that many-

As Shaun Connolly says, ringing doorbells, one at a time. You're in a pull business.
We have to create pull models. So our strategy is not to focus on traditional professional services but, more importantly, to focus on knowledge transfer, specifically around training. Go build great content, and we're getting good at that, getting a cadence about it and evolving it, and then knowledge-transfer to the big SIs, so that they can go out to their customer base and leverage that knowledge transfer to Hadoop.

If we could just back up for a second. You've mentioned a few times that transparency to the end user is very important. Over the last year or two we've seen a lot of the MPP data warehousing vendors and others create connectors, so you can move data between Hadoop and Vertica or Teradata or wherever it might be; there's a sketch of that pattern after this exchange. So from a long-term perspective, is having separate systems a viable strategy? And talk about the transparency from a technology point of view: how are you doing that? Are you actually achieving that transparency?

I think that's what Hadoop has thrown into question now.

Right. In the discussion this morning, the keynote, that's what's happening with the big shift. When enterprises look out 12, 24, 36 months, they look at what their data architecture looks like: what is the volume of data, where's it coming from, where can they get value from it? And they're realizing that the vast majority of that data is going to be unstructured, and they need to manage that: store it differently, manage it and process it differently. So they're revisiting what their enterprise data architecture looks like, how it's constructed, how they move data through an application lifecycle, and where Hadoop fits into that. And it's our job to make sure that Hadoop can be that enterprise-viable, stable platform, but also be very, very open, so that it can interoperate with all the other existing technologies, platforms, solutions and investment already in place, and allow them to transparently interoperate at scale and get the efficiency back out of that, without any threat of going down a proprietary route.

So I've got to ask you about, obviously, because we're big Cloudera fanboys, they were the only one when we started covering Hadoop. You guys are really the second big commercial, venture-backed startup that's growing the ecosystem. Obviously you have different approaches. The press release, which I commented on yesterday, wasn't dissing the competition, but you're clear about your positioning. Just for the record, talk to the folks out there about Hortonworks and Cloudera. There are two different approaches. Why Hortonworks over Cloudera?

Yeah, well, again, I've been very, very consistent and very transparent all along: our job, and what we remain committed to, is making Apache Hadoop an enterprise-viable, stable platform, easy to use, easy to consume, with the enterprise functionality all included in our distribution, in open source, 100% Apache compliant, and then being very, very focused on ensuring that the ecosystem can interoperate with it and cleanly adopt it. And then our model, from an economic standpoint, is: where there's enough value, pay us for support. If there's not, there's no encumbrance on you continuing to use the platform.
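On the connectors raised a moment ago: under the hood, most of them essentially stream relational rows into HDFS files (tools like Apache Sqoop parallelize this across many mappers). Here is a single-threaded, illustrative Java sketch of that movement; the JDBC URL, credentials, table and output path are all hypothetical placeholders, and the warehouse's JDBC driver is assumed to be on the classpath.

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WarehouseToHdfsSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical warehouse connection; any JDBC-accessible MPP store works the same way.
        Connection db = DriverManager.getConnection(
            "jdbc:vertica://warehouse-host:5433/sales", "etl_user", "secret");
        FileSystem hdfs = FileSystem.get(new Configuration());

        try (Statement stmt = db.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT order_id, amount FROM orders");
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                 hdfs.create(new Path("/warehouse_export/orders.csv"))))) {
            // Stream each relational row out as a CSV line in HDFS,
            // where MapReduce or Hive jobs can pick it up.
            while (rs.next()) {
                out.write(rs.getLong(1) + "," + rs.getString(2));
                out.newLine();
            }
        } finally {
            db.close();
        }
    }
}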
So your strategy is all technology development, a 100% Apache-compatible distribution, and then pull support. So the pull comes in from the growth; you're banking on the growth of the market to give you that efficiency and profitability.

And our core philosophy is that if we drive enough value into the core platform, from a stability standpoint and with enough innovation in the functionality and the technology, the market will take that and the market will function.

I've got to say, Rob, and I want to say this for your entire company, I'm really impressed by how you guys have handled yourselves this past year. When you launched a year ago, when Eric kind of spun the company out of Yahoo, it was kind of quasi-ugly, and for the folks living in Silicon Valley there was all this stuff going on in the back rooms, and the press tried to make it into a Cloudera-versus-Hortonworks mud-slinging match. But what happened is you guys kept your cool and high integrity, and the market grew around you, so it wasn't about Cloudera versus Hortonworks. I think ultimately that's the benefit to the community, and you guys did a good job of not really going there in that silly conversation. So I wanted to raise that.

The press and the media love to do that; they're incented to. Mike and I are on a very friendly basis. We meet often, our teams collaborate, and we're both very committed to getting Apache Hadoop to a great place.

It's been fun to watch, and I want to follow up on my story, because I was pretty critical at the beginning: can I trust these guys? But you know what, you guys have built a great team, you've done a great job, and it's clear you're doing some great work.

To answer the question, though: both Cloudera and Hortonworks are very, very committed to making Apache Hadoop the center of gravity of the next-generation data platform for the enterprise.

We're getting the hook. So just on the sound bite, on the way out, talk to the crowd about your agenda for the next year. What are your goals? You're the CEO of Hortonworks; you've got a lot going on building your own business, building that venture up and scaling it. But you've got a marketplace that's exploding in demand, without enough supply. So what's your agenda?

Bring stability to Apache Hadoop and bring those enterprise data services within Apache Hadoop; continue to make sure that we are a good steward within the community, innovating the 2.0 line; and make sure that we're doing the things within the ecosystem that generate a pull market and get Apache Hadoop driving value for the enterprise, both with the ecosystem and organically.

Rob Bearden, the CEO of Hortonworks, who's putting on this show. They're investing a lot in this and in the ecosystem: trust, being a good citizen, on top of a highly competitive, growing market. It's fun to watch, fun to be a part of. This is theCUBE, and we're going to have another good citizen coming up right after this guest: Doug Cutting, the co-creator of Hadoop, on theCUBE again, a CUBE alum. Good to have Doug on. So stay tuned for Doug Cutting right after this short break.