Live at Hadoop Summit, this is theCUBE. This is our flagship program. We go out to the events and extract the signal from the noise. Hadoop Summit is in Silicon Valley, live from the San Jose Convention Center. This is day two of extended exclusive coverage from SiliconANGLE and Wikibon. I'm John Furrier, the founder of SiliconANGLE. I'm joined by my co-host. I'm Dave Vellante, wikibon.org. Charles Zedlewski is here. He's the vice president of product marketing at Cloudera, CUBE alum. Charles, welcome back. Good to see you again. Good to see you. So, we're entering this new phase of Hadoop. You guys were there from, you know, ground zero. And where are we at? How would you describe it? Well, I mean, you know, I think I've been talking to folks in theCUBE for three years now, maybe more. So, it's exciting to think that we've gotten to talk about a whole industry taking shape in front of our eyes, you know? So, it's a market category. It's a new kind of platform. It's an incredible set of stories that we read about in the news all the time now about how this technology is changing business and organizations. So, it's a pretty exciting time. And of course, you know, at Hadoop Summit there are a lot of technical tracks, a lot of discussion about the evolution of the technology and some of the new advances. And we're starting to see people's conceptualization of what this platform is. I think people's minds are changing right now about just what it's going to do and how far it's going to go. People are starting to understand what the long-term potential is. And it's much more significant than what people might have originally thought. People believe, as Avi Mehta just said. People believe, yeah. Yeah, absolutely. Charles, I mean, one of the things that we're proud of is working with Cloudera. You guys, and I mentioned this in a blog post and on Facebook to my followers and friends, have built the industry. Cloudera was the first company to commercialize.
I see Yahoo had a big part of it. Doug was there and is now at Cloudera. You guys were the first. You guys were the pioneers. You had no competition when you started. Yeah. And you guys really built a great team and a great company. Thank you. And you pioneered a lot of that stuff early on and continue to be the leader. And so, since then, competition has entered the fray. Sure. The phrase enterprise grade is now hitting this conference, and the talk of business solutions and business value has kicked into high gear. Right? So one of the things that came up yesterday, what we noted, was: what is the value? As this becomes enterprise grade, the POCs are increasing in scope, and there are production deployments. And now it's the business value conversation. This is where the big investment's coming in. So I want to ask you about that conversation you guys have had. I know you can't talk about some of the names you have for customers. Some you can, some you can't. Sure. But I know for a fact, you guys have some pretty big customers and big deployments. You've been there early with these big POCs. Right. And you do a lot of hands-on activities. So you're in there, probably the best in terms of that. What is the business value conversation that's hitting the mainstream today? Because, you know, you had the early adopters. Yeah. They've been digging in, and certainly with you guys. What is that business value today? One thing I want to touch on really briefly, if we're going to do business value, and you mentioned enterprise grade. You know, I think in the three years that we've been talking, "Hadoop is ready for the enterprise" has been a spiel for at least three years running. Oh, now, you know, with this next thing, before it wasn't enterprise, and now it's enterprise. But the reality is that very large and very traditional organizations have been running this in production for years now.
So really it's a question of what use case, and what industry, and what circumstances, and really the reality is everybody's definition of what's ready for them is different. And every year, the percentage of the enterprise customer base that says, ah yes, now I can trust this business process, or this workload, to the platform grows. We have telecommunications customers that have 20 petabytes under management in production today. In most enterprises that use Cloudera, we are already the single largest repository of data they have in their organization. So larger than a storage array, larger than a database. So we're pretty far down the path of enterprise. Well, let's stay on this for a second, and then we'll talk about business, because I think this is really, really important, because we've noted, certainly in theCUBE, the FUD in the marketplace, right? There's a lot of FUD, certainly against Cloudera, you're the leader, right? So everyone's going to shoot arrows at Cloudera because you guys are leading the pack. But enterprise grade might not mean the same for a big established legacy vendor, because they don't want their customers to know it's in production. They might say, hey, oh, they're only doing POCs with Hadoop. And what you're saying is, no, there are companies like Facebook and big web companies that have been doing it. And I think that's absolutely part of the dynamic. So you said before, Cloudera sort of had the field to itself. Today we track eight companies, including ourselves, that claim to be Hadoop vendors. A very high fraction of those are kind of incumbent systems and technology vendors, incumbent data management vendors. And it's a little distant, they're trying to kind of have it both ways, right? Because on the one hand, they want to say, oh yeah, we've got that too, let me check the box, right? And sort of co-opt the amount of enthusiasm there is for this movement.
At the same time, sort of damn it with faint praise and say, well, someday it'll be ready for the enterprise, right? So it's trying to have it both ways. And I think that's why Cloudera has continued to thrive. That's why independent companies have continued to thrive in this Hadoop market, because we know that it's actually delivering value in the enterprise today. And we have sort of the most ambitious vision for the potential of the technology, whereas the other folks are sort of saying, well, I want to check the box, but don't forget about this large catalog of- Let's just be candid. If I'm a company with clients, and I don't have a Hadoop solution, I am not going to tell my clients, oh yeah, yeah, we don't have Hadoop. Of course it's only POCs, of course I'm going to position it that way. So if I sell boxes, and now all of a sudden there's a larger repository that's called Hadoop next to mine, I'm going to freak out a little bit. Well, that's a natural reaction. That's a natural reaction. But I want to ask you, let's define enterprise grade, because you mentioned that. There are some saying, depending on how you look at the elephant in the room, there are different versions of it. So just break down, from your perspective, enterprise grade, because you said you're doing a ton of enterprise grade. Let's break that down. So I think that the properties that people care about, I'll give you probably five. One of them is availability, right? People just need to have their system run in a highly available mode at all times. Another big one is recoverability. Recoverability could mean recovering from user mistakes or application corruption, or it could mean recovering from a data center outage. So disaster recovery is kind of part of recoverability, but so is recovering from user error. A third one is around security. So people care a lot about all the different forms of information security.
A fourth one is around compliance. So I have all kinds of corporate policies and I need to make sure that this system fits inside the framework of my corporate policies. And I think the fifth one, which I would argue is actually the most important, even though it's not a typical enterprise buzzword, is usability. The biggest hallmark of an enterprise customer that's different from the web customers that you grew up with is that they don't have a lot of MapReduce developers on staff. If you look at the Googles and the Facebooks and the Yahoos of the world, they all can have a few clusters that'll have maybe 500 users, 1,000 users. There is no equivalent staff working at a large bank or a large telecommunications firm or a retailer. They have people that know SQL, they have people that know BI tools, they maybe have some people that know SAS procs or R functions. And so the biggest issue really, I think, for most customers is, how do I bring all those users over to this platform that has so many other advantages? Charles, you're the vice president of products at Cloudera, which puts you in charge of the product portfolio, and you have a lot of experience in the tech business. So, let's talk about the areas of improvement. Where do you see the areas that need to be improved, tweaked? Because obviously with the platform, there are developers waiting in the wings to start programming on top of Hadoop. And certainly the developer community needs to be bigger. So they're waiting, what's to do? So in my opinion, I think the availability story is actually in pretty good shape, at least in the case of Cloudera. You can run every component of the stack in a highly available model. We can tweak here and there, but I don't really think that that's a source of big gaps. We have some improvements coming in terms of recoverability. We already have a DR capability today, but we're going to move that over to a snapshot-based model.
You know the details, but that's going to be a nice improvement. But recoverability, we've already satisfied to some degree, and it's going to get better in the future. And that's a cost play right there, right? It's largely right. You're going to significantly make it more efficient. You can make, absolutely. You can make it more efficient to cover the recoverability story. It needs almost snapshots, okay. The big one is you're going to see some advances on the security side. So a lot of people have demand for database-style security, per column, per view, per whatever. And that makes sense. If you want to have 100 business users access data in Hadoop, almost by definition they should not get rights to all the datasets. But today you can only secure in very chunky, coarse-grained ways that make it very inconvenient for business analysts and business users to get at the data. So more fine-grained security. Fine-grained security is a big one. You're going to hear some news about that in the coming weeks. And then the big one, which drove our investment in Impala, it's driven our investment in search and it's driven our collaboration with companies like SAS and Revolution Analytics, is usability. How do we provide a BI experience which is comparable to what people are used to on traditional databases? How do we provide a machine learning or statistics experience which is comparable or superior to what they're used to if they're running SAS Grid or Enterprise Miner? How do we provide something that not even a business analyst but a doctor or a claims adjuster can use, something you can do with free text search? So the biggest investment we've made by far has been usability, adding new frameworks outside of MapReduce that allow us to attract new families of applications and new families of users to the same repository of data. I think that's going to be the big story of Hadoop for the next several years.
Charles, talk about the search thing because you know, I mean, I was kind of commenting. I wasn't trivializing the announcement but it didn't seem like a hard technical problem, maybe because I'm not understanding how Solr was rolled out, but obviously search is a core asset for people's usability. Did I get that wrong? I mean, obviously I'm oversimplifying but we'll just tease out what went on. Yeah, absolutely. What went on in search, explain it. I think your point's valid to some extent. So take Impala, which is a query engine we basically had to build from the ground up because we needed to take a new kind of approach to make parallel MPP SQL work on the Hadoop platform. There was no open source project that we could adopt. That's a hard problem. It was a hard problem to solve and it was not like there was an existing open source project where we could just start adopting it and contributing to it. We had to go from the ground up, but this wasn't necessary in search. If you look at what's possible, what you could do with Solr, it already had the ability to scale out to large volumes of load and large volumes of data. It already had resiliency, fault tolerance. It already had a lot of the things you needed to be part of the Hadoop family. What we had to do, though, is take what was historically a freestanding system. Most people today, if they use SolrCloud, have their own cluster just for SolrCloud, and it has its own management model, it has its own infrastructure, it has its own data sets, and we needed to take what was a freestanding system and turn it into a feature of the larger Hadoop platform. It was an integration challenge. Yeah, so for example, the way you'll be able to use Solr today in conjunction with CDH is you can store your index inside of HDFS. So that means if you've got a DR process that you're using for HDFS, Solr comes along for the ride.
So all those enterprise traits that we had invested in the core platform, Solr gets to inherit a lot of that, where previously you had to figure all that stuff out for yourself as a customer. Also the big thing we invested in is how to let Solr work off the same data as all the other frameworks in Hadoop. So what you want to be able to do is, you've got search users and you've got SQL users and you've got MapReduce developers. They all want to work in different ways but we want them working on the same data. Otherwise you just create little islands of data inside one cluster, so you're sharing hardware but no one is actually collaborating, and you're not able to build an end-to-end application. So we have Solr able to kind of index and read from and be able to search all the stuff you have in HDFS and then even take those search results and later on you'll be able to create this. So it's a great endorsement of Solr and that project. You guys essentially cobbled it in and integrated it and added some capabilities. And we also had to be a bigger part of the open source community. So we had Mark Miller, who was a key Solr committer and I believe also a PMC member, join Cloudera. Solr actually was an offshoot from Lucene, which was founded by Doug Cutting. So like everything else, if we're going to incorporate it into our platform then we're going to be a contributor and a leader in that particular open source community. Can you give us the update on Impala as we wind down? Absolutely, so Impala went GA a little while back. We've seen excellent adoption. If you look at the Impala user community, if you look at our customer base, the attach rate is extraordinarily high. If we just look at users of CDH 4.1 or higher, which is when Impala came out, of anyone who downloads CDH right now, about 85% of the time they're also using Impala.
The BI support has expanded, the performance has improved, we added columnar support, so it's getting faster and faster with every successive release. And we're going to do a number of releases, like every month or two, for Impala where we continue to add either more SQL functionality or more things that lower the latency and allow people to do interactive BI at gigabyte scale, terabyte scale, dozens of terabyte scale. At some point we'll get to petabyte scale. Yeah, okay, and so things like user defined query, we've talked about this before, that's here now with search. Yeah, exactly, so there's those features but the biggest test is, how do we let more people get that kind of half a second to three second response time on a bigger and bigger set of data? And I think I heard Amr say yesterday that you've open sourced Impala, is that correct? Impala's always been open source under an Apache license. So it's an Apache-licensed open source project and presumably you guys are the biggest contributors to that. Talk about that a little bit. Yeah, so we've managed the project, meaning the release dates and the schedule are managed by Cloudera employees, but the software is all under an Apache license. So you can take it, you can use it at any scope and scale that you want, you can change it. If you want to get in the Impala business tomorrow, you're welcome to take it, go scrub the name off of it. It's up on GitHub and I can get it. Yeah, go call it theCUBE Impala Part Two: The Revenge, knock yourself out. So very, very flexible. Come on, Matt, Charles. We'll call it SiliconANGLE's distribution of Impala. We're going to be announcing it every Hadoop World. We've been introducing a new technology, except now we're going to introduce software-defined Hadoop. It's enterprise ready. Oh, it's already software-defined. My final question, because we've got a break here, and maybe we can do the business value conversation another time: the open source communities, right?
You've been involved obviously with Cloudera and in your previous life. You've seen the open source movie before and it's been evolving, it's maturing at a very rapid pace. We've been commenting, and it's been ratified on theCUBE at multiple events on this past summer tour, that the new standards bodies are the open source communities. In the old stack of the OSI model, you had governing bodies that would do that. Not anymore. So the community is really, really important. So I want you to share your perspective on that, the ratification of these multiple omni-channel stacks or solutions, you've got Amazon, everyone else out there doing things. And the role of the community in ratifying standards, good conduct, community citizenship, contribution: what are you seeing as a best practice, and what does the community need to continue to do to be successful? Because now you have, in a way, a folksonomy or a grouponomy around the managing of projects, which are being voted on with code. Yeah, I think your observation is valid. So I used to work for BEA Systems and that was kind of part of this generation of software where the standards were formed by a standards body. And maybe it was the Java Community Process, which Sun largely directed, or maybe it was like the W3C. And what we found with all of those approaches to standardization is that they tended to get corrupted pretty quickly, right? You basically get the top big corporations, the IBMs of the world, the Microsofts of the world, who have money to burn and they can go staff these committees with people that do nothing but either stall or kind of direct something in a certain way. And the startups that are actually driving a lot of the new innovation, they just don't have the money to do that, right? So it's like a game of rope-a-dope and eventually these things just collapse under their own weight. Absolutely right.
The way standards work in my mind right now with open source is that part of it is what you said, which is, well, there's an open source community and they add features and then that becomes a standard. But the biggest thing is that it has to be adopted. If there's no adoption of the software that gets created, then it's not a standard. It doesn't matter how many people voted on it, it has to have the adoption. So there are many different collaborative models out there right now. I'll use core Hadoop as an example. Core Hadoop has a relatively diverse base of contributors and it is widely adopted by lots of customers, and that's what formalizes it as a standard. I'll give you other examples, something like Storm, which is just a GitHub project that Nathan Marz started, and that's become probably the most popular way to do stream processing. And again, that has a totally different method by which they take in patches and incorporate them. So there's how the engineers work together, but at the end of the day, there is no substitute for good software. You've got to show up with compelling functionality that people can consume easily. And adoption is the vote. Adoption is the vote. Adoption is what decides what the standard is. Charles, thanks for sharing that. Of course, SiliconANGLE and Wikibon, we're tracking it. We will keep an eye on it. This is what people want to do. This is why we have theCUBE here and this is what we live for. It's really amazing innovation. I love the inflection point. I think we are at that kind of OSI stack model in this new world. Software is the key. You're seeing it even with the hardware guys. Software-defined everything. You know, networks, compute, servers, everything is being software enabled and certainly Hadoop's a big part of it.
Charles with Cloudera, the leading company, the first one in building the industry and now with eight people you're tracking, we're tracking a little bit more because you got some other fringe things developing. Thanks for coming inside theCUBE. Great to see you. We'll be right back with our next guest here at Hadoop Summit. This is SiliconANGLE and Wikibon's coverage. We'll be right back after this short break. Thank you.