Live from Midtown Manhattan, the Cube's live coverage of Big Data NYC. A SiliconANGLE Wikibon production. Made possible by Hortonworks: we do Hadoop. And WANdisco: Hadoop made invincible. And now your co-hosts, John Furrier and Dave Vellante.
Okay, we're back here live in New York City for Big Data NYC. This is the Cube, where we go out to the events, and sometimes create our own, and extract the signal from the noise. We are here in Manhattan right next to the Hilton. We're in the Warwick Hotel. We're talking about all the big data action happening here at Big Data NYC, Hadoop World and the Strata Conference. I'm John Furrier, the founder of the Cube. We're here with Jim Campigli, COO of WANdisco, and Jim Walker, Director of Product Marketing at Hortonworks. Guys, welcome to the Cube.
Thanks, John. Good to see you again.
Well, I want to jump right into it. You guys, Jim, at Hortonworks led the news yesterday. You were all over the news with HDInsight. A lot of buzz around Microsoft. Congratulations. Your data platform is getting a lot of traction. That's the big news again. The data platform is not just about Hadoop. It's about what's going on around it. And Jim, you guys are getting a lot of press around the data center continuous operations theme, which is extending into what I call the realities of Hadoop, which is, hey, I want to put it in production. So let's get into that. So you guys have a partnership. Let's talk about the Hortonworks WANdisco partnership. What's the deal with the partnership here?
Well, so what WANdisco offers is something beyond just high availability or failover. We like to talk about it as continuous availability. And it's not just in a single data center. It's globally across the WAN, over multiple data centers. And what this enables is a whole new range of use cases that heretofore were basically impossible.
And I can get into a little bit of that later in the interview, but really we're focused on continuous availability. And what that means is that you don't have one data center with a cluster that's fully readable and writable and another data center that's purely a read-only backup. In our architecture, not only do you have multiple active NameNodes for every Hadoop cluster, but you also have multiple active data centers. And we literally enable you to have a single Hadoop cluster across multiple data centers over the WAN. So what that means is everybody everywhere has effectively real-time read and write access to the same data at all times.
This is fundamental. And what we were talking about before we went on live is that continuous operation. And Jim, I want to ask you about your end of the partnership. It's really a 2.0 story. Just a couple of weeks ago I was with the CEO and Chairman of GE, Jeffrey Immelt. And he said, for them, it's about non-disruptive operation, which is basically the same thing. If your business goes down, even a small percentage change makes a big difference. Why hasn't this happened sooner? This is such an obvious thing. But still, it's a major thing right now.
Yeah, we were talking about this just before we went on. And you know how this used to work? It was a phone call in the middle of the night. And so having continuous availability is absolutely important. It's a key enterprise requirement. We see our customers and prospects asking for this sort of thing. Now, I consider it a 2.0 story because really I think the release of Hadoop 2.0 two or three weeks ago, oh god, man, time flies, really opens up the platform to integration into some really interesting things. And so the partnership with Jim at WANdisco is really critical to our enterprise customers because it's, like I said, a key requirement. Reliability, all the core stuff, the ilities, right?
The reliability, security, usability, all these things are huge. And it's just one of those pieces where, look, people are going to adopt Hadoop. And everybody's interested in Hadoop. Every RFP I've ever heard of in my entire life has certain questions on it. And this is one of those key questions that are asked. And I think it's important.
And Jim, one of the things we were talking about in a couple of our other interviews earlier is that people, when they grow up, get more experienced. And Hadoop is now growing up. And the enterprises have legacy requirements. There are a lot of nuances around making something enterprise-grade. What's your definition of enterprise-grade? What are you hearing from customers? And then how does that relate to how the data platforms are evolving, whether it's the decoupling, the highly cohesive nature of how apps are being built, all of the above? Can you talk about that?
Sure. So I think, if you look at the way Hadoop is being used by and large today in the enterprise, what's really happening is they're using it to flatten their data warehousing costs. Basically, what's very common is that any data older than six months that they need real-time or relatively real-time access to, and that's not going to change, they put out to Hadoop. And then they're paying less for their commercial data warehouse software. One of the things that's important to note is that what we want to call the enterprise-enabling capabilities that you were just mentioning, the ones a cluster needs to actually pass a data center audit in a large corporation, aren't there yet. And that's really the explanation for why only about 24% of the Hadoop clusters out there are actually in production in those data centers.
Give me an example. I mean, there's a lot of folks out there just trying to get their minds around this because it's a big concept around this whole.
I mean, it's pretty straightforward as a concept, but explain to them what continuous availability means. Give an example of something going down, like a full data center. Walk me through an example for the folks out there.
Sure, so there are really two levels of this. The first is, let's imagine an example where you have two data centers with our software implemented with HDP 2.0. So you have Non-Stop Hadoop for Hortonworks, which is the combined product name that we're going to market with. Effectively, in those two data centers, Hadoop is fully readable and writable in both locations, and all the users, all of the MapReduce jobs, all of the applications have access to the same data simultaneously. And this continuous availability works both within and across the data centers. So one of the key things to understand is that the first thing our software does is address the most fundamental single point of failure in Hadoop's architecture, and that's the single active NameNode. So if you look at any other approach to this, and in fact some of the improved approaches in Hadoop 2.0, with NameNode HA using the Quorum Journal Manager and so forth, which eliminates a lot of the manual failover steps that had to be undertaken with Hadoop 1.0, it only goes so far, and it doesn't allow you to have more than one active NameNode in front of the cluster. When that cluster loses that one active NameNode, you're basically depending on ZooKeeper to manage the failover to the standby for you, and you still have to be concerned about split brain, data corruption, and other things coming into play. And the other challenge with that is you're limited to one data center. It is not designed to span multiple data centers over a wide area network. So effectively what we do, if we start in the single data center, is enable you to have multiple active NameNodes in front of that Hadoop cluster.
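The difference between one active NameNode and several can be pictured from the client's side. The following is only a toy sketch in Python, not WANdisco's or Hadoop's actual client code: the `NameNode` class, the `locate_any` helper, and the node names are all hypothetical, and a real client would speak the HDFS RPC protocol rather than call a method. The point it illustrates is that, with several active NameNodes, losing one of them does not stop metadata lookups.

```python
from dataclasses import dataclass


@dataclass
class NameNode:
    """Hypothetical stand-in for one active NameNode."""
    name: str
    up: bool = True

    def locate(self, path: str) -> str:
        # A real NameNode would return block locations; here we only
        # report which node answered.
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} resolved {path}"


def locate_any(name_nodes, path):
    """Ask each active NameNode in turn; fail only if every one is down."""
    for nn in name_nodes:
        try:
            return nn.locate(path)
        except ConnectionError:
            continue  # try the next active NameNode
    raise ConnectionError("no active NameNode reachable")


nodes = [NameNode("nn1"), NameNode("nn2"), NameNode("nn3")]
nodes[0].up = False  # simulate losing one NameNode
print(locate_any(nodes, "/data/events.log"))
```

With a single active NameNode and a standby, the same outage would mean waiting for failover to complete before any lookup succeeds; here the request simply lands on the next node.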
At the DataNode level, Hadoop already does a great job of replicating the data, but the metadata, the NameNode that actually instructs all the client applications and the MapReduce jobs where to find the data in the cluster, is effectively a single point of failure. So what we do is use our patented non-stop technology to effectively replicate that NameNode. So the NameNodes can be clustered, and clients can access multiple active NameNodes at the same time. So you don't have the single-threaded situation where if that one NameNode goes down, everything stops. And if you do take one NameNode down for maintenance, or it fails for some other reason, as soon as it comes back online it resyncs automatically. The administrator doesn't have to do anything.
Okay, so that's the single data center example. We're keeping the NameNode up, and that has a lot of important side effects: enabling higher availability for YARN, because there are components of YARN that reference the NameNode to locate data when you're running MapReduce jobs and other applications, as well as just general HDFS accessibility for HBase and any other application you want to run against it.
So when we move beyond that to multiple data centers, what ends up happening is you have that same scenario over the WAN, as if you had just one big cluster worldwide, and that's effectively what we have. So if you look at our architecture, and this may be getting a little too technical for some of the people watching right now, I'm sure most people that have implemented Hadoop are familiar with the fact that every node has what's called a rack ID. What we've done is add something on top of that called a data center ID. And literally across all the data centers where you've implemented our software, it's all the same cluster. It's all the same data. We call it one copy equivalence.
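The two ideas described here, replicated NameNode metadata that resyncs automatically and a data center ID layered on top of the rack ID, can be sketched in a few lines of Python. This is an illustrative model only and makes several assumptions: a plain ordered list stands in for whatever agreement protocol the patented technology actually uses, the class names `NameNodeReplica` and `Cluster` are hypothetical, and the `/data-center/rack` path format is just one plausible way to write the two-level location.

```python
from dataclasses import dataclass, field


@dataclass
class NameNodeReplica:
    """Hypothetical replica holding a copy of the namespace metadata."""
    name: str
    data_center: str   # the extra level added on top of the rack ID
    rack: str
    up: bool = True
    applied: int = 0   # how many log entries this replica has applied
    namespace: dict = field(default_factory=dict)

    @property
    def location(self) -> str:
        # Assumed two-level location: data center ID, then rack ID.
        return f"/{self.data_center}/{self.rack}"

    def catch_up(self, log):
        """Apply every log entry this replica has not seen yet."""
        for path, block_locations in log[self.applied:]:
            self.namespace[path] = block_locations
        self.applied = len(log)


class Cluster:
    """One logical cluster whose replicas may sit in different data centers."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.log = []  # ordered metadata changes, shared by all replicas

    def write(self, path, block_locations):
        self.log.append((path, block_locations))
        for replica in self.replicas:
            if replica.up:
                replica.catch_up(self.log)

    def bring_back(self, replica):
        # Resync happens on return; no administrator action is modeled.
        replica.up = True
        replica.catch_up(self.log)


ny = NameNodeReplica("nn-ny", "dc-newyork", "rack1")
ldn = NameNodeReplica("nn-ldn", "dc-london", "rack7")
cluster = Cluster([ny, ldn])

cluster.write("/logs/a", ["dn1", "dn2"])
ldn.up = False                      # London NameNode down for maintenance
cluster.write("/logs/b", ["dn3"])   # New York keeps accepting writes
cluster.bring_back(ldn)             # London catches up on what it missed

print(ldn.location, ny.namespace == ldn.namespace)
```

After the resync, both replicas hold the same namespace, which is the property the interview calls one copy equivalence. The sketch leaves out everything that makes this hard in practice, such as agreeing on the order of concurrent writes issued in different data centers.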
It means if an entire data center goes down, those users can fail over to another one and keep working.
So this is the non-stop Hadoop message.
Exactly. That's basically what we're saying.
So we had David on earlier, the CEO, and he was talking about active replication. And again, you mentioned active-active. So talk about this replication feature, if you guys can address that. What is that? Why is that really important?
Yeah, it's huge. I mean, I'm listening, and as I'm kind of looking over the conference, John, this whole thing is a coming-of-age story right now. You know, Hadoop is maturing. Hadoop has grown into this ecosystem of tools, and it's being used for some wildly different use cases than it was used for over the past two years. And so, as Hadoop matures, as people start to put more and more data into it, and as they start to plug in different processing models on it using YARN, which is the Hadoop operating system, you need to be able to support those things. You know, having a global presence of Hadoop in an organization is critical for a lot of multinational organizations. And especially with the old web guys, it was always fairly important because the minute you go into business online, you're instantly in the world. So it was very difficult to take care of those use cases in the past. And I think the active-active nature of replicated Hadoop clusters is a big deal.
I've always said, from day one at Hadoop World four years ago and in the subsequent years, there's a big enough beachhead for everyone. Don't fight over the fruit in one tree. There's plenty of fruit on all the trees. And it's interesting, this coming-of-age story is right on, in that people are picking their paths.
I mean, at some point you have to, there's an old expression, you know, get off the pot, if you will, kind of do something and actually make your move. I mean, so you see Cloudera, you see you guys, and you guys are right with your positioning. People have to make their bets and have to pick a path, pick a position. You can't be, you know, shifting all the time. And that's what maturity is, finding that place. So with that, what are the key positionings, if you guys have to talk to the marketplace, that you've settled in on? I know what your answer is going to be: we've never wavered, we've always been open source. Or maybe you have something new. And you guys have a nice positioning, which is getting great feedback from the marketplace. People are voting with their wallets, and so are customers. So is there a positioning you want to talk about now? I mean, one that you're settling into, where you'd say this is our groove, our swing, this is the vector we're on?
John, you know our position, and it started from day one and it's never wavered. You know, we do all of our work in the open, and we built an open source company around an enterprise data platform. We've been talking about the enterprise data platform for years, for two years actually, two in a couple of months. I don't think our strategy has ever changed. It's doing our work in the public domain and representing the enterprise in the community and bringing the innovation of the community to the enterprise. And that's really what we're all about. I mean, I'm sure you could tell that at this point, right?
Well, I knew the answer. I knew what you were going to say.
I hope you do.
No, I knew the answer was going to be more of the same, but I really want to drill down on that, because I think we're seeing the definition of open source in this modern era change, and that was one of the other questions we had: can there be a Red Hat for Hadoop? Can there be this and that? So we'll get to that later. On WANdisco, I want to talk to you guys because you have talent in that company. One of the things about WANdisco that I've always been impressed with is the talent. You have no shortage of good, smart people who work there. You guys have a nice path right now that you've built for yourselves on your positioning. Talk a little bit about that, because now there are plenty of distros to work with. Your business strategy and your technical strategy are right on path, so talk about that and then let's talk about the customers. Because again, the customers define everything. They vote with their wallets. That's validation, so go ahead.
Yeah, so our heritage is really about enterprise-enabling open source. Obviously, you know our company; we came out of over seven years of doing that for Subversion, really the most popular version control system out there for software developers. Basically making that multi-data-center aware, if you will. And on our positioning, we're really trying to move people away from thinking in terms of high availability, disaster recovery, backup, failover. We really want to use the term continuous availability to describe our solution. And what it means is real-time access to the same data everywhere, and no dependence on administrators or worries about human error when you have to do failover and recovery. So what we're also finding is that there's real demand for this. Obviously, there's so much rich data that Hadoop can store and provide access to.
And the reality of it is that in order to make full use of it, a lot of these large organizations have to have these kinds of capabilities just from a regulatory standpoint, let alone their own business requirements.
There are requirements; they have specific compliance issues, right?
Well, I mean, the classic example that I like to refer to literally can be a matter of life and death. If you go to the emergency room and they have to look up your patient history and the database is down and they can't find it, that's a real problem. We can be that game-changing. And our messaging is really geared around getting people to think about that and not think, oh, this is just a new and improved backup and recovery solution, it's backup and recovery over the WAN. We're much more than that.
Okay, so guys, tell me about where this partnership goes in your mind. How do you see it evolving? Obviously you guys have good co-branding with non-stop, a great message. Everyone wants to take a non-stop flight somewhere. No one wants to take a stop anywhere, never mind stopping their business. Where does the partnership go?
Yeah, I mean, for us, like I said before, John, our prospects and our customers are asking for this type of functionality. It's an engineering relationship. I mean, it's a good partnership. You guys have a great open source legacy as well, Jim. I mean, you know our model. It just works. We understand each other on both sides of the organization. We're being pulled into a lot of different companies, large multinational companies and even some of the smaller companies, and they all have these requirements. So the partnership goes into the very tactical: how do we go do business together? And it's feet on the street, working together. The one nice thing about working with a company that does understand the open source world is that the reps and the guys in the field get it. And that always helps in a relationship.
Yeah, you wanna have that group sitting together. Pastured scores, they say in open source. Jim, what's your take on the partnership?
Yeah, I think it's a nice fit for us. I mean, we've always been committed to enabling people to use the open source solutions without having them in some modified form. And that's what working with Hortonworks enables us to do. They are standard HDFS, standard Hadoop everything. All the components people know and love from the Hadoop ecosystem will still work the same way once they're implemented.
And Jim, you guys know Hadoop. I mean, your team knows Hadoop, and it's great. I know, I was cornered by Cos. Cos, you're out there. I know you're watching from home, so we need to give a shout-out. You talked my ear off about orchestration.
Yeah, we have some.
At least Cos knows, yeah. And do a vodka once in a while too. It's a great team over there.
Yeah, we have guys on our team that were involved in the initial development of Hadoop.
Yeah, the HDFS you've interviewed again. You guys are awesome. I want to thank both of you guys, separate from the great content, and more importantly, for supporting our efforts as an independent Cube here at the Warwick, across the street from the Hilton, as an independent event. You guys really are underwriting this and helping us for the community. We really appreciate it. I want to personally thank you and tell the folks out there: Hortonworks and WANdisco are really underwriting this and are great supporters of the Cube here at Big Data NYC. We'll continue to cover this ecosystem as it grows, matures, and, quite frankly, throws off some great business outcomes and profitability for everyone. This is one of the beautiful things about industries as they grow up: things happen, good things happen, the strong get stronger, and the weak kind of figure out how to get stronger or go away. So this is what's fun. This is the fun part. People start making money and good things happen. Congratulations.
This is the Cube, and we're live in New York City. We'll be right back after this short break.