Live from New York, it's theCUBE covering Big Data NYC 2015. Brought to you by Hortonworks, IBM, EMC, and Pivotal. Now for your hosts, John Furrier and Dave Vellante.

Hello everyone, welcome back to theCUBE, live in New York City for a special Big Data NYC. This is our flagship program: we go out to the events and extract the signal from the noise. We are here live as part of Strata + Hadoop World, Big Data NYC. I'm John Furrier, with my co-host Dave Vellante. Our next guest is Jim Pinkley, the Chief Product Officer at WANdisco. Welcome back to theCUBE, great to see you.

Thanks, great to be here.

You've been COO of WANdisco, head of marketing, and now Chief Product Officer over a few years. You guys have always had the patents; David was on earlier, and I asked him specifically, why don't the other guys just do what you do? I wanted you to comment deeper on that, because he had a great answer: patents. You guys do something that's really hard, that other people can't do. So let's get into it, because Fusion is a big announcement you made, a big deal with EMC, and a lot of traction with it. And it's one of these things that's kind of talked about but not talked about; it's really a big deal. So what's the reason you guys are so successful on the product side?

Well, I think first of all it starts with the technology we have patented, this true active-active replication capability. Other software products claim to have active-active replication, but when you drill down on what they're really doing, typically they'll have a set of servers that they replicate across, and you can write a transaction at any server, but then that server is responsible for propagating it to all of the other servers in the implementation. There's no mechanism for pre-agreeing to the transaction before it's actually written, so there's no way to avoid conflicts up front.
There's no way to effectively handle scenarios where some of the servers in the implementation go down while replication is in process, and very frequently those solutions end up requiring administrators to do periodic re-synchronization, to go back and manually find out what didn't take and deal with all the deltas. Whereas we offer guaranteed consistency. Effectively, with us you can write at any server as well, but the difference is we go through a peer-to-peer agreement process, and once a quorum of the servers in the implementation agrees to the transaction, they all accept it, and we make sure everything is written in the same order on every server. And every server knows the last good transaction it processed. So if it goes down at some point in time, as soon as it comes back up it can grab all the transactions it missed during the time slice while it was offline and re-sync itself automatically, without an administrator having to do anything. And you can use that feature not only for network and server outages that cause downtime, but even for planned maintenance, which is one of the biggest causes of Hadoop availability issues. Obviously, if you've got a global deployment, when it's midnight on Sunday in the US, it's the start of the business day on Monday in Europe, and it's the middle of the afternoon in Asia. So if you take Hadoop clusters down, somebody somewhere in the world is going to be going without their applications and data.

It's interesting; I want to get your comments on this, because that's a great lead-in to the next conversation. What we've been hearing all throughout theCUBE this week is analytics outcomes. Those are the kinds of things people talk about, because it means checks are being written. Hadoop is moving into production. People have done the clusters.
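The agreement process just described, write at any server, reach a quorum before accepting, apply everything in one global order, and let a returning server replay the slice it missed, can be sketched roughly as follows. Everything here (class names, the simple majority rule, the shared in-memory log) is an illustrative simplification, not WANdisco's actual implementation:

```python
# Toy model of quorum-agreed, ordered replication with automatic catch-up.
# Illustrative only -- a real system uses a Paxos-style consensus protocol,
# not a shared in-memory list.

class Peer:
    def __init__(self, name):
        self.name = name
        self.online = True
        self.log = []  # transactions applied, in the agreed global order

    def last_good_txn(self):
        return len(self.log)  # position of the last transaction applied


class Cluster:
    def __init__(self, names):
        self.peers = [Peer(n) for n in names]
        self.global_log = []  # the single agreed ordering of all writes

    def propose(self, txn):
        """Accept a write only if a majority (quorum) of peers is reachable."""
        up = [p for p in self.peers if p.online]
        if len(up) <= len(self.peers) // 2:
            return False  # no quorum: reject up front, so no conflict later
        self.global_log.append(txn)
        for p in up:
            p.log.append(txn)  # same order on every reachable peer
        return True

    def recover(self, peer):
        """A returning peer replays what it missed -- no manual re-sync."""
        peer.log.extend(self.global_log[peer.last_good_txn():])
        peer.online = True


cluster = Cluster(["us", "eu", "asia"])
cluster.propose("t1")
cluster.peers[2].online = False    # e.g. planned maintenance in Asia
cluster.propose("t2")              # still accepted: 2 of 3 peers agree
cluster.recover(cluster.peers[2])  # catches up automatically on restart
```

After recovery every peer holds the same log in the same order, which is the guaranteed-consistency property being claimed, and a write proposed while no quorum is reachable is rejected up front rather than creating a conflict to reconcile later.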
It used to be the conversation of hey, X number of clusters, you do this, you do that, replication here and there, YARN, all these different buzzwords, really feeds and speeds. Now Hadoop is relevant, but it's kind of invisible under the hood, yet it's part of other things in the network. So high availability and non-disruptive operations are table stakes now. I want you to talk about that nuance, because that's what we're seeing as the thing that's powering it, the engine of Hadoop deployments. What is that? Take us through that nuance, because that's one of the areas you guys have been doing a lot of work in: making it reliable and stable, so you can actually go out, deploy Hadoop, and make sure it's always on.

Well, we really come into play when companies are moving Hadoop out of the lab and into production, when they have defined application SLAs and can only have so much downtime. That may be a business requirement, or it may be a regulatory compliance issue, for example in financial services. They pretty much always have to have their data available, and they have to have a solid backup of the data. That's a hard requirement for them to put anything into production in their data centers.

The other use case we've been hearing is: okay, I've got Hadoop, I've been playing with it, now I need to scale it up big time. I've got to double or triple my clusters, I've got to put it in with my applications. Then the customer says, okay, wait, do I need to do more systems work? How do you address that particular piece? Because that's where I think Fusion comes in, from how I'm reading it. Is that a Fusion value proposition? Is it a WANdisco thing? And is that happening?

Yeah, so there are actually two angles to that, and the first is: how do we maintain that uptime? How do we make sure there's the performance and availability to meet the production SLAs?
The active-active replication that we have patents for, which I described earlier, is embodied in DConE, our Distributed Coordination Engine, and it's at the core of Fusion. Once a Fusion server is installed with each of your Hadoop clusters, that active-active replication capability is extended to them, and we expose the HCFS API. So the client applications, you know, Sqoop, Flume, Impala, Hive, anything that would normally run against a Hadoop cluster, talk through us. If something has been defined for replication, we do the active-active replication of it; if not, we pass it straight through and it's processed normally on the local cluster.

So how does that address the issues you were talking about? What you're getting by default with our active-active replication is effectively continuous hot backup. That means if one cluster or an entire data center goes offline, that data exists elsewhere; your users can fail over and continue accessing the data and running their applications. As soon as that cluster comes back online, it re-syncs automatically.

Now what's the other? No user involvement?

No user involvement in that. Now, and this gets back to what I was talking about earlier, if I take servers offline for planned maintenance, to upgrade the hardware, the operating system, whatever it may be, I can take advantage of that same feature. I can take the servers or the entire cluster offline, and Fusion knows the last good transactions that were processed on that cluster. As soon as the admin turns it back on, it re-syncs itself automatically. So that's how you avoid downtime even for planned maintenance, even if you have to take an entire location offline.

Now, to your other question, how do you scale this stuff up? Think about what we do: we eliminate idle standby hardware, because everything is full read-write. You don't have standby read-only backup clusters and servers once we come into the picture.
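The pass-through behaviour described above, client calls go through the filesystem-compatible layer, paths defined for replication fan out to the peer cluster, and everything else is handled purely locally, might look conceptually like this. The class and variable names are invented for illustration, and plain dicts stand in for cluster filesystems; this is not the real Fusion API:

```python
# Conceptual sketch of an HCFS-style pass-through layer: every write lands
# locally, and only paths under a replicated folder also ship to the peer.

class PassThroughLayer:
    def __init__(self, local_fs, remote_fs, replicated_roots):
        self.local = local_fs
        self.remote = remote_fs
        self.roots = replicated_roots  # folders defined for replication

    def _is_replicated(self, path):
        return any(path.startswith(root) for root in self.roots)

    def write(self, path, data):
        self.local[path] = data        # always processed on the local cluster
        if self._is_replicated(path):
            self.remote[path] = data   # replicated paths also go to the peer


local, remote = {}, {}
layer = PassThroughLayer(local, remote, {"/data/shared"})
layer.write("/data/shared/a.csv", b"rows")  # replicated
layer.write("/tmp/scratch", b"junk")        # local only, passed straight through
```

Because the client only ever talks to the layer, whether a given path is replicated is a configuration decision rather than something the application has to know about.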
So let's say we walk into an existing implementation and they've got two clusters. One is the active cluster, where everything's being written to, read from, actively accessed by users. The other is simply taking snapshots or periodic backups, or they're using DistCp or something else, and they really can't get full utilization out of it. We come in with our active-active replication capability, and they don't have to change anything. What happens is, as soon as they define what they want replicated, we replicate it for them initially to the other clusters. They don't have to re-sync it, and the cluster that was formerly for disaster recovery and backup is now live and fully usable. So guess what? They've been able to scale up to twice their original implementation just by leveraging that formerly read-only backup cluster.

Is there a lot of configuration involved in that, or is it automatic?

No. Basically, again, you don't have to synchronize the clusters in advance. The way we replicate is based on the concept of folders, and you can think of a folder as a collection of files and sub-directories that roll up into root directories, which typically reflect particular applications people are using with Hadoop, or groups of users that have data sets they access for their various sets of applications. You define the replicated folder as a high-level directory that consists of everything in it, and as soon as you do that, in a new implementation, let's keep it simple, say you just have two clusters in two locations, we'll automatically replicate that folder in its entirety to the target you specify, and from that point on we're just moving the deltas over the wire. So you don't have to do anything in advance, and suddenly that backup hardware is fully usable; you've doubled the size of your implementation, you've scaled up to 2x.
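The folder model just outlined, define a high-level replicated directory, get one initial full copy to the target, then ship only the deltas, reduces to something like the following sketch. The function names and the dict-as-filesystem model are assumptions made for illustration, not WANdisco's actual interface:

```python
# Toy version of folder-based replication: one initial sync, then deltas.

def define_replicated_folder(source, target, root):
    """Initial sync: copy everything already under `root` to the target."""
    for path, data in source.items():
        if path.startswith(root):
            target[path] = data

def write_with_delta(source, target, root, path, data):
    """After the initial sync, only the changed file crosses the wire."""
    source[path] = data
    if path.startswith(root):
        target[path] = data  # ship the delta, not a full re-copy


primary = {"/apps/sales/q1.parquet": b"q1", "/apps/hr/payroll": b"hr"}
backup = {}
define_replicated_folder(primary, backup, "/apps/sales")  # initial full copy
write_with_delta(primary, backup, "/apps/sales", "/apps/sales/q2.parquet", b"q2")
```

Note that `/apps/hr` never reaches the backup cluster: only what falls under the defined replicated folder is copied, which is how selective replication keeps the initial sync and the ongoing traffic bounded.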
So, I mean, what you were describing before really strikes me that the way you tell the complexity and the value of a product in this space is: what happens when something goes wrong? That's the question you always ask. How do you recover? Because recovery is the very hard thing, and your patents, you've got a lot of math in there. But you also said something interesting, which is that you're an asset utilization play. You're able to go in relatively simply and say, okay, you've got this asset that's underutilized; I'm now going to give you back some capacity that's on the floor, and you take advantage of that.

Right, and you're able to scale up without spending any more on hardware and infrastructure.

So I'm interested in, another company announced an EMC partnership this week, and EMC sort of got into this way back in the mainframe days with SRDF. I always thought, when I first heard about WANdisco, it's like SRDF for Hadoop, but active-active. So then they bought that, yada yada.

And there are no distance limitations with our active-active.

So what's the nature of the relationship with EMC?

Okay, so basically EMC, like the other storage vendors that want to play in the Hadoop space, exposes some form of an HDFS API. In fact, if you look at Hortonworks or Cloudera, if you go and look at Cloudera Manager, one of the things it asks you when you're installing is: are you going to run this on regular HDFS storage, effectively a bunch of commodity boxes typically, or are you going to use EMC Isilon or one of the various other options? What we're able to do is replicate across Hadoop clusters running on Isilon, running on EMC ECS, running on standard HDFS. And what that allows these companies to do, without modifying those storage systems and without migrating the data off of them, is incorporate them into an enterprise-wide data lake, if that's what they want to do.
And selectively replicate across all those different storage systems. It could be a mix of different Hadoop distributions; you could have replication between CDH, HDP, Pivotal, MapR, all of those things, including the EMC storage I just mentioned that was in the press release, Isilon and ECS, which effectively have Hadoop-compatible API support. And we can create, in effect, a single virtual cluster out of all of those different platforms.

So is it a go-to-market relationship? Is it an OEM deal?

Yeah, I mean, it was really born out of the fact that we have some mutual customers that want to do exactly what I just described. They have standard Hortonworks or Cloudera deployments in-house, they've got data running on Isilon, and they want to deploy a data lake that includes what they've got stored on Isilon with what they've got in HDFS and Hadoop, and replicate across that.

Like the onerous EMC certification process?

We went through that process. We actually set up environments in our labs where we had EMC Isilon and ECS running and did demonstration integrations: replication across Isilon to HDP, Isilon to Cloudera, ECS to Isilon to HDP and Cloudera, and so forth. So we did prove it out, and they saw that. In fact, they lent us boxes to actually do this in our labs. So they were very motivated, and they're seeing us in some of their bigger accounts.

Talk about the aspect of two things. One is non-disruptive operations, meaning I want to be able to deploy stuff, because now that Hadoop has kind of gotten hardened on top, with some abstraction layers, the focus is analytics. There's a lot of work going on under the hood, as you said, and a large-scale enterprise might have a zillion versions of Hadoop. They might have a little Hortonworks here; they might have something over there. So there might be some diversity in the distributions. That's one thing. The other is operational disruption. What do you guys do there? Is it zero disruption?
And how do you deal with multiple versions of the distro?

Okay, so the simplest way to describe what we're doing is that we're providing a common API across all of these different distributions, running on different storage platforms and so forth, so that the client applications are always interacting with us. They're not worrying about the nuances of the particular Hadoop APIs these different platforms expose. So we're providing a layer of abstraction, effectively; we're transparent, in that sense, operationally, once we're installed. The other thing, and I mentioned this earlier, is that when we come in, you don't have to pre-sync clusters. You don't have to make sure they're all the same versions or the same distros or any of that. Just install us, select the data you want to replicate, and we'll replicate it over initially to the target clusters; from that point on, you just go. It just works. And we talked about the core patent for active-active replication; we've got other patents that have been approved, three patents now and seven applications pending, that allow this active-active replication to take place while servers are being added to and removed from implementations, without disrupting user access or running applications.

Final question for you: sum up the show this week. What's the vibe here? What's the aroma? Is it really Hadoop next? What's the overall Big Data NYC story here at Strata + Hadoop World? What's the main theme you're seeing coming out of the show?

I think the main theme we're starting to see is twofold. One is that we're seeing more and more companies moving this into production. And there's a lot of interest in Spark and the whole fast data concept. I don't think Spark is orthogonal to Hadoop at all; I think the two have to coexist. I mean, if you think about Spark Streaming and the whole fast data concept, basically Hadoop provides the historical data at rest.
It provides the historical context, and the streaming data provides the point-in-time information. What Spark together with Hadoop allows you to do is real-time analysis, real-time informed decision-making, but within a historical context instead of in a single point-in-time vacuum. So I think that's what's happening, and you notice the vendors themselves aren't saying, oh, it's all Spark, forget Hadoop, or anything like that. They're really talking about cool stuff, too.

All right, Jim from WANdisco, Chief Product Officer, really in the trenches, talking about what's under the hood and making it all scale in the infrastructure so the analysts can hit the scene. Great to see you again. Thanks for coming on and sharing your insight here on theCUBE. New York City, we are here, day two of three days of wall-to-wall coverage of Big Data NYC in conjunction with Strata, and we'll be right back with more live coverage in a moment, here in New York City, after this short break.