Live from Midtown Manhattan, it's theCUBE's live coverage of Big Data NYC, a SiliconANGLE Wikibon production, made possible by Hortonworks, we do Hadoop, and WANdisco, Hadoop made invincible. And now, your co-hosts, John Furrier and Dave Vellante.

Okay, we're back here live in New York City. I'm John Furrier with Dave Vellante. This is theCUBE, our flagship program. We go out to the events, or we have our own events; Big Data NYC is our event here, getting all the coverage of Hadoop World, Strata, and what's going on in New York City. This is our exclusive coverage, and I'm joined by Jagane Sundar, CTO and VP of Engineering, Big Data, at WANdisco. I want to say thank you first for supporting this great independent event and coverage. Thank you very much. Welcome back to theCUBE.

Thank you for having me.

So obviously the discussion at this show is about business value, but really it's about the technology settling in. People are finding their sweet spot, firm ground to keep their feet on and build out their businesses. I want to get your take on that dynamic. Describe in your own words what's happening: what are people settling into, and what have you settled into on your growth path? We heard earlier from your CEO, but I want your perspective.

Certainly. The distributions, of course, are the starting point, and there were many distribution vendors, including ourselves. But the market is settling down to Cloudera and Hortonworks as the big players, and there's Pivotal and Intel. As I mentioned earlier, we're not really focused on the distribution anymore; we've moved up into a value-added layer, the high-availability and disaster-recovery solution space. The same thing is happening in other parts of the Hadoop ecosystem; the battle has shifted up, if you will. The focus now seems to be on real-time, or near-real-time, processing platforms. The focus is on disaster recovery and availability, and on security. These are the demands that enterprises are making of the Hadoop ecosystem before they deploy it. Specifically, in our case, we're addressing the problems of single-machine failure within a data center and what happens if an entire data center goes down; we offer continuous availability in the face of both types of failure. If you look at other vendors, some are providing security solutions, and there are vendors hawking solutions that do real-time analytics, such as the Tez and Stinger effort from Hortonworks and the Impala effort from Cloudera. That's where the battle lines are most interesting at this point.

And real-time has been a real focus. Talk a little bit more about why that is the case. Why are people so interested in real-time?

When we started talking to customers, the majority of them came back to us saying Hadoop is slow, and that was an unusual use of the term for us. If you dig deeper, what you find is that people are unhappy with the many seconds it takes for a MapReduce job to spin up before a bit of computation can happen. That is what real-time means in this big data ecosystem: avoiding the spin-up of those MapReduce jobs and doing the computation on data that's already loaded into memory or into tiered storage, perhaps some of it in flash. That's one of the focal areas of real-time in the big data context.
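To make the "slow" complaint concrete, here is a minimal back-of-the-envelope sketch of why job spin-up dominates short queries; all figures are assumed for illustration, not taken from the interview:

```python
# Illustrative arithmetic for the "Hadoop is slow" complaint: on short
# queries, MapReduce job spin-up dominates end-to-end latency, which is
# what in-memory engines avoid. All figures below are assumptions.

def response_time(startup_s: float, compute_s: float) -> float:
    """End-to-end latency is startup overhead plus actual compute."""
    return startup_s + compute_s

batch = response_time(startup_s=20.0, compute_s=2.0)  # job spin-up per query
inmem = response_time(startup_s=0.1, compute_s=2.0)   # data already in memory
print(f"batch: {batch:.1f}s  in-memory: {inmem:.1f}s")  # 22.0s vs 2.1s
```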
So can we talk about what it means to make Hadoop enterprise-ready? Since we've been tracking this, we've had numerous guests on theCUBE, and many, many have said, hey, we're going to get into the market. They dove right in and said, we're going to make this enterprise-ready. I remember EMC said that to us, pre-Pivotal; certainly guys like NetApp and HP, on and on, and of course Cloudera and Hortonworks said, hey, we're doing that too. What does it mean? Because you guys are really focused on hardcore hardening of Hadoop. What does it mean to make it enterprise-ready?

Absolutely. In our estimation, enterprise-ready means continuously available. You cannot afford to have downtime under any circumstance. If a rack in a data center fails, Hadoop should not pause; none of the applications should be unavailable, not for a single second. If the entire data center goes down, you should still be fully functional. In fact, if you have what is known as a disaster recovery solution, it should be of a nature such that both data centers are fully active. You can do processing on both; there is no read-only copy of the data sitting in a remote location. That's not what enterprises want. Enterprises want full utilization of their hardware. If you have a data center in Asia and a data center in the US, and you're ingesting data into either of those locations, you should be able to do processing in both locations, and failure of one data center should not cause any interruption in service. That is the first and most basic requirement for any enterprise that wants to take a big data solution into production; that is our belief. Next to that, you have security issues to address; the security built into Hadoop is considered inadequate by many enterprises, so there are enhancements coming in that area. And finally, there's the actual programming paradigm: is MapReduce adequate for quick responses, or do you want to do something more real-time? Those are the issues being addressed, in that order.

Okay, now let's break down which ones you guys are attacking. Obviously continuous availability is your sweet spot, right?

Absolutely, that's the main focus of our business. We have years of experience building wide-area, network-distributed computing solutions, and we've applied that technology to Hadoop. We have a continuously available solution that will continue to run in the face of entire data center downtime: links going down, the whole data center losing power, all of the above.
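A minimal sketch of the active-active, both-data-centers-live pattern he describes: a client can write into either site, and losing one site just takes it out of rotation. The site names and the `put` call are hypothetical stand-ins, not a real WANdisco API:

```python
# A sketch of active-active ingest across two data centers: writes can
# land at either site, and a site failure removes it from rotation
# without interrupting service. Names and calls here are hypothetical.

import itertools

class Site:
    def __init__(self, name: str, up: bool = True):
        self.name, self.up = name, up

    def put(self, key: str, value: bytes):
        if not self.up:
            raise ConnectionError(self.name)
        # a real site would ingest into HDFS here

class ActiveActiveClient:
    def __init__(self, sites):
        self.sites = list(sites)
        self._next = itertools.cycle(range(len(self.sites)))

    def write(self, key: str, value: bytes) -> str:
        """Try each active site in turn; succeed if any one accepts."""
        for _ in range(len(self.sites)):
            site = self.sites[next(self._next)]
            try:
                site.put(key, value)
                return site.name        # write landed at this data center
            except ConnectionError:
                continue                # site is down: try the other one
        raise RuntimeError("all data centers unreachable")

# One site down, service continues: the write lands in the US data center.
client = ActiveActiveClient([Site("us-east"), Site("asia", up=False)])
print(client.write("event-1", b"payload"))  # -> us-east
```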
Okay, so we had Abhi Mehta on earlier, and he had this great little shtick about BC and AC: BC was before Cutting, AC was after Cutting. That's classic Abhi Mehta, but he's very straightforward, and he said, look, if you were building platforms BC, they're not going to apply after Hadoop was popularized. But I want to ask you: you're doing some pretty interesting things to make Hadoop continuously available, but there are other technologies in the marketplace. Off-camera I mentioned SRDF, for example. I would think a company like EMC would be all over this; they popularized that whole concept and did a ton of business down on Wall Street, et cetera. So what are you seeing from competitors like that, and why are you different, better, cheaper, faster, et cetera?

To start with, EMC is a fine company. They have block storage solutions, and all of their solutions are SAN-based, including the SRDF you just mentioned. The way SRDF works is that blocks on the disk are replicated across a wide-area network of limited distance; it has to be 50 or 100 miles, I don't know the exact specification, but it's the blocks that are replicated, and the file system layers on top of that. If you take a file system like NTFS or ext4, both of them have timeouts when it comes to reading responses from a disk. If they don't get a response in 10 or 15 seconds, the ext4 or NTFS file system gets very upset; sometimes the file system drops into read-only mode, sometimes it refuses to service both reads and writes. That's downtime. This is the root cause of the distance limitations on SRDF. Now, the file system is not a layer of software developed by EMC or any of the disk storage vendors; it's operating system software. The paradigm changed with Hadoop in that we went to a distributed file system that resides on top of traditional disks and operating systems. We operate at the Hadoop file system layer. Our technology is integrated into the core of Hadoop's HDFS, and we do our asynchronous replication at that layer, so we don't really have any distance limitations. It's a question of bandwidth, not one of latency. You can have a data center in Asia and a data center here in the US, and as long as you have more bandwidth than the amount of data you're ingesting, if your peak hours are between 9 and 10 a.m. and your data ingestion rate is X, as long as your WAN bandwidth is greater than that, we can guarantee continuous availability of your applications, whether a data center goes down, a rack goes down, or a machine goes down.
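That guarantee reduces to a capacity check: replication keeps up as long as usable WAN bandwidth exceeds the peak ingest rate. A back-of-the-envelope sketch, with assumed illustrative figures:

```python
# The claim above as arithmetic: asynchronous HDFS-layer replication
# keeps up, regardless of distance, as long as usable WAN bandwidth
# exceeds the peak ingest rate. All figures here are assumptions.

def replication_keeps_up(peak_ingest_gb_per_hr: float,
                         wan_gbps: float,
                         usable_fraction: float = 0.7) -> bool:
    """True if usable cross-data-center bandwidth exceeds peak ingest.

    `usable_fraction` is an assumed planning margin for protocol
    overhead and competing traffic on the link.
    """
    ingest_gbps = peak_ingest_gb_per_hr * 8 / 3600  # GB/hour -> Gbit/s
    return wan_gbps * usable_fraction > ingest_gbps

# Example: 900 GB ingested in the 9-10 a.m. peak hour, 10 Gbit/s WAN link.
print(replication_keeps_up(900, wan_gbps=10))  # True: 7.0 > 2.0 Gbit/s
```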
So this is pretty fundamental, because not only is there potentially a data center in some other part of the world, but there's data there. We had a lot of discussion about this; when we first came here four years ago, John, you remember we were discussing the whole concept of shipping code, not data. So you're saying you sit at that layer, the Hadoop layer, and you're fundamentally a distributed architecture.

Absolutely. We are fundamentally a distributed architecture, and we employ a Paxos-based distributed coordination system. That's the strongest, most reliable way of doing coordinated consensus in computing. Most folks have not been able to implement it because of the complexity, but as we pointed out, we've been in this business for over seven years now, building distributed computing solutions, so it was a natural extension for us to build this for Hadoop.

Of course, Paxos has been around for a while. I mean, it's proven.

Indeed.

Essentially, somebody said earlier, the algorithms are free; it's what you do with the algorithms that's the hard part. I presume you agree with that. Maybe talk about that a little bit.

Totally. If you look at ZooKeeper, which is an open source implementation of a distributed coordination algorithm, they often mention that it's simpler than Paxos, but the simplicity comes at a cost: it's not mathematically proven. In our case, we have implemented Paxos in its most mathematically strong form. We can guarantee that the distributed coordination will happen without any data loss in the face of any failure you can foresee. And as you point out, the algorithms are published and well known; implementing them, putting them into production, learning from customer use cases, making them robust, that takes time, and that's where our primary value lies.

I want to talk about something I mentioned earlier, and this is where it gets to my sweet spot, where my heart's at these days. One of the things we're always passionate about is emerging technology; that's what SiliconANGLE keeps theCUBE on. We like to go out and look at the exploration areas, and one of the most talked-about, highly contested battlegrounds right now is the software-defined data center. That is really the moonshot, the destination. It's still forming, not even frothy yet, but certainly hyped up. Network virtualization and these software technologies have taken the tech business, and the cloud specifically, to a whole other dimension, and enterprises are building out new on-premise architectures around converged infrastructure, around flash, and all of this is under the hood. Okay, so that's great: that's powering the apps, that's powering the big data. When you mention data centers and talk about downtime, these are the top conversations among the people spending the big CapEx and OpEx to re-engineer. You can't have a software-defined data center, you can't have software-defined anything, without a data center. So talk about your role and positioning here, because it's working great for you guys around uptime. We all know the examples of Netflix going down over a zone failure or a power supply in a data center; we all know those horror stories, but they're real, and they bring up these concerns. So, on that same thread of continuous operation, can you expand on that dynamic: the cloud environment, the software-defined data center, all the work going on under the hood, and how it relates to WANdisco?

Certainly. You mentioned software-defined networks; I'd like to start with the hardware networks of today. As we all know, they take a little bit of time to stabilize, but once they mature, there's hardly any failure in that network layer. The hardware is very stable; you don't have a whole lot of bugs in it. The complex software on top of that is where all your bugs are. What software-defined networks do is make the hardware layer smaller and smaller. You get a great deal of bandwidth, but the bandwidth utilization, how it's managed and carved up between applications, and the quality of service provided, that's where the software-defined network comes in. Now, once you get things into software, development is much faster, but there is a cost: there are more bugs, simply stated. How do you overcome those? If there's a failure in one server, you need to have multiple active servers, so you need distributed coordination systems to do that. The only correct way to do that is Paxos, of course. Everybody knows you can take shortcuts along the way, but if you want to do it right, you need to use Paxos to establish a distributed consensus system and build your state machines on top of that. Once you have the software-defined network, you then need to build distributed file systems on top of it, and at this point HDFS is the most successful distributed file system. We've stepped in and made HDFS continuously available, and we will continue with this angle: if there's any service in a data center, a virtual data center, a fully flexible data center, any single service that serves some crucial function, we will apply our distributed coordination engine and make that service that much more robust. That's how we expand.
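Since Paxos comes up twice here, a minimal single-decree sketch may help; this is an in-memory teaching model of the prepare/promise and accept phases, under simplifying assumptions, not WANdisco's implementation:

```python
# A minimal single-decree Paxos sketch: the prepare/promise and
# accept/accepted phases behind the consensus described above.

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Acceptor:
    promised: int = 0                  # highest ballot promised so far
    accepted_ballot: int = 0           # ballot of the last accepted value
    accepted_value: Optional[str] = None

    def prepare(self, ballot: int) -> Tuple[bool, int, Optional[str]]:
        """Phase 1b: promise to ignore ballots lower than `ballot`."""
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted_ballot, self.accepted_value
        return False, self.accepted_ballot, self.accepted_value

    def accept(self, ballot: int, value: str) -> bool:
        """Phase 2b: accept the value unless a higher ballot was promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted_ballot = ballot
            self.accepted_value = value
            return True
        return False

def propose(acceptors, ballot: int, value: str) -> Optional[str]:
    """One proposer round; returns the chosen value, or None to retry."""
    majority = len(acceptors) // 2 + 1

    # Phase 1: gather promises from a majority of acceptors.
    replies = [a.prepare(ballot) for a in acceptors]
    granted = [(b, v) for ok, b, v in replies if ok]
    if len(granted) < majority:
        return None  # ballot too low; retry with a higher ballot

    # Safety rule: if any acceptor already accepted a value, adopt the
    # one with the highest ballot instead of our own.
    prior = [(b, v) for b, v in granted if v is not None]
    if prior:
        value = max(prior)[1]

    # Phase 2: the value is chosen once a majority accepts it.
    accepted = sum(a.accept(ballot, value) for a in acceptors)
    return value if accepted >= majority else None

cluster = [Acceptor() for _ in range(5)]
print(propose(cluster, ballot=1, value="apply-edit-42"))  # -> apply-edit-42
```

The safety rule in Phase 1 is the part ZooKeeper-style protocols simplify; keeping it is what makes the mathematically proven guarantee he mentions possible.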
This is the notion of non-stop Hadoop, right?

Absolutely. It's non-stop Hadoop, and it'll be non-stop other data center components in the coming years.

Okay, we really appreciate it; thank you very much for coming on theCUBE. This is theCUBE, live from New York City. That was the CTO and VP of Engineering at WANdisco. Really excellent positioning: continuous operations. The data center is still where the action is. That's the engine, that's under the hood, that's powering all the applications, and that's what's going to enable this new software environment. This is theCUBE, live from Big Data NYC. Tons of announcements going on; you've got Hadoop World going on, you've got the Strata conference going on, a lot of action. We'll be right back with our next guest after this short break. Thank you, Josh.