From New York, it's theCUBE. Covering Big Data New York City 2016. Brought to you by headline sponsors, Cisco, IBM, NVIDIA, and our ecosystem sponsors. Now, here are your hosts, Dave Vellante and Peter Burris.

Welcome back to theCUBE, everybody. This is Big Data NYC and we are covering it wall to wall; we've been here since Monday evening. We were at NVIDIA yesterday, talking about deep learning and machine learning. We had a full slate, eight data scientists up on stage, and then we covered the IBM event last night, the rooftop party. Saw David Richards there, hanging out with him, and it's wall to wall today and tomorrow. Jagane Sundar is here, he's the CTO of WANdisco. Great to see you again, Jagane.

Thanks for having me, Dave.

You're welcome. It's been a while since you and I sat down. I know you were in theCUBE recently at Oracle headquarters, which I was happy to see, and to see the deals that are going on. You've got good stuff going on with IBM, good stuff going on with Oracle. The cloud is eating the world, as we predicted and knew, but everybody wanted to put their head in the sand. You guys had to accommodate that, didn't you?

We did, and if you remember us from a few years ago, we were very, very interested in the Hadoop space. But along the journey we realized that our replication platform is actually much bigger than Hadoop, and the cloud is just a manifestation of that vision. We had this ability to replicate data, strongly consistent, across wide area networks, in different data centers, and across storage systems. So you could go from HDFS to a cloud storage system like S3 or Azure WASB, and we would do it with strong consistency. That turned out to be a bigger deal than providing replication just for the Hadoop platform. So we expanded beyond our initial Hadoop foray and now we're big in the cloud.
We replicate data to many cloud providers, and customers use us for many use cases: disaster recovery, migration, active-active, cloud bursting, all of those interesting use cases.

So anytime I get you in theCUBE, I like to refresh the 101, for me and for the audience that may not be familiar with it. You say strongly consistent, versus the term you hear, eventual consistency. What's the difference? Why is the latter inadequate for the applications that you're serving?

Right. So when people say eventually consistent, what they don't remember is that eventually consistent systems often have different data in the different replicas, and once in a while, once every five minutes or 15 minutes, they have to run an anti-entropy process to reconcile the differences. And entropy is thermal randomness, if you go back to your high school physics. What you're really talking about is having random data and, once every 10 minutes, reconciling it, and the reconciliation process is very messy. It's like last write wins, and the notion of time becomes important. How do you keep time accurate between those replicas? Companies like Google have wonderful infrastructure, with GPS and atomic clocks, so they can do a better job, but for the regular enterprise user that's a hard problem. So often you get wrong data that's reconciled. Asking the same query, you may get different answers in your different replicas. That's a bad sign. You want it consistent enough that you can guarantee results.

And you've done this with math, right?

Exactly. Our basis is an algorithm called Paxos, which was invented by a gentleman named Leslie Lamport back in '89, but it took many years for that algorithm to be widely understood. Our own chief scientist spent over a decade developing this, adding enhancements to make it run over the wide area network.
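The clock-skew hazard described above can be sketched in a few lines. This is a hypothetical illustration, not WANdisco's code: with last-write-wins reconciliation, whichever replica carries the later wall-clock timestamp wins, so a fast-running clock can silently overrule a genuinely newer write.

```python
# Illustrative sketch (not WANdisco code): why last-write-wins
# reconciliation depends on accurate clocks between replicas.

def lww_reconcile(replicas):
    """Pick the value whose wall-clock timestamp is latest."""
    return max(replicas, key=lambda r: r["ts"])["value"]

# Replica A writes first in real time, but its clock runs 30s fast.
# Replica B's later write carries an earlier timestamp, so it loses.
replica_a = {"value": "stale", "ts": 1000.0 + 30.0}  # skewed clock
replica_b = {"value": "fresh", "ts": 1005.0}         # true later write

winner = lww_reconcile([replica_a, replica_b])
print(winner)  # "stale": the older write wins because of clock skew
```

Both replicas "converge", which is all eventual consistency promises, but they converge on the wrong answer; that is the messiness being described.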
The end result is a strongly consistent system, mathematically proven, that runs over the wide area network, and it's completely resistant to failures of all sorts.

That allows you to create the same type of availability and data consistency as, you mentioned, Google with atomic clocks. Spanner, I presume. Fascinating. I remember when the paper came out, my eyes were bleeding reading it. But that's the type of capability that you're able to bring to enterprises, right?

That's exactly right. We can bring similar capabilities across diverse networks. You can have regular networking gear, time synchronized by NTP; out in the cloud, things are running in a virtual machine, where time is adrift most of the time. People don't realize that VMs are pretty bad at keeping time, and all you get up in the cloud is VMs. Across all those environments, we can give you strongly consistent replication at the same quality that Google gets with their hardware. So that's the value that we bring to the Fortune 500.

So increasingly enterprises are recognizing that data has, I don't want to say intrinsic value, but that data is a source of value in context all by itself, independent of any hardware, independent of any software. That it's something that needs to be taken care of, and you guys have an approach for ensuring that important aspects of it are better taken care of. Not the least of which is that you can provide an option to a customer who may make a bad technology choice one day to make a better technology choice the next day, and not be too worried about dead-ending themselves. I'm reminded of the old days, when somebody negotiating an IBM mainframe deal would put an Amdahl coffee cup in front of IBM, or put an Oracle coffee cup in front of SAP. Do you find customers metaphorically putting a WANdisco coffee cup in front of those different options and saying, these guys are ensuring that our data remains ours?

Customers are a lot more sophisticated now.
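For readers who want the 101 on the consensus machinery mentioned here, a minimal single-decree Paxos round can be sketched as below. This is a textbook toy, not WANdisco's engine, which adds WAN optimizations and multi-decree machinery on top; the point is that agreement comes from majority quorums, not from synchronized clocks.

```python
# Minimal single-decree Paxos sketch (illustrative only).
# Safety rule: a proposer must re-propose any value a majority
# may already have accepted, so a decided value never changes.

class Acceptor:
    def __init__(self):
        self.promised = -1     # highest proposal number promised
        self.accepted = None   # (number, value) or None

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "ok"
        return "reject"

def propose(acceptors, n, value):
    """Run one Paxos round; return the chosen value or None."""
    majority = len(acceptors) // 2 + 1
    # Phase 1: gather promises from a majority.
    replies = [a.prepare(n) for a in acceptors]
    granted = [acc for verdict, acc in replies if verdict == "promise"]
    if len(granted) < majority:
        return None
    # If any acceptor already accepted a value, re-propose the one
    # with the highest proposal number (the safety rule above).
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior, key=lambda a: a[0])[1]
    # Phase 2: ask acceptors to accept the value.
    oks = sum(1 for a in acceptors if a.accept(n, value) == "ok")
    return value if oks >= majority else None

acceptors = [Acceptor() for _ in range(5)]
r1 = propose(acceptors, n=1, value="replica-set-A")
r2 = propose(acceptors, n=2, value="replica-set-B")
print(r1, r2)  # both rounds decide "replica-set-A"
```

Note that the second proposer, arriving later with a different value, is forced to adopt the already-decided value; that is what makes all replicas agree regardless of timing.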
The scenarios that you pointed out are very, very funny, but what customers come to us for is the exact same thing. The way they ask it is: I want to move to cloud X, but I want to make sure that I can also run on cloud Y, and I want to do it seamlessly, without any downtime for the on-prem applications that are running. We can give them that. Not only are they building a disaster recovery environment, often they're experimenting with multiple clouds at the same time, and may the better cloud win. That puts a lot of competition and pressure on the actual clouds they're trying. That's the manifestation, in modern cloud terms, of the competitor's coffee cup that you just pointed out. Very funny, but this is how customers are doing it these days.

So, given that you're able to replicate large volumes of data with strong consistency, are you starting to see customers redesign how they set up their technology plant based on that capability?

Absolutely. When customers were talking about hybrid cloud, which was pretty well hyped a year or so ago, they basically had some data on-prem and some other data in the cloud, and they were doing stuff. What we brought to them is the ability to have the same data both on-prem and in the cloud. Maybe you had a weekly analytics job that took a lot of resources. You'd burst that out into the cloud, run it up there, and move the result of that analytics job back on-prem, with strong consistency. The result is that true hybrid cloud is enabled only when you have the exact same data available in all of your cloud locations. We're the only company that can provide that. So we've got customers who are expanding their cloud options because of the data consistency we offer.

And those cloud options obviously are increasing.

They are.
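The cloud-bursting pattern just described can be sketched abstractly: keep the on-prem and cloud copies of a dataset identical, run the heavy job wherever capacity is available, and let the result replicate back. All names here are invented for illustration; the `Replicator` is a toy stand-in, not a real WANdisco API.

```python
# Toy sketch of cloud bursting over strongly consistent replication.
# A real engine would reach consensus before acknowledging a write;
# here every site is updated atomically in one step to model that.

class Replicator:
    """Stand-in for a strongly consistent replication layer."""
    def __init__(self):
        self.sites = {"on_prem": {}, "cloud": {}}

    def put(self, key, value):
        # Same data lands in every location, as described above.
        for store in self.sites.values():
            store[key] = value

def weekly_analytics(data):
    return sum(data) / len(data)   # placeholder for the heavy job

rep = Replicator()
rep.put("sales_data", [10, 20, 30, 40])

# Burst: the job reads the cloud replica; its result, written through
# the replicator, is immediately visible on-prem as well.
result = weekly_analytics(rep.sites["cloud"]["sales_data"])
rep.put("weekly_report", result)
print(rep.sites["on_prem"]["weekly_report"])  # 25.0
```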
But there's also a recognition, as we gain more experience with cloud, that some workloads are better suited than others to move up there. Now Oracle, with some of their announcements last week, may start to push the envelope on that a little bit. But as you think about where the need is for moving large volumes of data with strong consistency, what types of applications do you think people are focusing on? Is it mainly big data? Or are there other application styles or job types that you think are going to become increasingly important?

So we've got much more than big data. One of the big sources of leads for us now is our capability to migrate NetApp filers up into the cloud, and that has suddenly become very important. An example I like to give is a big financial firm that has all of its binaries, applications, and user data in NetApp filers. The actual data is in HDFS on-prem. They're moving their binaries from the NetApp up into a specific cloud vendor's equivalent of the filer, and the big data part of it from HDFS up into cloud object storage. We are the only platform that can deal with both in the strongly consistent manner that I've talked about, and we're a single replication platform. That gives them the ability to make this sort of migration with very low risk. One of the attributes of our migration is that we do it with no downtime. You don't have to take your on-prem environment offline in order to do the migration. So they are doing that, and we see a lot of business from that sort of migration effort, where people have data in NAS filers or in other non-HDFS storage systems. We're happy to migrate all of those. Our replication platform approach, which we've taken in the last year and a half or so, is really paying off in that respect.

And you couldn't do that with conventional migration techniques, because it would take too long. You'd have to freeze the applications.
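The no-downtime migration being described generally follows a bulk-copy-then-replay pattern: copy the bulk of the data while the source stays live, capture writes that arrive mid-copy, replay them until the replicas converge, then cut over. The sketch below is a toy model of that pattern under those assumptions, not WANdisco's implementation.

```python
# Toy model of zero-downtime migration: bulk copy, delta replay, cutover.

source = {"f1": "v1", "f2": "v2"}      # live on-prem filesystem
target = {}                             # cloud object store
pending = []                            # writes arriving mid-migration

def live_write(key, value):
    """The source keeps serving writes; each one is captured for replay."""
    source[key] = value
    pending.append((key, value))

# Phase 1: bulk copy while the source stays online.
target.update(source)
live_write("f3", "v3")                  # a write lands during the copy

# Phase 2: replay the captured deltas until source and target converge.
while pending:
    key, value = pending.pop(0)
    target[key] = value

assert source == target                 # replicas agree: safe to cut over
print("cutover ready:", sorted(target))
```

The applications never stop writing; only the final cutover switches them to the new copy, which is why the approach avoids the freeze that conventional migration requires.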
A couple of things. One, you'd probably have to take the applications offline. Second, you'd be using periodic synchronization tools such as rsync, and anybody in DevOps or operations who has ever used rsync across the wide area network will tell you how bad that experience is. It really is a very bad experience. We've got the capability to migrate NetApp filer data without imposing a load on the NetApps on-prem, so we can do it without pounding the NetApp servers so hard that they can't offer service to their existing clients. There's very low impact on the network configuration and application configuration. They can go in and start the migration without downtime. Maybe it takes two or three days for the data to get up there because of the WAN link. After that is done, you can start playing with it up in the cloud, and you can do the cutover seamlessly, so there's no real downtime. That's the capability we're offering.

But you also mentioned at least one data type, binaries, that can't withstand error propagation.

Absolutely.

And so being able to go to a customer and say, you're going to have to move these a couple of times over the course of the next N months or years as a consequence of the new technology that's now available, and we can do so without error propagation, is going to have a big impact on how well their IT infrastructure, their IT asset base, runs in five years.

Indeed, indeed, that's very important. Having the ability to actually start the application, having the data in a consistent and true form so you can start, for example, the database and have it mount the actual data and use it up in the cloud: those are capabilities that are very important to customers.

So there's another application, if you think about it. You tend to be more bulk. The question I'm going to ask is, what is the lower threshold in terms of specific types of data movement? Here's what I'm asking.
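The standard defense against the error propagation discussed here is an end-to-end integrity check: compute a digest at the source and verify it at the destination, rather than trusting each intermediate hop. A minimal sketch, with invented helper names, might look like this:

```python
# Sketch of an end-to-end integrity check for multi-hop transfers.
# The hop simulation and helper names are hypothetical.

import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def transfer(data: bytes, corrupt: bool = False) -> bytes:
    # Simulate one network hop; flip a bit if the hop is faulty.
    if corrupt and data:
        data = bytes([data[0] ^ 0x01]) + data[1:]
    return data

payload = b"binary application image"
expected = digest(payload)          # computed once, at the source

# Route through three hops, one of which silently corrupts a bit.
received = transfer(transfer(transfer(payload), corrupt=True))

if digest(received) != expected:
    print("corruption detected; retransmit")  # caught end to end
```

A per-hop checksum would have passed at every faulty link's boundary if the corruption happened in memory before forwarding; only the source-computed digest catches the damage regardless of where it occurred, which is why binaries in particular need this guarantee.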
IoT data is a data source, a use case, that often has the most stringent physical constraints possible.

Correct.

Time, the speed of light, has an implication, but also, very importantly, this notion of error propagation really matters. If you go from a sensor to a gateway to another gateway to another gateway, you will lose bits along the way if you're not very careful.

Correct.

In a nuclear power plant, that doesn't work that well.

Yeah.

Now, we don't have to just look at a nuclear power plant as an example, but the rise of industrial IoT is starting to dramatically impact not just life-and-death circumstances but business success or failure. So what types of smaller-batch use cases? Do you guys find yourselves operating in places like IoT, where this notion of error control, of strong consistency, is so critical?

So one of the most popular applications that uses our replication is Spark, and Spark Streaming, which as you can imagine is a big part of most IoT infrastructure. We can do replication such that you ingest into the closest data center. You go from your server or your car or whatever to the closest data center; you don't have to go multiple hops. We will take care of consistency from there on. What that gives you is the ability to say, I have 12 data centers with my IoT infrastructure running; one data center goes down, and you don't have any downtime at all. It's only the data that was generated inside that data center that's lost. All client machines connecting to that data center will simply connect to another data center, and strong replication continues. This gives you the ability to ingest at very large volumes while still maintaining consistency, and IoT is a big deal for us, yes.

We're out of time, but I've got a couple of last-minute questions if I may. When you integrate with IBM and Oracle, what kind of technical issues do you encounter? What kind of integration do you have to do? Is it lightweight, heavyweight, middleweight?
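The client-side half of that ingest pattern can be sketched as follows: write to the nearest healthy data center, and if it goes down, fall back to the next closest, trusting the replication layer to keep the surviving replicas consistent. Names and latencies below are invented for illustration.

```python
# Sketch of nearest-data-center ingest with failover (hypothetical
# names; the replication layer itself is out of scope here).

data_centers = [
    {"name": "us-east", "latency_ms": 12, "up": True},
    {"name": "us-west", "latency_ms": 48, "up": True},
    {"name": "eu-west", "latency_ms": 95, "up": True},
]

def ingest(event, centers):
    """Send the event to the closest healthy data center."""
    for dc in sorted(centers, key=lambda d: d["latency_ms"]):
        if dc["up"]:
            dc.setdefault("events", []).append(event)
            return dc["name"]
    raise RuntimeError("no data center available")

dc1 = ingest("sensor-reading-1", data_centers)
data_centers[0]["up"] = False           # us-east goes down
dc2 = ingest("sensor-reading-2", data_centers)
print(dc1, dc2)  # first write lands us-east, failover lands us-west
```

This mirrors the behavior described in the interview: losing one data center loses only the in-flight data generated inside it, while clients transparently reconnect elsewhere and ingest continues.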
It's middleweight, I would say. IBM is a great example. They have deep integration with our product, and some of the authentication technology they use was more advanced than what was available in open source at the time. We did a little bit of work, and they did a little bit of work, to make that work, but other than that it's a pretty straightforward process. The end result is that they have a number of applications where this is a critical part of their infrastructure.

Right, and then the roadmap. What can you tell us about it? What should we look for in the future? What kind of problems are you going to be solving?

So we look at our platform as the best replication engine in the world. We're building an SDK, and we expect custom plugins for other applications. We expect more high-speed streaming data, such as IoT data. We want to be the choice for replication. As for the plugins themselves, they're getting easier and easier to build, so you'll see wide coverage from us.

Right. Jagane, thanks so much for coming to theCUBE. Always a pleasure to have you.

Thank you for having me.

You're welcome. All right, keep it right there, everybody. We'll be back to wrap. This is theCUBE. We're live from NYC. Right back.