Live from New York City, it's theCUBE at Big Data NYC 2014, brought to you by headline sponsor WANdisco, with support from EMC, MarkLogic, and Teradata. Now, here is your host, Dave Vellante.

Welcome back to Big Data NYC. This is theCUBE. I'm Dave Vellante, and I'm pleased to have Jim Campigli here. He's the Chief Product Officer and one of the co-founders of WANdisco, and a longtime CUBE alum. Jim, welcome back to theCUBE. It's good to see you again.

Thanks, Dave. Appreciate it.

So we were just talking offline about the show down at Javits. It's a much bigger venue this year. You said about 6,000 people?

About 6,000 people. When they had it over here at the Hilton in Midtown, it was about 3,300, something like that. So it's a much bigger show, a lot more going on.

So a lot of growth. I understand the average age is trending toward my age.

Well, you can imagine, as Hadoop becomes more mainstream, enterprises adopt it and expect it to have the same levels of reliability they're used to in their data centers with traditional relational databases like Oracle and DB2. The people implementing and maintaining those databases are probably about our age.

Independent of WANdisco and your active-active mojo, financial services seems to really be adopting Hadoop generally. I want to come back and talk about WANdisco specifically, but financial services has always led a lot of tech innovation. Why Hadoop? Why the action?

Well, if you look at financial services, any of the big banks really fit the profile of the three Vs: volume, variety, and velocity. They've had them in spades for decades, and they've had to come up with different approaches to handling and addressing that. Some of the big banks we're working with, for example, are trying to figure out whether or not they're making money in China. They started discovering that the only way to really drill down on some of these things is to look at a lot of the contracts they have, and the contracts are all in PDFs. You can't effectively store PDFs and look at that data in a traditional database; you've got to do something like that in Hadoop (see the sketch below). A lot of image processing goes on in financial services. A lot of different types of data get handled in huge, massive volumes.

So take that use case, if we can. Can we just stay on that a minute? You've got really valuable data locked inside of a PDF.

Right.

How do I go from that to a business outcome?

Going from that to a business outcome basically means I'm able to go through those documents, understand what they mean, and get a profile of whether or not they're working for the institution. And so I can...

In aggregate.

Yeah, and so...

Maybe you'll change your legal policies, maybe you'll change the way you write contracts, the kind of business you do in different parts of the world, that kind of thing. So talk about how Hadoop enables that versus traditional technology. Is traditional technology too expensive? It doesn't have the capabilities? It doesn't have the...

It doesn't handle the different types of data. You're not talking about fixed data in rows and columns, that kind of thing. It's not just text and numbers. You're talking about a wide variety of data that gets processed.
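To ground the PDF example, here's a minimal sketch, using the stock Hadoop Java client rather than anything WANdisco-specific, of landing a contract PDF in HDFS as raw bytes for downstream processing. The NameNode URI, file names, and paths are illustrative assumptions.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class PdfIngest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the cluster's NameNode (hostname is illustrative).
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // HDFS stores the PDF as raw bytes; no schema is imposed at load time,
        // which is why unstructured documents fit here but not in a relational table.
        try (InputStream in = Files.newInputStream(Paths.get("contract-2014-001.pdf"));
             OutputStream out = fs.create(new Path("/contracts/raw/contract-2014-001.pdf"))) {
            IOUtils.copyBytes(in, out, 4096, false);
        }
        fs.close();
    }
}
```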
And if you just have something that you access as an attachment through a relational database, obviously there's no intelligence happening against that PDF document.

Now let's talk about the WANdisco piece, in financial services...

Right.

...and then generally. You guys focus on active-active, 24-by-7 resiliency. Talk about that piece of it. And I'm wondering, are there other use cases that you're seeing for your technology?

Yeah. The thing we've talked about most in relation to our patented active-active replication technology is the ability to maintain 24-by-7 availability globally. Effectively, we enable the NameNode to scale. You can have multiple active NameNodes in a cluster, and you can spread those NameNodes across multiple data centers. So you're creating a single namespace, and in effect a single cluster, over a wide area network. The other thing that does for you is it gives every application, regardless of where it's running, subject to whatever selective replication policies you've implemented, local area network speed read-write access to the same data.

So what does that mean? Active-active replication by default means you're getting continuous hot backup of the data. And because it's full read-write, it means you don't have a read-only backup cluster, so you're getting full use of that hardware. Moving beyond that, what you can also do with it is multi-data center ingest, which is otherwise impossible. In a global organization where data is being generated from hundreds of different sources all over the world, what happens today is you have to ingest at one location and very often move the data to another for analysis. Because we've got full read-write access across the entire implementation, regardless of which data center you're running at, we enable you to ingest and analyze anywhere.

So anytime you've got data that's time-critical, where accuracy is important, we would be the solution, especially in that kind of distributed scenario. Typically what happens today is you load it at one location and then use DistCp or some other utility to copy it over to a central data center for analysis. That means administrator overhead to monitor the data movement, and it also means that if any data is lost in transit, not only have you lost time on that data movement, you've also run the risk of inaccurate results. If you're gathering sensor data from industrial machinery, locomotives, oil and gas pipelines, monitoring whether pressure is building up too high in a jet engine, those kinds of things, timeliness and accuracy are critical. You don't have time to wait several hours to move the data, make sure it all moved over there, and then hope it all moved over there, to do an accurate analysis. We make all those problems go away because you can ingest and analyze anywhere.

And people have this misconception: oh, I can run my analytical jobs against my backup cluster because they're not writing any data. That's very often false. Most analytical applications will write data to the Hadoop cluster as part of their processing.
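As a concrete illustration of that last point, below is a hedged sketch of the chained-job pattern Jim describes next, written against the standard Hadoop MapReduce API; the class name and paths are hypothetical. The second job consumes output the first job has just written, which is why pointing such a pipeline at a read-only backup cluster breaks down.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedAnalysis {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path raw = new Path("/data/raw");
        Path intermediate = new Path("/data/stage1"); // written by job 1, read by job 2
        Path results = new Path("/data/results");

        // Job 1: extract. Its output is a WRITE to the cluster,
        // even though the overall workload is "analytics".
        Job extract = Job.getInstance(conf, "extract");
        extract.setJarByClass(ChainedAnalysis.class);
        FileInputFormat.addInputPath(extract, raw);
        FileOutputFormat.setOutputPath(extract, intermediate);
        if (!extract.waitForCompletion(true)) System.exit(1);

        // Job 2: aggregate. It depends on job 1's freshly written output,
        // so this pipeline cannot run against a read-only cluster at all.
        Job aggregate = Job.getInstance(conf, "aggregate");
        aggregate.setJarByClass(ChainedAnalysis.class);
        FileInputFormat.addInputPath(aggregate, intermediate);
        FileOutputFormat.setOutputPath(aggregate, results);
        System.exit(aggregate.waitForCompletion(true) ? 0 : 1);
    }
}
```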
It's very common to have a series of MapReduce jobs that write output to be picked up by the next job in the process. That means we're the only solution when you have that scenario.

So the backup data is out of date in that scenario.

Right. And if you try to do that, you've also thrown your backup cluster out of sync with your primary cluster. What that means to a bank, for example, is the regulators will shut you down until you get it sorted out.

So is the high-speed data mover an oxymoron? Is the data mover dead?

I think people are still going to use it, but anytime you have an application where you're gathering a lot of data from multiple sources, and it's time-critical and accuracy is important, it's probably not going to do what you need it to do.

Yeah, and as volumes of data grow, the last thing I want to do is be moving data around.

And with us, because of the active-active replication, you don't have to worry about it. We automatically do that. The data converges everywhere, and that means you're not having to worry about clusters getting out of sync, because we enforce that as well. But we also enforce selective replication. So if data can only reside in a certain country, I'll go back to the banks: if you're an international bank operating branches in Argentina, and Argentina says that data can't leave the country, then of course you can't move the data out of Argentina, but what you can do with our software is access that data to do a global roll-up analysis, for example. So we enable those kinds of use cases as well.

Well, that's very much in parallel with the concept of Hadoop: ship the five megabytes of code to the petabyte of data; don't try to move the data. Now, the value you bring beyond that concept is 24-by-7, active-active. So I wonder if you can talk in a little more detail about the customers you're working with. You don't have to name names, but the types, the maturity model, if you will. Where do they start, where do they go, and where do you want to take them? Where do they want to take you?

2013 and 2014 saw a lot of companies trying Hadoop out, getting a pilot project going, seeing some real value in it, but then wanting to deploy it in their data center. And of course they've got internal audit requirements and regulatory requirements around availability, security, and data access. All those things impact whether you can deploy Hadoop in a production environment and use it for anything business-critical. So what's happening now is companies are looking for things that address reliability and security with Hadoop. The other thing they're looking for, and it's not just financial services, it's also industrial equipment manufacturers, the example I gave about all the different industrial sensors, the whole Internet of Things concept: they're realizing that data is being generated all over the place, and they really can't afford to run the risk of administrator error and time delays with the approaches they're currently using. So we come in and address those problems.

Do you have any competition?

There are other solutions, and I'm not going to name names, that present themselves as being active-active, but when you drill down on it, they're really multi-master, eventual-consistency solutions.
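WANdisco doesn't detail its coordination engine in this interview, so the following is only a toy illustration of the general principle Jim contrasts with multi-master replication: if a consensus protocol fixes one global order of operations, every replica that applies that log converges to the same state, and a replica that was offline simply replays what it missed. The sites, the log contents, and the map-based "namespace" are all invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OrderedReplication {
    /** A trivially simple "namespace": path -> owner. Illustration only. */
    static class Replica {
        final Map<String, String> namespace = new HashMap<>();
        int applied = 0; // index of the next agreed log entry to apply

        void catchUp(List<String[]> agreedLog) {
            // Apply only entries this replica hasn't seen yet, strictly in log order.
            for (; applied < agreedLog.size(); applied++) {
                String[] op = agreedLog.get(applied); // [path, owner]
                namespace.put(op[0], op[1]);
            }
        }
    }

    public static void main(String[] args) {
        // In a real system, a consensus protocol (e.g., Paxos) builds this
        // single ordered log; no server is ever "master" for a transaction.
        List<String[]> agreedLog = new ArrayList<>();
        Replica nyc = new Replica(), london = new Replica();

        // Clients submit at different sites; consensus fixes one global order.
        agreedLog.add(new String[]{"/contracts/q3.pdf", "risk-team"});  // submitted via NYC
        agreedLog.add(new String[]{"/contracts/q3.pdf", "audit-team"}); // submitted via London

        // A replica that was down simply replays the log when it returns.
        nyc.catchUp(agreedLog);
        london.catchUp(agreedLog);
        System.out.println(nyc.namespace.equals(london.namespace)); // true: guaranteed convergence
    }
}
```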
So what I'm basically saying is: if an application initiates a transaction, makes a change, and it's connected to one server, then that server in effect becomes the master, at least for the life of that transaction, and it's responsible for replicating it to all the other servers in the implementation. If the server where that write transaction originated goes down during that process, that transaction didn't get consistently replicated across the cluster. Then you end up in scenarios where you've got to do a lot of consistency checking, and the administrator has to get involved in re-syncing the implementation. You'll hear those solutions described as eventual consistency, not guaranteed consistency. So there are other solutions out there that claim to be active-active, but they're really multi-master.

In our case, we're true peer-to-peer. If you're connected to a particular server at any location, and you're updating a file or changing the metadata on the NameNode, we take care of replicating that consistently across the entire deployment, across all your data centers, as well as the underlying data blocks; we address that replication as well. The point is you don't run into the problems where you don't have consistent data access; that's handled by us automatically. If one of the nodes fails, say the NameNode the client was connected to goes down before the transaction is replicated to all the other nodes, the remaining nodes in the implementation take care of propagating it to maintain consistency. And when any of the NameNodes come back online, they re-sync automatically. The administrator doesn't have to do anything.

People use that not only to recover from outages, but also when they want to take servers offline for maintenance while continuing to provide service to their users. They take the server they're performing maintenance on offline, the remaining servers continue supporting client requests, and as soon as it's restored, it automatically re-syncs with the rest of the cluster and you keep moving. So you can have 24-by-7 worldwide with our solution, which isn't the case with other products.

So planned maintenance doesn't cause planned outages?

Yeah, no planned downtime.

Remember the concept of unplanned downtime versus planned downtime? Planned downtime is not okay in this day and age, but it's really all about recovery and continuous availability. And so what you're saying is you're eliminating a lot of the forced aspects of recovery of alternative solutions, and you're guaranteeing data quality, data consistency. Interesting. Okay, let's see, what else? You mentioned business use cases, and it's true enough there's a lot of them evolving out there. I'm sure some of your other guests have been talking about them.

What's happening in a lot of organizations right now is, at the C level, the IT infrastructure folks are saying: we've got to get Hadoop in here, we know we're going to have requirements for it, and the line-of-business people will come to us. So the IT infrastructure people are looking for something that's cost-effective, number one. And number two, they know they're going to get measured on quality of service. How reliable is the implementation? How well does it perform?
How well does it scale? All those kinds of things are going to be areas they have to deliver on. There are other ways we address that, and this makes it kind of a horizontal solution in that respect, because all industries are going to have the same set of requirements as they deploy this for mission-critical applications: things we do that are byproducts of the active-active replication.

Let's talk about keeping costs down. You don't have an investment in idle backup standby hardware waiting for a disaster to occur before you bring it into action. It's full read-write everywhere, and it's continuously synchronized, so you have that going for you. The other thing we're seeing a lot of companies use our software for is something called cluster zoning.

Think about what we actually did, and I'm grossly oversimplifying here: when we applied our active-active replication to let you deploy a single Hadoop cluster over a wide area network, we added a new attribute on top of the rack ID. Every node in a Hadoop cluster has a rack ID associated with it, and the NameNode knows the rack ID. The idea is that, for added resilience, you replicate in threes, and those three replicas should preferably be on separate racks; of course, that's within a single data center. We took that notion up a step and added the attribute of a data center ID, so you can selectively replicate data across data centers (the rack-awareness mechanism this builds on is sketched below).

And rather than having that second data center live on the other side of the world, you can redefine it as a zone within one location. Within that zone you can selectively replicate, for example, the data that in-memory analytics applications require, routed to the highest-spec servers. So if I've got intensive data ingest jobs, or in-memory applications cranking through huge transaction volumes, whenever an application needs that data, it gets routed to those high-performance servers, and the rest of the cluster can be commodity boxes, if you will, that handle the run-of-the-mill batch MapReduce kinds of things that aren't so time-critical. If one of those jobs goes down, somebody restarts it, you get your results a few hours later, and it's okay. With that approach you can maintain quality of service for all your users without having to make the whole cluster high-end servers just because you don't know where that in-memory analytics application is going to come in, which is how you end up spending a lot more money. Cluster zoning enables customers to really reduce the hardware costs of the implementation.

Interesting. And that cluster zoning, Jim, happens all within a synchronous...

Yeah, you can think of it as almost a virtual cluster within a cluster. You can have separate data centers dedicated to different applications, or you can create this virtual cluster within a cluster within a single data center. Say I've got high-end in-memory applications running at the same time as batch MapReduce jobs: the data those high-end applications need is going to live on the high-spec servers.
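For reference, here is a hedged sketch of the stock Hadoop rack-awareness hook that the data center ID idea builds on; this is not WANdisco's implementation. Topology paths in Hadoop are hierarchical strings, so a zone or data-center level can sit above the rack level. The hostname convention parsed here is purely hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.net.DNSToSwitchMapping;

public class ZoneAwareTopology implements DNSToSwitchMapping {
    @Override
    public List<String> resolve(List<String> names) {
        List<String> paths = new ArrayList<>();
        for (String host : names) {
            // Hypothetical naming scheme: <zone>-<rack>-<node>, e.g. "hispec-r07-n14".
            String[] parts = host.split("-");
            String zone = parts.length > 0 ? parts[0] : "default";
            String rack = parts.length > 1 ? parts[1] : "rack0";
            // A two-level path: zone (or data center) above rack, e.g. "/hispec/r07".
            paths.add("/" + zone + "/" + rack);
        }
        return paths;
    }

    @Override
    public void reloadCachedMappings() { /* nothing cached in this sketch */ }

    @Override
    public void reloadCachedMappings(List<String> names) { /* no-op */ }
}
```

In Hadoop 2.x, a mapping class like this would be registered through the net.topology.node.switch.mapping.impl property in core-site.xml, and the NameNode then uses the returned paths when placing the three replicas.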
But I only need as many of those servers as it takes to support those applications. I don't have to turn the whole cluster into a very expensive proposition, with every server specced to support the most demanding applications. I can get away from that.

So you're taking this active-active concept, and now you and your customers are starting to find new ways to use it.

Absolutely.

What else are they asking you for that you're not delivering today, that we should be thinking about in the future?

Some of the other things we're looking at, and we did a press release today, is integration of different Hadoop distributions. We're working on products that will be coming out; we're actually demoing alpha versions of them this week at Strata, over at the Javits Convention Center. One of the big ones is a unification solution that basically lets you do what we do today for a single-distribution cluster over a wide area network, but for a mix of different distributions, where you could integrate Hortonworks and Cloudera, for example, or even Cloudera and MapR. As long as that Hadoop-compatible platform has an HDFS API, we can plug into it and do the same thing we do for the standard Apache-compliant distros today (see the sketch below).

Interesting. When we first met you, you had a Hadoop distro, which I always found quite fascinating, and I realized that what that allowed you to do was have juice in the community, which you guys have always had. I'm sure you're happy you're not in that business anymore, but it's interesting to watch what's happening there, all the money that's being raised and the race to knock each other's heads together. You guys have picked a niche that is really specialized, without a ton of direct competition, though maybe there are other ways of doing it. How do you see that piece growing over time? Today it's obviously a small percentage of the overall requirements, but as people bring Hadoop into production systems, is it increasing as a percentage of that marketplace, or is it just riding the tide of Hadoop growth? What do you see there?

I think what's happening is both. It's riding the tide of Hadoop growth, but it's also increasing as a percentage. Companies discover there's more value than just active-active replication over a WAN, thinking of us as a failover and disaster recovery solution, a high-end one, albeit one that nobody else can touch with the technology we have. They're starting to see value in things like cluster zoning and multi-data center ingest. If I'm a big industrial equipment manufacturer capturing all this data from sensors attached to jet engines, appliances, locomotives, oil and gas pipelines, whatever it may be, I realize I'm collecting massive amounts of data, and I want to be able to analyze it quickly, because if there's any downtime, number one, I may miss critical information, and if the data takes too long to move to the point of analysis, it loses its value. So they're starting to see that it's about more than just being up all the time. It's about multi-data center ingest and analyzing anywhere, keeping your hardware costs down, and maintaining the highest quality of service for all your users' applications through cluster zoning. Just have the number of high-end servers you need to support the most demanding applications.
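A small sketch of why that HDFS API compatibility matters: any platform exposing the Hadoop FileSystem API can be driven by the same client code, whichever distribution sits behind it. The NameNode hostnames here are made up for illustration.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CrossDistroList {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Each URI could point at a different distribution's cluster;
        // the client code is identical because both speak the FileSystem API.
        for (String uri : new String[]{"hdfs://hortonworks-nn:8020", "hdfs://cloudera-nn:8020"}) {
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(uri + " -> " + status.getPath());
            }
            fs.close();
        }
    }
}
```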
And then we talk to a lot of companies where, with what they're doing today, whenever they have to load massive amounts of data into the cluster, everything stops for half an hour; they just can't do anything else. Zoning is something that can address that problem for them.

But you guys have a huge play in Internet of Things, especially for mission-critical infrastructure. Now you're into issues of national security, and obviously there are other sorts of analytic applications. Is there real demand for that today, or is it still early?

No, I think it's one of those things where you don't know what you're missing until you're aware of it. People start to understand: I have this kind of problem, and I'm just doing this. I'm loading it here, I pay administrators to move it over there, monitor it, make sure it all gets there, deal with any data that's missing, make sure everything's consistent. It takes a long time. Sometimes I have successes, sometimes I don't, and when I don't, I miss critical information. They're just sort of living with it. But when you make them aware of what they can actually do, it opens up a whole number of possibilities.

We're working with different government health agencies that have noticed things like: when the power goes out in a certain region, all of a sudden food-borne illnesses start to rise. It's one of those things you don't think about, but it's kind of intuitive why that would happen; restaurants and grocery stores are still handing out that spoiled food, trying to sell it. If government agencies can get in there and realize they've got to stay on top of this, they can be prepared for people coming into the emergency room, all those kinds of use cases.

Prepare for it, quantify it, maybe eliminate it, that kind of thing. And obviously Internet of Things, industrial sensors, that kind of thing.

Obviously, if you know right away that pressure is building up in a gas pipeline, or that a jet engine is indicating some other problem, you see it happening and you can do something about it right at the time.

Jim, always fascinating conversations with the folks from WANdisco. Really appreciate your time coming on theCUBE. We've got to go; we're going to set up for our Capital Markets event, which starts at four o'clock, right here at the Hilton Times Square. Four to six is the Capital Markets event, and then we go into the five-year celebration of theCUBE at Hadoop World. Jim, thanks so much for all your support and for coming on theCUBE.

I appreciate your time, and thanks for interviewing me today.

All right, you're welcome. Keep it right there, everybody; we will be back. We're going to take a break, and then we'll go live at 4 p.m. East Coast time with the Capital Markets event. Jeff Kelly will be presenting new data, and then we've got a great panel. Thanks for watching, everybody. We'll be back right after this word.