Live from the Fairmont Hotel in San Jose, California, it's theCUBE at Big Data SV 2015. Okay, welcome back everyone. You're watching theCUBE live in Silicon Valley for Big Data SV. It's our flagship program, theCUBE. We go out to the events and extract the signal from the noise. I'm John Furrier with my co-host Jeff Frick. Our next guest is Brett Rudenstein, the director of product management at WANdisco. Welcome to theCUBE. Thank you. Good to see you again. So, we love having you guys on theCUBE. Great supporters, thanks so much for supporting our mission. Great to see you all at the party and the event we had the other night, it's been fantastic. So, talk about the show here, what you guys are doing. What product stuff is going on that you guys are talking about specifically? Obviously, there's a lot of stuff happening. You know, the enterprise has embraced Hadoop. That's a done deal. What's next? What's going on in the minds of the customers? Deployments are getting bigger. There's a big rush. The Open Data Platform, big news around Pivotal, Hortonworks, Cloudera, MapR, and you guys are there at the table as well. All the players are moving forward fast. It's a game of speed, and the enterprise is moving fast. What's happening right now and what's coming down the pike? Yeah, and I think if I contrast it with what was happening, you know, going back a year, even two years ago, we were telling people, you know, you really need active-active, because here's what active-active brings you. I think what's changed, and what happens now, is that people are saying we need active-active. They're telling us they need active-active, which is of course great for us. What that really means, though, of course we've had our nonstop, you know, our flagship product was our Non-Stop NameNode, the NameNode that allows you to seamlessly replicate data across multiple geographic territories at any distance. We are of course expanding that offering, allowing you now to replicate between different distributions of Hadoop. We're expanding that offering into HBase, expanding into a Non-Stop HBase that allows you to ingest in multiple locations for HBase, across any geographic distance, with the same level of active-active capability. So it was really about how do I not only just keep the data centers up and running, but how do I maximize the potential, how do I maximize the resource utilization for those data centers? And those are the conversations that we're having. What about the technology innovation? What does the disruption enable? Obviously always-on, you know, replication, having basically multiple data centers talk to each other. That's a trend we're seeing in the virtualization space. Now in big data, as data moves around, what do customers deal with right now? What are the top three priorities? You know, I think, you know, security is a top priority. The consistency of the data is an absolute priority, and then of course maximization of the resources that they're using is a top priority. So Jeff, what's your take on this? Because you've been talking with Treasure Data and some others in the industry, we've seen all the questions. Machine learning's hot. What's your observation? It's just fun now. It's not really a question anymore; it's how. How do I get started? Where do I get started? How do I move? And now we're really moving into the business conversations. It still seems kind of early days for some of them in terms of the trials of getting it up and getting it going.
But clearly, you know, we had you on at Big Data NYC in 2013, which, as somebody showed me the other day, I remember way back in 2013, and, wow, how the world has changed and really progressed since then. It's good to see. Yeah, so talk about the continuous message you guys have had, the message you guys have been banging on. Any updates there in terms of the product side? Well, you know, I was just kind of alluding to some of them. Of course, we continue to evolve the main product, the Non-Stop NameNode product. We have deeper integrations into Cloudera Manager, deeper integrations into Ambari, and, you know, we're offering our own UIs to give you a global view of the data centers and of the Hadoop that spans multiple data centers. But as I mentioned, you know, we're continuing to innovate. We're bringing that technology and that capability to HBase. And we're also bringing that technology to allow you to do this kind of replication between different Hadoop distributions. And this will really be very interesting because... Talk about the success you've had, because this is interesting. You guys have had a great journey. And it's been a turbulent time for the industry. People have distros, they don't have distros. You guys have had an interesting approach. Your partnering has been very specific, on a technical product basis. Obviously Cloudera, Hortonworks, Oracle, IBM, and Apache. How has that worked out for you? And what have you guys learned from that? And what's the update? I mean, you're on all those platforms, right? So what's coming out of that from a customer standpoint? Is it more mix and match, choice? What have you guys learned from that? And what are the big to-do items for you? I mean, I think the thing we understand, or what we've learned, is that, you know, choice is a good thing. You know, customers want choice. They want to be able to run on any distribution they want. By us offering our product on all those different distributions, it gives them the choice. When they need active-active, when they're ready for that, they come to us. Where we need to go with that, and what we have done with the AltoStor product that I think Jagane talked about earlier, is to be able to do that same level of data replication between different vendor distributions. And that, of course, opens up a whole new realm of possibilities. So I think we talked about this last time, but it came up again in another conversation at an IBM event we were at, you know, the flooding in New York City. Quite a lot of people were asking about it; certainly it hit the financial district, so there were some real issues there. People were going, like, three days without sleep. And the issue was, you had to move off the data center, because there were a lot of problems, because you had flooded data centers. That highlights this replication piece. What are customers doing today in terms of prevention, so that if they need to, they can put up another data center, and what role do you guys play in that? I mean, I'll talk to the kind of thing we help with. And, you know, when you talk about areas like New York, especially in the wake of things like a Hurricane Sandy, where both New York and the backup data centers in New Jersey would go down, of course, when those two go down, everybody's out. One of the limitations, of course, is distance, and not only is distance a problem, but obviously the consistency of the data.
What our technology really helps people do is it allows them to have a strongly consistent metadata system and a strongly consistent data system, for Hadoop, HDFS that is, and to take that and span it across any geographic distance. So now, instead of trying to do these things at metropolitan scope, people can do them at continental scope or cross-continental scope. What do customers struggle with the most in this area, where you find you guys solve a problem that's unique to you, that's differentiated? I think it's consistency. You know, sometimes people come up to us and they say, oh, you do backup and disaster recovery. I look at it a little bit differently. I think disaster recovery and backup are sort of a byproduct of what we do. What we allow people to do is to have a consistent namespace across any territory and therefore be able to fully utilize resources. That means ingest in all locations that participate. That means run against the data and get current results. It allows you to open up things like Lambda architectures, where you can have a batch zone or a batch cluster and a real-time zone, and truly get the most current, up-to-date answer to the question that you're asking the system. So what have you got for us today? We always love having you on. You've got something new and different to show us. So what are we going to see today? You know, I think I've always kind of done failure demos, so, you know, I break things and things still continue to run. So why should this be any different? Show something running. Don't get anywhere near the production desk, they don't like it. I would like to take at least a slightly different approach to that. What I'd like to do is kind of talk about, you know, what it means to be globally consistent, what that really allows you to do, some of the things that we've just talked about, and in addition, you know, how you actually manage a system that's globally consistent. How do you manage, you know, two Hadoop clusters, three Hadoop clusters that span multiple geographic territories? So I guess I'll... So are the guys ready, you guys ready? Yeah, we're good. All right, excellent. So I'll start with, if you're looking at the screen here, you can see I'm in Cloudera Manager. And one of the things, it's a very simple cluster here to keep the visual representation simple, but one of the things that you'll notice is that there's an NSNN, Non-Stop NameNode, service directly inside of Cloudera Manager. And when you look at that service, it gives us visibility into each one of the NameNodes and it gives us the ability to configure all those NameNodes. So now managing our system can be done directly through the Cloudera Manager interface. We also like to offer a globally consistent view of the entire system, so we've offered our own dashboard. And here you see it; it's, you know, sort of Cloudera-themed in this particular case. But rather than looking at a single data center, what it allows us to do is look across multiple data centers and see outliers. For instance, all the NameNodes in our system are using approximately the same amount of heap. They're using the same amount of HDFS capacity, because I'm essentially replicating everything. But we do see some outliers here. For instance, in data center two and data center one, two of the nodes have very high CPU. And this lets people know that there are some issues happening in that cluster and that they can act on those things.
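The dashboard shown in the demo is WANdisco's own, but as a rough illustration of what watching for cross-data-center outliers can look like from the command line, here is a minimal, hypothetical sketch that polls each NameNode's standard JMX servlet and prints heap usage side by side. The hostnames are made up, and it assumes the Hadoop 2.x default NameNode HTTP port (50070); none of this is taken from the product itself.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: compare heap usage across NameNodes in several data
# centers by polling the standard Hadoop JMX servlet on each one.
# Hostnames are illustrative; port 50070 is the Hadoop 2.x default NameNode
# HTTP port and may differ in your deployment.

NAMENODES="nn1.dc1.example.com nn1.dc2.example.com nn1.dc3.example.com"

for NN in $NAMENODES; do
  # The qry parameter filters the JMX output down to the JVM memory bean.
  USED=$(curl -s "http://${NN}:50070/jmx?qry=java.lang:type=Memory" \
    | python -c 'import json,sys; print(json.load(sys.stdin)["beans"][0]["HeapMemoryUsage"]["used"])')
  echo "${NN}: heap used = ${USED} bytes"
done
```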
We also, of course, make this available, in addition to Cloudera, for the Hortonworks stack as well. So here, if you look at the Ambari interface, you can see all of the healthy NameNodes over here. In addition, in our Non-Stop UI, you see the same kind of data, the same kind of outlier information. We give people global control over access to their data. So for instance, in our selective replication tab, they can choose which directories replicate into which locations. So for instance, I could easily create a new rule that took this particular directory, user/brett, and then maybe this private directory, and say that it only replicates to data center one and data center three, that it's allowed read access globally, what we call WAN read, or that it's not allowed to do that. And by saving that, it's instantaneously active across all clusters, across all zones. One other view, of course, that we provide is a view of all NameNodes. In this case, you see a world map. The world map essentially shows us all of the nodes that are participating. It shows their geographic location and color-codes whether or not there are any problems with any one of the nodes. And of course I have control over those things. So the last part of the demo that I'd like to show you here is what it then means to be globally consistent. So I've got two windows open on the right-hand side; one of these windows is open in the Northern Virginia data center and the other one is open in the Oregon data center. And I'll show you what global consistency means, and what it fundamentally means is that the application, or the behavior of the application, is unchanged. Meaning whether I'm at LAN scope or WAN scope, whatever I do will have the exact same behavior. So let's do this. I've got a small script here. Let me just kind of cat it out on the screen here. It might be a little hard to read, but basically it's just a put; I'm putting my hosts file into a directory inside of HDFS. And what I'm going to do is I'm going to put that exact same file in both locations at the exact same time. What you'd expect to happen in a globally consistent system is that one of those files will get created and the other will get "file already exists." So let's go ahead and do that. I'll kick off the first one, I'll go ahead and kick off the second one, and try to do them as close to the same time as possible. And of course, one of the things that you see here is that in the top window, because I executed it just slightly ahead of the other one, it actually gets "file created," but there's no eventual consistency in the system, it's strongly consistent, and therefore I got "file already exists." Now, what does that mean on a larger scale? If I run a small MapReduce job, so I'll just do a quick generation of about 400 megabytes of data, as this runs, what you'll be able to see, of course, is that all of the NameNodes will participate in the global replication. HDFS will replicate across the 3,000-mile distance between these two. When the job completes, I'll immediately sort it. What's even more interesting is that some of the blocks actually won't have finished replicating; some of the blocks will still be in flight. And yet I'll still be able to run MapReduce on that in a consistent manner, because it's a global namespace, and therefore I have access to both local blocks and foreign blocks.
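The demo script itself isn't shown on screen, but a minimal sketch of the concurrent put described above might look like the following; the target directory and file name are illustrative assumptions, not the paths used on stage. Run the same script from an edge node in each data center at roughly the same moment: because both clients share one HDFS namespace, exactly one put succeeds and the other reports that the file already exists.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the concurrent-put demo: two data centers try to
# create the same file in the shared HDFS namespace at the same time.
# The target directory and file name are illustrative assumptions.

TARGET_DIR=/user/brett/demo
TARGET_FILE=${TARGET_DIR}/hosts

hdfs dfs -mkdir -p "$TARGET_DIR"

# -put fails with "File exists" if the other data center's put won the race.
if hdfs dfs -put /etc/hosts "$TARGET_FILE"; then
  echo "file created"
else
  echo "file already exists (the other data center created it first)"
fi
```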
While we're waiting, yeah, what does that global namespace mean specifically to this demo? Obviously, a global namespace gives you some hierarchy, but how does it relate to different namespaces, and how do you traverse through that? Yeah, I mean, Hadoop has a namespace. The NameNode is where Hadoop stores all its metadata. And when we say a global namespace, normally what you have is two different clusters; they have two different namespaces, and you're trying to synchronize data between the two. What WANdisco does is provide a single namespace. In other words, it's one HDFS namespace that spans all the different data centers. And the benefit of that is what? Consistency. Never will you have to consistency check and say, you know what, I'm going to spend three days, you know, consistency checking a multi-petabyte cluster and then running a consistency check on the opposite side and trying to see where they might be different. HDFS handles the underlying checksums of the data. The file system is always strongly consistent, and it uses Hadoop mechanisms, HDFS mechanisms, to do so. You can see the job in data center one finished, so I'm going to go ahead and sort it right now. And once we go ahead and kick off the sort, even though we still have data, let's see if there's still any data replicating. Yeah, you can see over here, data center two still has four blocks that are replicating while I started this job. But what will happen, of course, is we will be able to complete the job. And that's what it means to be active-active, to be able to ingest. And this job is writing data into the cluster, so it's now ingesting in the Oregon data center. Blocks are now replicating in both directions. In fact, if I just do a quick refresh on the screen here, you can see, it looks like the blocks in data center two completed, and blocks in data center one are replicating in the opposite direction. It's active-active ingest. That means I can fully utilize cluster resources. And what that means for organizations is, when you have, you know, a million dollars' worth of resource in data center one, and some percentage, all the way up to a million dollars, of resource in data center two that is sitting there idle doing nothing, this is now all available resource for production-level runs. And is that how people measure the ROI of these types of projects? That's certainly one of the measures, right? If that resource is suddenly available, then there is a great cost significance, as opposed to having to add an additional data center, you know, additional compute in one data center. And of course, if you add to one, it means, once again, you have to add to the backup cluster, yeah.
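For readers who want to reproduce something like the generate-then-sort flow described above, here is a hypothetical sketch using the standard Hadoop MapReduce examples jar. Teragen writes 100-byte rows, so about 4,000,000 rows approximates the 400 megabytes mentioned; the jar location and the output paths are assumptions that vary by distribution.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the cross-data-center generate-then-sort demo.
# The examples jar path is typical for CDH-style layouts but varies by
# distribution; output paths are illustrative.

EXAMPLES_JAR=/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar

# Step 1: run from the Northern Virginia cluster -- generate ~400 MB of input
# (teragen rows are 100 bytes each, so 4,000,000 rows is roughly 400 MB).
hadoop jar "$EXAMPLES_JAR" teragen 4000000 /user/brett/teragen-out

# Step 2: run from the Oregon cluster against the same path. Because both
# clusters share a single HDFS namespace, the sort can start even while some
# of the generated blocks are still replicating across the WAN.
hadoop jar "$EXAMPLES_JAR" terasort /user/brett/teragen-out /user/brett/terasort-out
```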
So this job is just about finished. And before it finishes, why don't I go ahead and kill one of these NameNodes to show a failure. I'm in data center two, so I'll go ahead and... I'll just power one of these guys down. It's like a first-person shooter. It's the new game, kill a NameNode, you know, see what happens. I've thought, you know, kill the NameNode would be an interesting one; I think Paxos would be a great name for a video game, right? Just don't lag, you know. Good, good, continue. And that's kind of going to bring us towards the end of the demonstration here. Of course you can see from the global map, one of the nodes is complaining, with a little yellow triangle indicating something's wrong somewhere on the globe. The system is not down. It's still available. It's still available from both locations. The job is still running. In fact, you see that it's just completing now, but I now have an action item to attack, to bring that node back online. Excellent. And so, bottom line, what's the benefit of that demo? Why is that important? Quick summary. Yeah, quick summary. It's important because this is the only technology that allows you to fully utilize all your cluster resources, to allow you to ingest data in both locations, to allow you to create cluster zones, in other words, be able to do those Lambda architectures, those mixed workloads, to have heterogeneous clusters, all operating in the same HDFS namespace. Brett, thanks so much. Really appreciate it. So tell the folks out there who want to see more demos, what resources are available? Obviously it's awesome stuff. What's the URL, the site, shows you've got coming up, road shows, any of the resources out there? We, of course, are at the show today. You know, come by our booth, go to WANdisco.com, and of course, you can always reach us by the regular channels. And of course, we'll be at all the big data shows that are coming up. Brett, great. Brett Rudenstein from WANdisco, thanks for sharing that. This is theCUBE, we're live in Silicon Valley. We'll be right back after this short break. I'm John Furrier with Jeff Frick here, live at Big Data SV, in conjunction with Strata Conference and Hadoop World. We'll be right back.