Live from New York City, it's The Cube at Big Data NYC 2014. Brought to you by headline sponsor WANdisco, with support from EMC, MarkLogic, and Teradata. Now, here is your host, Dave Vellante.

Welcome back to Big Data NYC, everybody. I'm with Jeff Frick, my co-host for the rest of the day here. We've been going wall-to-wall Thursday and Friday. This is The Cube. The Cube is our live mobile studio. We go out to the events and we extract the signal from the noise. I'm going to make a prediction. I'm predicting that all these disparate Hadoop distributions are going to come together in one unified platform, amazingly. And Brett Rudenstein is here from WANdisco to talk about it. We're laughing because WANdisco has a new product that actually makes things a lot easier to manage. We're going to talk about that in a minute. Brett, welcome back to The Cube. It's good to see you again.

Thank you. Good to be here. Good to see you both.

So we were talking, and you said the show's good. We're, of course, at the New York Times Square Hilton; the show's down at the Javits Center, and there's a little bit of traffic between here and there. But what's the show like?

The show's really incredible this year. It's a well-attended show, and traffic at the WANdisco booth has been exceptional. The discussions we're having with customers, the problems we're able to solve for them around the active-active nature of our solution, being able to maximize capacity and utilization in multiple data centers rather than just having standby, idle resources, are really resonating with the crowd. It's also very interesting: there was a talk that Jagane gave yesterday, and I've never seen one of those session halls completely full for our technology. There pretty much wasn't a seat left in the room.
And the questions just kept coming non-stop about how our technology works and how it solves these problems. We had a tremendous amount of interest.

Really? Generally speaking, you guys have always said you solve a really hard problem, but it's kind of a niche in the Hadoop world, and you bet that the market's going to come to you. Is that happening?

Yeah. I think what we've seen is a level of maturation in the community, such that they're no longer thinking only about traditional backup and disaster recovery. That's certainly an aspect of what we're capable of doing, but they're now realizing some of the extended capabilities they get. Not just running active-active, but the ability to have mixed, heterogeneous clusters with machines of different capacities. For example, one of the things we classically see is legacy nodes that maybe only have 48 gigabytes of RAM in them; they're really good for batch-oriented work. But some of the newer machines that people are bringing online have 128 and 256 gigs of memory, and they can't necessarily put them in the same cluster. So they can put them in this concept we have called a zone, which is just our replication technology creating a level of isolation, allowing mixed workloads to span the clusters as a single unified HDFS, one HDFS namespace, while isolating the work in the individual zones.

So you're extending the use case you'd normally think of for WANdisco, and presumably expanding the TAM as well, which is always a good thing. Okay, so you're a demo guy. You're always showing great demos; we've done several with you now, and they're always really interesting. So what are you going to show us today?

I'm going to show you a couple of things, some new products that we're working on right now. We've got something called Data Center Aware YARN.
So we've got a YARN product that's capable of understanding the best and most appropriate location to run jobs when you have active-active clusters and additional capacity, so you can use that capacity without having to think about which resource to use.

So let's talk about why that's important. YARN stands for Yet Another Resource Negotiator, for those of you who don't follow this stuff in great detail. But the nice thing about what you're talking about is, if I understand it, I'm eliminating the need to move big chunks of data to places just to run a job. Is that right?

It's more about the fact that when you have a replicated HDFS, you have the ability to run jobs in either location. The question then becomes: which location should I run it in? Now, we have another capability of our product called Selective Data Replication. That's the ability to determine, while we maintain a 100% consistent namespace, whether or not data leaves a particular location. So for reasons of policy or regulation, where data can't leave one particular data center, we can selectively choose where that data replicates. When it comes to YARN, it's then important to ensure that when you run a job, the YARN resources are executing in the correct data center.

So a good example would be Germany. They've got very strict laws about moving data outside the country. So you can ensure that that particular job, and its resources, will run locally.

Absolutely.

Okay, show us what you've got.

All right. If you're looking at my screen, you see a graphing application, probably not dissimilar from some of the graphing applications you've seen in the past. The only difference this time is that the graphs you've watched in our previous demos showed name node activity. In this case, we are showing the memory utilization and the CPU utilization of the collective node managers in each of the data centers.
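The Selective Data Replication idea described above, a namespace that stays 100% consistent everywhere while block data is withheld from zones a policy excludes, can be sketched roughly as follows. This is a hypothetical illustration only: the policy map, zone names, and function names are invented, not WANdisco's actual API.

```python
# Hypothetical sketch of selective data replication. A policy maps HDFS
# path prefixes to the zones allowed to hold replicas; everything else
# about WANdisco's implementation is assumed, not documented here.

SOVEREIGNTY_POLICY = {
    "/data/de": ["frankfurt"],               # German data may not leave Germany
    "/data/global": ["oregon", "virginia"],  # freely replicated
}

def zones_for_path(path, policy):
    """Return the zones allowed to receive replicas of `path`.

    The longest matching prefix wins, so a narrower rule can override
    a broader one. The namespace itself is visible in every zone; only
    the block data is restricted.
    """
    best = None
    for prefix, zones in policy.items():
        if path.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, zones)
    return best[1] if best else []

def should_replicate(path, target_zone, policy):
    """Decide whether a block of `path` may be shipped to `target_zone`."""
    return target_zone in zones_for_path(path, policy)
```

Under this sketch, `should_replicate("/data/de/customers", "oregon", SOVEREIGNTY_POLICY)` is false: the metadata is everywhere, the blocks stay in Frankfurt.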
So basically the data nodes, and how busy they are. As you can see right now, our Oregon data center is not doing much, and our Virginia data center is not doing much. Here's the scenario I'm going to play out: I'm going to have a bunch of users come into the system and start executing jobs, and we'll watch the system load-balance the work when that's necessary, when we're at capacity in one of those data centers.

So let's start off. I'm going to issue a little script called user1. User1 comes in first thing in the morning and executes a MapReduce job. By the way, all three of the jobs I'm about to run execute against the exact same data source, which is replicated between the Oregon and Virginia data centers. User1 comes in to Oregon; all three of these windows on the right-hand side of my screen are open in the Oregon data center. So I've kicked off the job, and what you'll notice in a minute, as that job really starts, is the memory and CPU picking up in the Oregon data center.

Along comes user2, and user2 starts a job. You've just started to see a little bit of pickup from user1. User2 begins the job, and now we're going to pass the threshold that we've set; I've got it set at about 65 to 70% or so. If we cross that threshold, then another job that comes into the system should move across. Now we're at about 100% resource utilization, and along comes user3, also in the Oregon data center. I run the job, and of course I've used up all of this capacity at this point. So what you'll see is that even though I've executed all of these jobs in the Oregon data center, in just a moment that last job will be rescheduled, or relocated, to run in the Virginia data center, and you'll get maximum utilization of resources by running jobs across both. Give that a minute to spin up and you'll effectively see that load balance.

Okay, and all this is happening dynamically, obviously.
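The threshold-based spillover in this demo can be sketched as a simple placement function. The roughly 70% threshold and the zone names come from the demo itself; the placement logic is an assumption for illustration, not the product's documented scheduling algorithm.

```python
# Minimal sketch of threshold-based cross-zone job placement, assuming a
# single CPU-utilization figure per zone (the demo's graphs). Names and
# logic are illustrative, not WANdisco's actual Data Center Aware YARN.

THRESHOLD = 0.70  # spill jobs to another zone past ~70% utilization

def pick_zone(preferred, utilization, threshold=THRESHOLD):
    """Return the zone a new job should run in.

    `utilization` maps zone name -> current CPU utilization (0.0 to 1.0).
    A job stays in the zone it was submitted to unless that zone is past
    the threshold, in which case the least-loaded zone wins.
    """
    if utilization[preferred] < threshold:
        return preferred
    return min(utilization, key=utilization.get)

# user1 and user2 fit in Oregon; user3 spills over once Oregon is full.
load = {"oregon": 0.30, "virginia": 0.05}
print(pick_zone("oregon", load))   # stays in Oregon
load["oregon"] = 1.00
print(pick_zone("oregon", load))   # relocated to Virginia
```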
I don't have to manually allocate anything.

And that's exactly the point: we've crossed this threshold here. We're at about 100% CPU and using up a bunch of memory. Now you see this other job kick up, and we're seeing balanced utilization. It's load balancing of job resources and scheduling.

So you've been running this presumably at the booth and sharing it with customers. What kind of questions do they ask you when they see this?

They ask about what kind of flexibility they have. What if we want a job to run in a specific data center and it's okay to wait? Yes, you can do that; there are knobs you can turn for control. Now, what's happening here? You can see all three jobs are running, and because the CPU is pretty much fully utilized across both data centers, you can see that the jobs are indeed running across both data centers, despite the fact that I ran them all from the Oregon data center.

Now, what's interesting is when you start thinking about YARN capacity queues. These are hierarchical queues where you can say: my mission-critical applications can take up 70% of the cluster resource, but less critical applications, or research and development, can use 30% of the resource. What we can then do is say, if you fill up that small queue, we can spill you over into one of the other data centers and utilize the capacity there. So some really interesting use cases come out of being able to control capacity queues, and load-balance across them, at a much more granular level of detail.

Interesting. So you're extending the use cases for active-active beyond the traditional way you think about WANdisco.

Yeah, exactly. And I'm going to show you one more use case now.

Okay.

You can see all these jobs are pretty much finished.
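The capacity-queue spillover just described can be modeled roughly like this. YARN's Capacity Scheduler genuinely caps each queue at a share of cluster resources; the cross-data-center spillover is the WANdisco extension being discussed, and every name in this sketch is invented for illustration.

```python
# Hedged sketch of capacity-queue spillover across zones. Each zone has a
# queue capped at a share of that zone's cluster (e.g. 30% for R&D, as in
# the example above). A job that no longer fits its home queue spills into
# the same queue in another zone instead of waiting.

class Queue:
    def __init__(self, name, share, cluster_capacity):
        self.name = name
        self.capacity = share * cluster_capacity  # absolute cap for this queue
        self.used = 0.0

    def can_fit(self, demand):
        return self.used + demand <= self.capacity

def submit(job_demand, queue_by_zone, home_zone):
    """Run in the home zone's queue if it fits, else spill to another zone.

    Returns the zone the job landed in, or None if every queue is full
    (in which case the job would simply wait).
    """
    if queue_by_zone[home_zone].can_fit(job_demand):
        zone = home_zone
    else:
        zone = next((z for z, q in queue_by_zone.items()
                     if z != home_zone and q.can_fit(job_demand)), None)
    if zone is None:
        return None
    queue_by_zone[zone].used += job_demand
    return zone
```

With a 30% R&D queue in each of two 100-unit zones, the first 20-unit job stays home, the second spills to the other zone, and a third waits.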
The cluster load on both sides of the country, if you will, has just about dropped down. Now, what about that data sovereignty case we talked about? What about when data only exists in one data center? Let me open up a small window over here. I'm going to issue a command, just an ls, that shows us this sovereign data that only exists in one data center. You'll notice I ran it across both data centers: I ran it in Oregon, and I ran it in North Virginia. But if you look at the output highlighted in green here, the replica count is three in the North Virginia data center and zero in the Oregon data center. So now, if I kick off a job in the Oregon data center, it should automatically switch over to the North Virginia data center, because that's where the data lives. And that's exactly what I'm going to show you. There's plenty of capacity to run the job now. User4 comes along and executes a job against that data that only lives in the North Virginia data center. What you'll see, of course, is that despite the fact that there's plenty of capacity in Oregon, the Virginia data center will actually handle this load, because that's where the data is. It's selectively replicated only in that location.

Okay. So now, talk about what customers are asking you about this one and how they're planning on using it.

There are a couple of reasons they use Selective Data Replication. The first is that they don't want to replicate data that's unimportant to them, that doesn't need to be replicated. That could be the trash, temporary intermediate files, files that are easy to regenerate, any kind of temporary data, if you will. The next kind of data they don't want to replicate is that case we spoke about a few minutes ago: the data that can't leave a country. Can't leave Saudi Arabia, can't leave Venezuela, can't leave Germany, as you mentioned.
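The placement decision in this part of the demo, running the job where the replicas actually are, can be sketched as follows. The replica counts mirror the demo's ls output (three in Virginia, zero in Oregon); the lookup table is a stand-in for what a real HDFS client would learn from the name node.

```python
# Sketch of data-locality-driven placement for selectively replicated data.
# `replicas_by_zone` is hypothetical; a real client would get block
# locations from the name node rather than a dict.

def placement_zone(preferred, replicas_by_zone):
    """Prefer the submission zone, but never schedule where the input
    data has zero replicas."""
    if replicas_by_zone.get(preferred, 0) > 0:
        return preferred
    # Fall back to the zone holding the most replicas of the input.
    zone = max(replicas_by_zone, key=replicas_by_zone.get)
    return zone if replicas_by_zone[zone] > 0 else None

# The demo's sovereign dataset: replicated only into North Virginia.
replicas = {"oregon": 0, "virginia": 3}
print(placement_zone("oregon", replicas))  # virginia: that's where the data lives
```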
So this is a very key and crucial requirement for just about every one of our customers: to be able to ensure that data can't leave a location, that it stays in that location, but that there's still consistent access to it.

Interesting. You hear a lot about cloud and elasticity and the ability to spin up resources, and essentially you're bringing that type of agility to your world. These clusters could be on premise, they could be in the cloud. You don't really care, right?

Absolutely. And it really brings out the ability to realize the data reservoir, if you will. You have these other bodies of water: the data streams for sending data to a location, and the data tributaries, as feeder sites, bringing data into another location. We can take that analogy a long way, because we can really create the data ocean with our active-active technology. And the key is that it's the same data source, consistent in all cases.

Yeah, absolutely. I've got one more short demo if you're interested. The last demo, of course, is the unification layer that you made reference to when we started.

My prediction.

This is a very short demo; it'll just take a few seconds. But maybe the first thing to do really quickly is to show you a name node UI. You can see I'm on a machine here called 1hadoop6; that's the name of that machine. You're looking at the name node's web UI, and from the dark blue banner across the top you can probably recognize that it's a Cloudera CDH name node web UI. Now, I've got another machine here called 1hadoop1. This is machine one over in the North Virginia data center. You'll notice it's also a name node web UI, but it has a green banner across the top, indicating that it's running Hortonworks HDP. So I've got two different clusters in the same data locales, Northern Virginia and Oregon, one running CDH and one running HDP.
And the demo is going to be as follows. It'll be very short. I'm going to open up this window here, and we're just going to run a quick MapReduce job; it's called "the cube." When this job completes, what you'll see is that all of the data blocks replicate from the North Virginia side to the Oregon side, in the same active-active capacity that we brought you with the Non-Stop Hadoop product. You now have the ability to do active-active configurations across multiple distributions.

Okay. And this works for any of the major distros, or all the distros?

Well, again, this product is still being built, so it will likely come out first across the major distribution vendors, but effectively it's a way for us to include any distribution anywhere.

That's powerful. All right, Brett. Anything else you've got for us?

Maybe the last thing is just to show you that the replication actually occurred. I'll do a quick ls on the two directories between the two data centers. So let's just do an ls on web logs, Terra, in, cube, and see that the data was actually replicated across those two locations. And that's what I have for you.

I picked the right location. Of course, that always helps. Other questions, Dave?

So you said this is a product that's in beta now. Is that right?

It's technically in alpha right now.

It's in alpha. Okay. Can you share with us the rollout schedule or the plans?

You know, we've got a number of requirements to address. We'll probably start seeing this next year, probably in the Q2 timeframe.

Excellent. All right, Brett. And there it is, just as a last thing: data replicated across the two clusters.

Of course. You've never had a demo fail on us, unlike Bill Gates. All right, Brett. Thanks very much.

Thanks, guys. Good to see you. It was a pleasure.

Thank you. Cheers for stopping by. All right, keep it right there.
We'll be right back with our next guest right after this.