Live from New York, it's theCUBE, covering Big Data NYC 2015. Brought to you by Hortonworks, IBM, EMC, and Pivotal.

This is George Gilbert, we're at Big Data NYC, running in conjunction with Strata 2015. And I have two special guests with us: Scott Gnau from Hortonworks, who's the CTO, and Joseph George. It's a long title, so I have to read it: Executive Director of High Performance Computing and Big Data Solutions. Welcome, guys. Thank you.

So, it turns out, something that I guess I did not know, and I'm guessing a lot of folks didn't know, that HP and Hortonworks collaborated on YARN, the foundation of what we essentially call Hadoop 2.0. Why don't you tell us a little bit about how that came about and how you're thinking about the roadmap for that?

Yeah, absolutely, I can start. HP has been and continues to be focused on solutions and on empowering the data-driven organization. We've for a long time been pioneers in getting the server architecture right, and we do that with our ProLiant DL380 server; it's the most popular Hadoop server in the world. But we have now taken it to the next level with more purpose-built platforms like the Apollo 4000 line, and we've worked on architectures like the Big Data Reference Architecture, where we separate out compute and storage for Hadoop clusters, a very foreign concept, right, to the community in general. And we wanted to do our part in contributing to the community, and we noticed that this whole dynamic allocation of resources was something we could bring some of our expertise to.

Just to bring it home for our viewers who might be looking for the concrete angle: dynamic allocation of resources. In other words, you've got different jobs that need claims on that underlying hardware. That's right. And without YARN, you had to give them everything in the cluster. Every component kind of looked the same, right? And as we as a community continue to advance, our objective was to try to figure out how we could develop software that could help solve some problems in the community. And in this particular instance, we realized that if you look under the covers, all servers are not created equal, right? Some are much more storage-oriented, some are maybe a previous generation. What YARN enables you to do is to be intelligent about where you allocate certain workloads and what parts of the cluster they run on; certain things that are more intensive from a CPU perspective, we can allocate to different parts of the cluster.

So how would that look when you've got, let's say, an interactive workload where you want a user response, say a Hive or an Impala query, and then something where you've got a big batch job and you've still got a window to get it done? How do you balance that? What does it look like to the administrator? What does it look like to the programmer?

Sure. And, you know, Scott, feel free to chime in here. There are two ways we can do it right now. Obviously, in Ambari, there are ways you can do that programmatically so it can be responsive to what's happening in the cluster. And HP also has a product called Cluster Management Utility, CMU, which does that in a way that's a little more graphically oriented. So you've got ways to set up your cluster and code it so that it can be responsive to whatever's happening in the cluster, or whatever the data is requiring it to do, or user input.

Okay. Yeah, and I think you made an important distinction, right? In Hadoop 2.0, right?
YARN is really the center of the universe, and that's really driven by marketplace dynamics, right? So the Hadoop cluster is no longer just a mass storage device; it's still a mass storage device, but now there are many different kinds of applications that need to take advantage of all of the data being stored. And those different applications have different requirements: some are very IO-intensive, some are very CPU-intensive, some are very memory-intensive. So being able to manage those multiple applications, those multiple engines, against the same copy of data is really the value-add you get with YARN, where you can actually allocate the resources out for a multi-tenant application kind of environment.

Okay, so give me an example where you would have the CPU-intensive and the IO-intensive, where by having them share resources you get overall better utilization and they don't really step on each other.

Yeah, so you may have some sort of workload where you need to scan all of the data for some sort of analytic, right? And scanning all of the data is very IO-intensive. In the world we live in, unless you've stored all the data in memory, which in these instances is not going to be the case, you're going to wait on the IO bandwidth of the disk drives; it's the equivalent of an email versus shipping something on a cargo ship around the world. So while you're waiting for that, having extra CPU allocated to your job is of no benefit; the CPU will be idle. By having a resource manager, a resource broker, in the center there, you can parcel things out to jobs that need them, save a resource for another job that might need it more, and make prioritization decisions based on ROI, based on the timeliness of the answer. That's really where it's at. So it's about matching the different dynamic workloads and taking full advantage of the platform.

And then, as was mentioned with the partnership and some of the newer products with HP, you can now extend that into the actual hardware configuration, right? In a standard commodity stack-up, the number of disk drives and CPUs is fixed, right? Rack them and stack them. And that's a good solution for many opportunities. But in the end, you're going to have a finite number of disks; you're going to have finite IO bandwidth and finite memory and CPU. So you can reallocate those among your jobs, but you can't really change the ratios. And I think part of the power of the partnership we have is that, in addition to YARN at the center of the universe and multi-tenant application stacks, you can now start to change the ratios in the configuration dynamically as new applications come online.

Okay, boil it down to an example where you had sort of fixed allocations and ratios across your nodes, and now you want to say: I want to allocate more IO bandwidth for the batch analytic job that's just pulling stuff off disk or flash drives, and then, while the CPU is relatively idle, I want to use it for something else.

You can use it for intense calculations, right? So if you're running neural-network algorithms and they just require a lot of CPU and memory, while the CPU is waiting on the IO in the other job, you can redirect it and have it go calculate those answers.
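To make the resource-broker idea above concrete, here is a minimal sketch of how an application might ask YARN for two differently shaped containers against the same cluster. It assumes Hadoop 2.6+ client libraries; the container sizes, priorities, and the "IO-bound scan" versus "CPU-bound job" framing are illustrative assumptions, not details from the conversation.

```java
// Minimal sketch (assuming Hadoop 2.6+ client libraries): an application
// master asks the YARN ResourceManager for two differently shaped
// containers, letting the scheduler interleave them on shared nodes.
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class MixedWorkloads {
    public static void main(String[] args) {
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        // (A real application master would init/start this client and
        // register with the ResourceManager before adding requests.)

        // An IO-bound full scan: modest memory, a single vcore, since the
        // task will mostly be waiting on disk bandwidth anyway.
        Resource scanShape = Resource.newInstance(2048, 1); // MB, vcores
        rm.addContainerRequest(new ContainerRequest(
                scanShape, null, null, Priority.newInstance(1)));

        // A CPU-bound job (say, neural-network calculations): many vcores,
        // which the scheduler can place where cores sit idle during scans.
        Resource cpuShape = Resource.newInstance(4096, 8);
        rm.addContainerRequest(new ContainerRequest(
                cpuShape, null, null, Priority.newInstance(2)));
    }
}
```

The point of the sketch is simply that the broker sees each request's shape (memory versus vcores) and can pack a CPU-hungry job onto the same nodes where an IO-bound scan is leaving cores idle.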
But let's say in a rack, which would have many computers, could you sort of allocate more IO so that, you know, the analytics get done faster than if you just had the fixed amount?

You can reprioritize existing resources in a traditional rack-and-stack commodity setup, but you can't allocate more than physically exists. Whereas you can with this new reference architecture. With the HP reference architecture, you can actually add CPU, memory, and IO bandwidth to the configuration independently of each other. And that's an important distinction here, because what we have to recognize is that in the new HP Big Data Reference Architecture, we are adding a new way of doing Hadoop, where you've got Apollo 4000s, which basically handle the HDFS functions of Hadoop, and then you've got Moonshot running more of the MapReduce functions of Hadoop. And when you couple that with the YARN project, you're able to be intelligent about it: do I use maybe my standard DL380s for certain jobs? But if I have a storage-intensive type of workload, I can dynamically allocate some of those workloads to go focus specifically on what's happening at that portion of the cluster. It's really getting away from this notion that every block is the same; we can be far more intelligent now about where these workloads need to run, so they run optimally and you get more results faster. And frankly, you can actually do it in less data center space and with less power.

Okay, so then there's this debate: when do I run on-prem, when do I go to the cloud, when do I go hybrid? How does this change that trade-off?

My perspective is, I think it's adding a little bit more thought to how these things are allocated, right? So I wouldn't say it really changes the paradigm too much. There are some workloads that absolutely make sense to be in a cloud environment, off-premises, and some that make sense to stay on-premises. I would say this is now helping us realize we need to be more thoughtful about what's running on our clusters, and where and when and how we want to run those things. In particular, like I said, if you want to scale the storage characteristics of your cluster, you no longer need to add a compute-plus-storage solution; you can go and just scale the storage component of it. Same thing with compute: you don't need a storage-plus-compute solution to solve a compute problem. So on the on-premises side of things, I think we're getting a little bit more intelligent as a community about saying, you know, one instance does not equal every instance.

Let me try an analogy for folks who've, say, become familiar with VMware and server virtualization over the years. Would this be somewhat analogous to saying: okay, I've carved up a server into many virtual machines, but now I'm going to expand the compute on this box and I might expand the IO reservation on that other box. Would it be similar to that?

It's a way to think of it, but even in that context, the virtual machines can never allocate more of a resource than exists. In this scenario, you can actually add more of that resource physically and take advantage of it across the cluster. That's exactly right.
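Returning to the Apollo/Moonshot split described above, YARN node labels (generally available in roughly the Hadoop 2.6 timeframe) are one mechanism for steering work to the right hardware pool. Here is a hedged sketch; the label name "compute" and the host name are hypothetical, and the mapping of Moonshot to a compute label versus Apollo to the unlabeled default partition is an assumed arrangement, not one the speakers specified.

```java
// Sketch of node-label-based placement in a heterogeneous cluster.
// An admin would first create and assign the hypothetical label, roughly:
//   yarn rmadmin -addToClusterNodeLabels "compute"
//   yarn rmadmin -replaceLabelsOnNode "moonshot-host-01:45454=compute"
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class LabeledPlacement {
    // Pin CPU-heavy work to the "compute"-labeled (Moonshot-like) nodes
    // via the node-label expression argument.
    static ContainerRequest computeHeavy() {
        return new ContainerRequest(
                Resource.newInstance(8192, 16), // big on vcores
                null, null,                     // no host/rack preference
                Priority.newInstance(1),
                true,                           // relax locality
                "compute");                     // node-label expression
    }

    // No label expression: land on the default partition, here assumed to
    // be the storage-rich (Apollo-like) nodes holding the HDFS blocks.
    static ContainerRequest storageLocal() {
        return new ContainerRequest(
                Resource.newInstance(2048, 1),
                null, null, Priority.newInstance(1));
    }
}
```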
Okay, let me try another analogy, see if I understand it. Exadata has, like, three networks: there's a control network, there's a storage network, and then there's, I guess, the sort of cluster network for the database nodes. So you can add capacity to the storage network, as well as storage, independent of the database cluster. Is that one way of thinking about it?

In that scenario, it would be similar. So, compare it to a traditional rack-and-stack, right? If you have servers in your rack-and-stack commodity cluster and each server has 12 drives, and you've got 100 of them, and you say, oh, I need the equivalent of 20% more IO bandwidth in that system. Effectively, you'd have to go put two more drives in each one of those servers, and, well, there may not even be slots available. It's hard, right? With this technology, the idea is you can just spool up more disk drives and not have to go open everything up. So it creates more flexibility, and it really changes that ratio of CPU to IO, right? So it becomes an extension of the workload management in YARN. And I think the idea really is, I don't think it changes the cloud paradigm, as was said; it's just another choice, right? So when I run out of a resource, instead of just having to add bulk, a whole bunch more racks of servers, I can be more intelligent about what I add. I can add just the resource that I need, and I can change my mind over time.

Let me give kind of an example, right? This is what customers are doing with both of us today. A lot of them have a Hadoop cluster that's built of standard general-purpose 2U servers. So a standard 2U server: 2U high, you know, a meter long, somewhere between 14, 15, 16 drives, okay? And that is kind of the basis for how they're building their Hadoop cluster. And depending on the capacity they want, they'll kind of build up that rack, right? You know, 15 of those servers in a rack. And by running your workloads in that cluster, let's say we're now able to tell that I'm having a problem with hot data, and the IO envelope I have within that CPU envelope is not sustainable; it's slowing down, there's a performance lag. In a traditional architecture, you would basically add more of these kind of 2U servers, which have some drives and some compute. What we're saying is that in this new asymmetric way of doing a Hadoop cluster, you've got, for example, Apollo 4000s, which is essentially one server with, you know, 28 large-form-factor drives, so it's built for storage as opposed to being a standard general-purpose server. And then you've got Moonshot doing the MapReduce functions, which doesn't have a lot of storage but has a tremendous amount of compute power. So now, if we know there's a bottleneck with CPU and memory, we just scale the Moonshot portion of it. If there's a bottleneck with the storage capacity, you can add more Apollo. Just add more. Right, add more Apollo. So, I like to say it like: why solve a storage problem with a compute solution, or why solve a compute problem with a storage solution, or vice versa, whichever one I said. It lets you be more intelligent.
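To put numbers on the drive-count example above: the 100-server, 12-drives-each baseline and the 28-drive Apollo figure come from the conversation itself, while the assumption that IO bandwidth scales roughly with spindle count, and the resulting arithmetic, are an editorial back-of-the-envelope sketch.

```java
// Back-of-the-envelope arithmetic for the example in the conversation:
// 100 servers x 12 drives, wanting ~20% more IO bandwidth (assumed to be
// roughly proportional to spindle count).
public class SpindleMath {
    public static void main(String[] args) {
        int servers = 100, drivesPerServer = 12;
        int baseline = servers * drivesPerServer;        // 1,200 spindles
        int extra = (int) Math.ceil(baseline * 0.20);    // 240 more needed

        // Traditional fix: add whole 2U compute-plus-storage servers.
        int newServers = (int) Math.ceil(extra / (double) drivesPerServer);
        System.out.printf("Traditional: %d more 2U servers (plus CPU you may not need)%n",
                newServers);                             // 20 servers

        // Asymmetric fix: add storage-only chassis (28 LFF drives each,
        // the Apollo figure quoted in the conversation).
        int chassis = (int) Math.ceil(extra / 28.0);
        System.out.printf("Asymmetric: %d storage chassis, no extra CPU%n",
                chassis);                                // 9 chassis
    }
}
```

The same 240-spindle shortfall costs 20 general-purpose servers in the symmetric model but only about 9 dense storage chassis in the asymmetric one, which is the space-and-power point made at the end of the conversation.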
Okay, so for hardware newbies, I mean, it's beginning to sink in. When you're talking to customers, how easy is it to explain this story? And I assume proofs of concept start out with, like, six servers, and they're basically kicking the tires. How does this, and I'm going to have to make this the last word, how does this make it easier for them to get from proof of concept into production, when they're worrying about reference architectures or how to size their capacity?

Yeah, so in our interactions with customers, most of them have usually already started with a Hadoop environment. With the Hadoop software customers, we usually have a conversation that says: look, you may not understand what's happening at the IT level, but it's going to impact you when your IT organization says, we've run out of space, we've run out of power, sorry, your Hadoop cluster can't grow any further. From the IT team's perspective, they're already trying to figure out: I've got this cluster of a certain size, and I want to grow it. Finding capex to build out a new data center is a problem; going to a new colo, or getting a colo, is a problem. What this architecture lets you do is essentially take something that was traditionally, like, two racks and do it in half a rack. So in addition to the scalability, you've got all this density, and power savings as well.

All right, with that, that'll have to be the last word. We'll be hearing more from you. Scott Gnau, Joseph George, thanks for joining us. George Gilbert, at Strata 2015, Big Data NYC. We'll be back shortly.