Live from the Fira Barcelona Gran Via convention center in Barcelona, Spain, it's theCUBE at HP Discover Barcelona 2014, brought to you by headline sponsor HP.

Welcome back to Barcelona, everybody. This is Dave Vellante. Steve Tramack is here; he's a director of engineering for big data within HP Converged Systems. Steve and I met at Hadoop World in October, and we had a really interesting confidential conversation, which we can now talk about a little bit. You guys are thinking about the big data infrastructure problem differently, based on what customers are telling you. So why don't we share with our audience what you've been thinking about?

Yeah. As we started talking to customers about big data challenges and next-generation infrastructure, we kept hearing the same things over and over again. Customers would say: I've got this converged infrastructure for all my line-of-business apps, and then I have these Hadoop clusters popping up around the periphery of my data center, one after another, often with the same data, and moving data back and forth between them is a challenge. We were also hearing, from a workload perspective, particularly as Hadoop evolves and customers look at more real-time and technical workloads, that a single-workload platform really wasn't optimal for Hadoop; MapReduce workloads are very different from these newer real-time workloads. So we started hearing these things and looking at trends like cheaper commodity fabrics: 40-gigabit Ethernet has come down in price to where it's only marginally more expensive than 10-gigabit Ethernet, and workload-optimized servers are available. So we started looking at breaking some of the common perceptions. When Hadoop was first released, 100-megabit networks were the prevailing fabric, and the whole concept of co-locating compute and storage made sense there. But we said: what if we split compute and storage, and not in a proprietary way, but in the way that's enabled by YARN, the fact that we now have multiple workloads accessing HDFS? We set up an environment, an asymmetric Hadoop cluster, with storage-optimized nodes running HDFS and compute-optimized nodes running YARN and MapReduce services.
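(To make that split concrete: a minimal sketch of the daemon layout, assuming a generic Hadoop 2.x install and hypothetical hostnames. Storage nodes run only the HDFS DataNode; compute nodes run only the YARN NodeManager and reach HDFS over the network fabric.)

```
# Storage-optimized nodes (SL4540-class): HDFS DataNode only, no NodeManager.
ssh storage-node01 '$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode'

# Compute-optimized nodes (Moonshot-class): YARN NodeManager only, holding no
# local HDFS blocks; all reads and writes traverse the 10/40GbE fabric.
ssh compute-node01 '$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager'
```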
Okay, so YARN, sort of the next gen of Hadoop, was a critical factor in exposing this problem, right? So let's unpack that problem a little more. I remember Jeff Hammerbacher, when we interviewed him in the early days of Hadoop and Cloudera, saying: when I was at Facebook, what I was passionately focused on was that I didn't want to put all my data into this container, this million-dollar box (he made a face when he said it), so I set out to change the way the world stores data. That led to a bunch of commodity components and a bunch of batch jobs running on them. So where have we come since then, and what drove it?

The ecosystem continues to evolve, right? We see other projects spinning up, leveraging the fact that HDFS has become kind of the drop box for big data. It's the common landing place: all data lands in HDFS, and multiple apps and multiple workloads now access it.

Somebody said to me, Hadoop is the new tape.

Right. So all of a sudden, instead of just batch, we're seeing real-time processing, we're seeing streaming, we're seeing a lot more SQL on Hadoop. For instance, we just released Vertica SQL on Hadoop, which runs directly on the Hadoop nodes. As organizations start to aggregate and assimilate the data, analyze it, and see business value from it, they're moving from just batch to multiple workloads, and in siloed clusters, which is the traditional way to deploy, those workloads bring multiple copies of the same data. They also bring very different workload profiles.

Yeah. And you mentioned YARN, which allows you to do many more things in parallel; it stands for Yet Another Resource Negotiator, for you geeks out there. And then there's Spark doing things in real time, and SQL opening up the base of people who can actually code against Hadoop and take advantage of it. So now you're getting much more diverse workloads, and people want to scale compute independent of storage. That's really the problem they're solving.

Right, and that's what we're doing. In fact, we talked with Hortonworks, Cloudera, and MapR about this about eight months ago, and Hortonworks said: by the way, we're working on a feature for the Apache trunk called YARN labels. YARN labels set up administrative groups of nodes, so that you can run a certain set of YARN containers only in the group with a certain label. What that enables us to do is specify: I have these nodes over here with big memory, and I'll call that my big-memory group; these nodes here are my MapReduce nodes; these nodes over there are my SQL on Hadoop nodes. YARN labels let you run multiple compute workloads against the same HDFS and elastically re-slice. So at night you can run all your batch jobs to prepare your data for the next day and throw all your compute nodes at them. If you have an architecture where compute and storage are separate, you can do that; if compute and storage are aggregated, all of a sudden you have to repartition data every time you reallocate resources. But if you disaggregate compute and storage, you can re-slice the compute nodes based on time of day or resource needs: batch at night, and then more real-time, SQL, and ingest during the day, on the same nodes. This architecture is called the HP Big Data Architecture. We blogged about it today, and we're showing it here at Discover. We have a reference architecture that will be public within the week. It leverages our Moonshot highly dense compute nodes and our SL4540 highly dense storage nodes, all running Hadoop. It's an asymmetric Hadoop cluster, and it's standard open-source Hadoop.
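(A rough sketch of the YARN label mechanics Steve describes, assuming Hadoop 2.6-era syntax; the label and node names here are made up. The point is that re-slicing becomes a scheduler change rather than a data move.)

```
# yarn-site.xml: turn node labels on and give them a store in HDFS.
#   yarn.node-labels.enabled           = true
#   yarn.node-labels.fs-store.root-dir = hdfs://namenode:8020/yarn/node-labels

# Define the administrative groups ("big memory", SQL on Hadoop, etc.).
yarn rmadmin -addToClusterNodeLabels "bigmem,sql"

# Put specific compute nodes into those groups.
yarn rmadmin -replaceLabelsOnNode "compute-node01=bigmem compute-node02=sql"

# capacity-scheduler.xml then binds queues to labels, e.g.
#   yarn.scheduler.capacity.root.analytics.accessible-node-labels = bigmem
# so containers for that queue land only on bigmem-labeled nodes.
```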
Okay, so I want to dig into the product, but before we do, I just want to restate the problem a little, if I can. You've got all this diversity going on, all these mixed workloads. And I'm going to tie it back to converged; you've got "Converged" in your title, because converged systems are these blocks of infrastructure that you put in and can't scale independently. And in the Hadoop world, data is coming in so fast, people get ideas, the data scientists say: let's spin up some new clusters, and the characteristics of those clusters may well be different. So tie it back to the converged problem.

Yeah, exactly. In fact, that was a challenge with the Hadoop appliance we had. The biggest piece of feedback we kept hearing from customers was: I love the fact that there's a block that's supported as a whole and easy to deploy, but my workload is slightly different. And particularly with the ecosystem getting so much richer from a real-time, streaming, and SQL-on-Hadoop perspective, that's changing even more. So the concept of converging compute and storage resources makes a lot of sense, but the pace at which this ecosystem is moving demands flexibility.

Okay, now let's dig into the product a little bit. It's Moonshot-based; talk about the components.

Yeah. So this is a reference architecture, not a converged system; it's just a reference architecture that we've released. We're running YARN, MapReduce, and the compute functions on our Moonshot nodes. For those who are familiar with it, it's a 4.3U chassis, about that big, and it has 45 microservers in it. Each server is an Intel Xeon E3 processor, Haswell-based; it's very similar to the processor you'd find in a really high-end notebook, an i7 notebook, but it's a Xeon E3. So it's a four-core proc with 32 gig of memory, a 480-gig SSD, and dual 10-gig NICs. It's super dense packaging; we fit 45 of them in that space. And then the storage block is our SL4540. In the same packaging we have two servers, each with 25 disks, so we have 200 terabytes of storage in 4.3U. With that architecture, what we're finding from a density and power perspective is that when you run just HDFS on those nodes and pull the MapReduce and local file system access off of them, they run very efficiently. In fact, we're running faster I/O tests, DFSIO, remotely than we do with everything co-located on the same nodes.

Really? And now, where does flash fit into the architecture?

So on flash, Cloudera did some testing. They looked at, for instance, shuffles: the map output files and local file system access. If you put those on flash, it accelerates shuffle performance. In the Moonshot, each cartridge has a 480-gig M.2 stick, a next-generation SSD. So we're using that: we install the NodeManagers there, the YARN and MapReduce services, and we use it for all the local file system access. We gain the benefit of flash in that cartridge in a very cost-effective manner, and then we use spinning media for the bulk storage, where the economics make the most sense.

So talk about, Steve, how you came up with this idea, and then I want to find out what customers are saying once they've seen it.

Yeah. Greg Battas is our chief technologist on big data, and Greg has a long history in the parallel database world. These concepts are very similar to, in fact, if you look at Cray, the Cray architecture: all these little compute blocks and storage blocks.

HPC coming to big data.

That's exactly right. These concepts have been around for a long time. And really, this was at the impetus of our customers saying: we're trying to figure out how to disaggregate compute and storage, and to do so in an open-standards kind of way, so that we have the flexibility to scale out compute, to change the ratios as workloads demand it, and to deduplicate data, so we don't have three Hadoop clusters all holding the same data because of different departmental or workload needs.
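(Two concrete pieces behind the flash discussion above, as a hedged sketch: pointing YARN's intermediate data at the cartridge's M.2 flash is a one-property change, and TestDFSIO is the stock Hadoop I/O benchmark. The mount point is hypothetical, and flag spellings vary slightly by release.)

```
# yarn-site.xml: put NodeManager local data (map output spills, shuffle
# files) on the Moonshot cartridge's M.2 SSD.
#   yarn.nodemanager.local-dirs = /mnt/m2ssd/yarn/local

# TestDFSIO throughput runs against the remote HDFS storage tier
# (older releases spell the size flag -fileSize, in MB).
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
  TestDFSIO -write -nrFiles 45 -size 1GB
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
  TestDFSIO -read -nrFiles 45 -size 1GB
```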
Okay. Now give me some sense of what the customers have been saying. I mean, you've shown it to, what, a handful? A dozen?

Over several dozen at this point.

Really? Several dozen. You've been busy this week.

Well, this week alone has been pretty busy, by the way. But over the last six months we've probably briefed 40 or 50 customers.

What are they saying?

We're getting comments like: I've been trying to figure out how to do this for months. Customers are telling us: you've described exactly what my problems are, and we've been looking for a way to do this kind of architecture for big data, particularly around the flexibility to scale workloads independently without having to redistribute data. That's been the primary feedback. Again, we just released the demo and the blogs this week; everything has been under confidential disclosure until now, as you know. But we're starting to work with customers on some live POCs and pilots.

So why the decision to show a little leg now? Just trying to learn: are you worried that competitors are going to say, oh, I can do that too?

It coincides with the release of YARN labels and HDFS tiering in the new Hortonworks platform. These are capabilities this architecture is really designed to leverage. With HDFS tiering, you can now specify different classes of storage within your Hadoop cluster for really hot data versus really cold data, and this kind of architecture lets you deploy purpose-built servers for each of those.

So you want to give developers visibility into that and take advantage of that momentum in the open-source community.

Everything, and we're working very closely with the server team here, everything that we've been doing, is shipping now.

A totally different mindset in the open-source world as to how you release products, right? In the old days it would be: sackable offense if anybody leaks anything about this, all emails secured.

Yeah, I think it's more in line with the open-source community. And the fact that the servers and all the HP products are shipping, and that we're not doing anything proprietary here, means that, frankly, anybody could go out and build this if they understood what was available to them.

Well, you have some advantages, and Moonshot is key. This is a really interesting use case for Moonshot.

The density we're getting from this architecture: a full rack of it delivers the Hadoop performance equivalent of roughly two and a half racks of a traditional architecture.

Right. Okay. Now, can we expect to see other solutions around this going forward? You mentioned Vertica before, and other parts of the HP portfolio.

We're doing additional reference implementations right now. We have documents in the works for Hortonworks, Cloudera, and MapR, and we're also looking at things like Spark, HBase, Vertica, obviously Vertica SQL on Hadoop, and a number of other workloads as well. We're going to be driven by customers. As we've talked with customers, a number have, fortunately, said they'd like to be design partners with us and model their workloads on this architecture. So we'll be driven by what the community needs.
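(For reference, the HDFS tiering mentioned above surfaces as storage types and storage policies in Hadoop 2.6 and later; a sketch with hypothetical mount points and paths.)

```
# hdfs-site.xml: tag each DataNode directory with its media type.
#   dfs.datanode.data.dir = [SSD]/mnt/ssd/dfs,[DISK]/mnt/disk/dfs,[ARCHIVE]/mnt/cold/dfs

# Pin hot datasets to flash and cold ones to dense archive media.
hdfs storagepolicies -setStoragePolicy -path /data/hot  -policy ALL_SSD
hdfs storagepolicies -setStoragePolicy -path /data/cold -policy COLD
hdfs storagepolicies -getStoragePolicy -path /data/hot
```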
Again, it's a very interesting world we live in, right? You mentioned three competing distros. You guys have an investment in Hortonworks, and yet you understand the need to accommodate the others; it's an open world.

It's relatively straightforward to do.

But it's something you've got to do if you want to play in this world, so you guys are thinking about it the right way. That's great. I'm really excited about this and hope to learn more. Steve, I really appreciate you stopping by theCUBE and sharing this with our audience.

Thank you very much. Thanks for the time.

Good deal. Keep right there; we'll be back with our next guest right after this. This is theCUBE. We're live from HP Discover in Barcelona. We'll be right back.