theCUBE at Hadoop Summit 2014 is brought to you by Anchor Sponsor, Hortonworks. We do Hadoop. And headline sponsor, WANdisco. We make Hadoop invincible. Hey, welcome back. We're here live in Silicon Valley in San Jose for Hadoop Summit 2014. This is theCUBE, our flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier with Jeff Kelly. Our next guest is Dirk deRoos, the worldwide technical sales lead for IBM, out in the trenches here. IBM is at the show; we had Inhi Cho Suh earlier. She's great, one of our Cube favorites. She's dynamic, super smart. We love her. But IBM being here is a testament to all the big whales here. Cisco's here, AT&T. I mean, the names go on: the Oracles, the big guys, IBM with its huge install base. Then the growing pre-IPO companies like Cloudera, down to people getting their Series C funding, to the early stage through Series B. Everyone's doing well in big data. So what are you guys doing here at the show? So the big thing that we're promoting at the show this year is Big SQL. People talk about SQL on Hadoop and so on, and there are about 10 fairly significant projects happening. Ours, we think, is distinct because instead of rolling our own brand-new relational engine for Hadoop, we took an established relational engine. We've done this very well over the years. We have 30-plus years' worth of engineering experience, and we've taken all that wisdom that's been distilled through discussions with customers and made it work with Hadoop. So we effectively built an integration layer. And just to put a little plug in, because we cover a lot of your events with theCUBE: we love covering IBM's transformation. Certainly the strategy is looking good. IBM IOD has been renamed. So the conference, IOD, Information on Demand, is now IBM Insight. So let's get that out of the way and share it with the folks out there.
And the other one is you wrote a book. That's right, yes. Hadoop For Dummies, which is good for us on theCUBE here; we can go through this for a few questions. What is MapReduce? Oh, it's been replaced with Cascading. We just found out from Chris Wensel. Great book. This one's for you, John. You can have it. I love it. I'll give it to my son. He's 12. It's the future. Yeah, he was doing a system on a chip and programming just this weekend. So maybe he can get into some MapReduce code. That's right. Kidding aside, great book. We're going to pick this book up, the Wiley brand. Yeah, it's available on Amazon.com and in local bookstores. What was the motivation for the book, just to get something out there for practitioners? Yeah, so one of the things that we found with the Hadoop space is that there's an awful lot of content at the surface level, a lot of slideware and architectures, that sort of thing. And then there's a deep chasm in the middle where there's nothing, and then you go deep into Java code and so on and so forth. So there's really nothing at the level where people want to get started. And that, essentially, is what the book is about: to take you from that high-level piece and get you working, doing some very straightforward and basic things with Hadoop. Do you cover HBase in here at all? We have a big chapter on HBase, yeah. Are you seeing good things with HBase? I'm looking at the tag cloud, and there's not as much conversation this week about HBase versus a couple of years ago. Yeah, and part of that is that this is a Hortonworks conference. I mean, if you go to the Cloudera conferences and so on, there's a lot more discussion of HBase because that's more their thing. And HBaseCon, which we covered on theCUBE, the original one. Yeah, so in looking at how our customer base is using Hadoop and our Hadoop offering, BigInsights, I would say probably a good 30 to 40% are serious about HBase at a large scale.
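For readers who, like the book's audience, are starting from that "What is MapReduce?" question, here's a minimal single-process sketch of the word-count pattern. Real Hadoop distributes these phases across a cluster; this just shows the map, shuffle, and reduce flow in plain Python.

```python
from itertools import groupby

def map_phase(lines):
    # Emit (word, 1) pairs, like a Hadoop Mapper would.
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle(pairs):
    # Group intermediate pairs by key, like Hadoop's shuffle/sort phase.
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, [v for _, v in group]

def reduce_phase(grouped):
    # Sum the counts per word, like a Hadoop Reducer would.
    return {word: sum(values) for word, values in grouped}

lines = ["big data big insights", "hadoop does big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts: {'big': 3, 'data': 2, 'does': 1, 'hadoop': 1, 'insights': 1}
```

The same three-phase shape scales out in Hadoop because the mapper and reducer never need to see the whole dataset at once.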
And then there's another 20 or 30% or so that are using HBase for things like dimension tables and that sort of thing on the platform. So a very prominent industry analyst, Tony Baer, who you guys know, covers big data. He's in the elite with Jeff Kelly and Merv Adrian; we're talking about the analysts. He had a quote this week, a blog post, that SQL is the gateway drug for the enterprise. Meaning, okay, it's going to the enterprise and it's a good bridge to get started. So it's essentially big data for dummies, if you will. Hadoop for dummies, okay. Getting started, you've got legacy infrastructure. Do you agree with that? Are you seeing SQL being the common language for getting started and bridging into a Hadoop framework? Yeah, most definitely. I don't think it's the language; I think it's a language, and perhaps one of the most important ones. If you look at all the established tooling that's out there, like Cognos and that sort of thing, those are all SQL-based tools. And for those tools to work with Hadoop, to be able to take advantage of the processing and the storage that Hadoop offers, you have to have that SQL interface. That was a big strategy behind Big SQL: take an established relational engine and effectively make it so that it can handle the kind of queries and the workload that Cognos would spit out. Because one of the things we find with a lot of the other SQL-on-Hadoop solutions is that they're very immature in terms of query support. So very basic things. Not full SQL, it's SQL-like. Exactly, yeah. And there's a Big Data Borat quote about that: everybody doing SQL on Hadoop talks about how they support ANSI SQL, but only the parts they happen to support. Right, that's interesting. Let's take a step back. I think with IBM, you've got a deep and wide breadth of products around data management, data analytics, data integration.
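To make the "query support" point concrete: BI tools such as Cognos generate nested, ANSI-style SQL rather than simple SELECTs. The sketch below, using SQLite as a stand-in engine with an invented table, shows the kind of shape involved: a derived table compared against a scalar subquery. Engines with only partial, "SQL-like" support often reject queries of this shape.

```python
import sqlite3

# Invented example schema and data, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('east', 100), ('east', 300), ('west', 50), ('west', 70);
""")

# A derived table filtered against a scalar subquery: the nested style
# of SQL that BI tools emit, and that immature engines struggle with.
rows = conn.execute("""
    SELECT region, total FROM (
        SELECT region, SUM(amount) AS total
        FROM sales GROUP BY region
    )
    WHERE total > (SELECT AVG(amount) FROM sales)
    ORDER BY region
""").fetchall()
# rows: [('east', 400.0)]  -- only east exceeds the overall average of 130
```

A tool-generated report may nest several such layers, which is why full query support, rather than a SQL-like subset, matters for connecting existing tooling to Hadoop.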
You name it, IBM's got a product for it. So Hadoop, I think, to some extent has kind of gotten lost in the shuffle at IBM, at least as a perception from our audience. So help us understand exactly where Hadoop fits into your larger strategy at IBM. And then I want to dig a little more into the products that you're actually selling, BigInsights specifically, but maybe first just talk about the larger point of where it fits in. Yeah, so for us, Hadoop is critically important. The view that we have for Hadoop is not just a view for Hadoop; it's a view for the enterprise, for the data center. And we see Hadoop as a disruptive technology, not to replace the warehouse, but to enable you to get more value out of it. The idea being that for large-scale ETL jobs, you can run those on Hadoop with a mature ETL tool like DataStage, for instance, where your ETL engineers simply see Hadoop as yet another source or target, a place where work can happen. And at the same time, if people want to use Hadoop as a landing zone, or, I like Merv Adrian's term, the data reservoir, that kind of notion, then that makes piles and piles of sense. But the warehouse is still, as far as I'm concerned, not going anywhere. It's been designed to the nth degree by performance engineers to handle large-scale, repeated queries of the same kind for reporting. And Hadoop doesn't have the DNA to sustain that. Hadoop has very different DNA, which makes it very useful for transformation and for dealing with variable-schema data, that sort of thing. Well, it's interesting as Hadoop develops; it'll be interesting to watch the direction it goes. There already is some overlap with the data warehouse, and it'll be interesting to see if that overlap continues or if, as John and I have discussed, they find their own swim lanes and their own place.
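The landing-zone pattern described above, raw variable-schema records land as-is and an ETL step normalizes them before the warehouse load, can be sketched in a few lines. In Hadoop this transform would be a DataStage or MapReduce job; here it's plain Python, and the field names are invented for the example.

```python
import json

# Raw records with inconsistent schemas, as they might land from
# different upstream systems (invented data for illustration).
raw_records = [
    '{"customer_id": 1, "spend": "42.50"}',
    '{"cust": 2, "spend_usd": 10}',
    '{"customer_id": 3}',
]

def normalize(line):
    # Schema-on-read: interpret each record at transform time,
    # mapping the variant field names onto one clean schema.
    rec = json.loads(line)
    return {
        "customer_id": rec.get("customer_id", rec.get("cust")),
        "spend": float(rec.get("spend", rec.get("spend_usd", 0.0))),
    }

clean = [normalize(r) for r in raw_records]
# clean[0]: {'customer_id': 1, 'spend': 42.5}
```

The warehouse only ever sees the clean, fixed-schema output, which is the division of labor the interview describes: Hadoop absorbs the messy transformation work, the warehouse keeps serving the repeated reporting queries.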
But let's get a little more into specifics around Hadoop at IBM. So you've got BigInsights as your distribution. Talk a little bit about how you package that up and sell it, and let me take a step back: do you monetize it or do you give it away for free? How do you make it into a product? So I like to break it down into two main areas of focus. Some customers focus on both of these, and in many cases just one or the other. On the one hand, it's about infrastructure: workload isolation, storage isolation, that kind of thing. And with BigInsights we do have an optional alternate file system. You can use HDFS, or you can use IBM's file system, called GPFS, which came out of high-performance computing environments and enables that storage isolation very nicely, along with multi-site replication, all those sorts of things that people need in certain circumstances. We also have workload isolation and workload enhancement through a tool called Platform Symphony. Platform was an acquisition that we brought on, I want to say, about a year and a half to two years ago. They're based out of my hometown, Toronto, and they're very big in the financial space, where you have a lot of racks, a lot of blade servers, and you need to manage the workload across them. And what they've effectively done with their offering, parts of which are now bundled in with BigInsights, and again it's an optional piece, you don't have to use it if you don't want to, is let you effectively treat Hadoop like a grid, which is an interesting notion. And yes, to some degree YARN and MapReduce and Spark and so on are taking us there to a point, but it's still kind of an emerging space, right? With Platform we have mature, established software that's been around for a dozen or so years, and they know how to do this.
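To give a feel for the grid-style workload management idea, here is a loose, invented sketch of fair-share scheduling across tenants: each tenant queues tasks, and the scheduler hands resources to whichever tenant has consumed the least so far. This illustrates the general concept only; it is not Platform Symphony's actual algorithm or API.

```python
class FairShareScheduler:
    """Toy fair-share scheduler: pick work from the least-served tenant."""

    def __init__(self, tenants):
        self.usage = {t: 0 for t in tenants}    # work units consumed so far
        self.queues = {t: [] for t in tenants}  # pending (name, cost) tasks

    def submit(self, tenant, name, cost):
        self.queues[tenant].append((name, cost))

    def next_task(self):
        # Among tenants with pending work, pick the one with lowest usage.
        ready = [t for t in self.queues if self.queues[t]]
        if not ready:
            return None
        tenant = min(ready, key=lambda t: self.usage[t])
        name, cost = self.queues[tenant].pop(0)
        self.usage[tenant] += cost
        return (tenant, name)

sched = FairShareScheduler(["risk", "etl"])
sched.submit("risk", "var-calc", 5)
sched.submit("etl", "nightly-load", 2)
sched.submit("etl", "cleanup", 1)
order = [sched.next_task() for _ in range(3)]
```

The point of the pattern is that no tenant can starve the others, which is the kind of multi-tenant guarantee financial-services grids need and that, per the interview, YARN was still maturing toward at the time.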
So that's the one half, which is really about enterprise stability around the base infrastructure. The other side of it is analytics. And in my opinion, that's where you have the greater opportunity to really use Hadoop as a transformational tool for your business. Totally agree. Yeah, so I'm sure that most of the folks on your show have been going down that road. Yeah, well, just earlier today somebody pointed out that if you're looking at it purely as a storage platform, you're kind of missing the point. It can do that very well, but it's really the analytics and the insights it enables that make it transformational, as you say. Yeah, so when John mentioned the quote about the gateway drug, I see the cost-saving potential of Hadoop as the gateway drug. That's how you're going to get started, get in, and then move on to the analytics. Yeah, so it's kind of like come for the commodity hardware, stay for the analytics. Very cool. And so in terms of actually packaging it up, you've got PureData System for Hadoop, the appliance, is that right? Yeah, so that offering is being changed a little bit in the recent announcements, which I'm pretty excited about. Our Hadoop model, and I'll just say this because people are familiar with it, is very akin to what Cloudera has done. We have an Enterprise Edition, which has all the bells and whistles, the full meal deal for analytics, the file system alternative, and the processing assistance. Then we have our Standard Edition, which has what I like to talk about a lot, which is Big SQL, and some of our other application developer tools, all that kind of stuff. But we also have a free edition, a Quick Start, that you can download and install and use as much as you like, but there's no support for it.
So the free edition: to what extent is it Apache, open source, some of the project components? Yeah, good question. So those that know about IBM BigInsights often say, I mean, this is IBM, this is another one of the big whale companies; they want to lock us into their platform and force us to use proprietary things that keep us tied to them. But the core of IBM BigInsights is Apache open source. The same components as Cloudera, Hortonworks, or, if you download Bigtop and kind of roll your own, it's the same stuff. What we've done is we haven't monkeyed around with or changed those things; we have extended them. So it's the embrace-and-extend model of working with open source, basically. So something like Big SQL is where you're extending it, but the core is Apache, so you've got that open-core model, and then you layer on some proprietary tools that actually enhance it. It's very cool. So for our file system, at install time you choose: do you want to stick with HDFS, which is entirely appropriate and suitable in a number of cases, or, if you have complex replication requirements and so on, then GPFS is probably a better choice. But it's an option, so it's not lock-in in that sense. So let's talk about the IBM situation. IBM has got so much going on with open source. Yeah. Bluemix is around the corner. That's right. Well, actually it's not around the corner; it's out there now. Yeah. Adam's trying to get me to go to Bluemix. Adam, let's talk, if you're watching. I'm sure he's watching, though it's a busy day he has. So you've got a lot of open source going on. Reconcile the big data strategy with all the work going on in the cloud and open source. Explain to the folks out there where IBM stays open and where IBM becomes IBM. As always, IBM adds value on top of it. Yeah.
You can parse that through for the customers out there. Sure, yeah. So I'll come back to the analytics piece. I mentioned there are probably close to a dozen now of these SQL-on-Hadoop projects. And it's not just SQL on Hadoop; it's also streaming data. There are about a half dozen fairly significant projects for streaming data on Hadoop. And graph databases, statistics, all of those things. So I mean, I am very much an open source person. I've invested the last number of years of my career in Hadoop. I wrote this Dummies book. So I'm all in. And I completely believe in that open source model. And that reflects IBM's values. We were one of the first companies to really back Linux in the early 2000s. Now, we don't sell Linux, but we do make a lot of money through Linux, through all the tools like DB2 and all the other tools that work on Linux. So contributing to that base is something we do with Hadoop quite extensively; most recently with HBase, for instance, we have a number of developers code-contributing to HBase to solidify its backup and restore operations and so on. And that's something that we donated. We felt that it was better to donate it than to monetize it. I mean, IBM doesn't hide the fact. I interviewed Steve Mills and he's very candid: we love enabling the market. Open source is a great foundation to build off of, to accelerate value fast, and we add value on top of that. That's IBM. That's what IBM does. It's no secret. And you charge for it. That's true. But this is an emerging space. Hadoop isn't done yet and won't be done for a while. So while we have all these parallel efforts going on, like for instance with Big SQL, we think this is the best SQL alternative. And it's largely because we have established engineering. We have a refined offering, one that works, one that's fully SQL-compliant and so on. Well, Actian might disagree with you on that one.
They claim they've got the most powerful platform. Well, we'll see the benchmarks. Bring it on, right? Yeah, exactly. That's what I love about it. So let me just go through Merv Adrian's quote. I brought up Tony; obviously, Jeff was on stage for the keynote, and so was Merv. Tony's quote was about the gateway drug. Merv's quote here on theCUBE, Merv Adrian of Gartner, was that this is a 10-year cycle. Yes. And I totally agree with him on that one. So if we're in a 10-year cycle, we've got the Dummies book out here, we're getting all up to speed, the tinkerers are building out tooling. You're going to see a natural evolution around automation, reducing steps. We're seeing Cascading has got some traction; things like that are making MapReduce better. Other things here and there. What has to happen, in your mind, to accelerate the evolution of Hadoop and the value that will emerge from it? I think it comes down to market demand, frankly. And one of the things that we're seeing in the market now is that people are done kicking tires, kind of like Merv Adrian's keynote address, where he made a sort of Waiting for Godot play on it. It's not time to wait anymore. It's time to move. And one of the things that's exciting about this year for us is that we're seeing, across the board, a much greater appetite. We're being asked about our Hadoop offering in many, many of our customer engagements. Merv said it's about revenue; that's the new scorecard. So talk about the customers. What are the best customer use cases that you could share? Don't name names, but just talk generically about the most popular Hadoop-like environments where you're engaging customers. For me, the most important one, both strategically from an IBM perspective and also for customers to get value out of Hadoop, is warehouse modernization.
Which is effectively having Hadoop live alongside an established warehouse to act as that landing zone, to act as a day-zero archive, and also to enable people to offload transformation work. Well, tell us what's coming up for IBM around big data, around Hadoop. We've seen things with Watson and how that might play in, because a lot of the talk here at this show has been about where all the applications are, and one of the companies actually working on some of the applications is IBM, with what you do with Watson and reaching out to developers, the Watson cloud. So tell us a little bit about what's on the future horizon, and maybe talk a little bit about the application landscape and your plans around that. Yeah, so that's one of the things, working in this space and especially working with IBM Research, that keeps me really excited about working for IBM. IBM Research is truly elite. And there are two things coming out of Research that are going to be making their way into the product at some point in the very near future. One is, well, we have something called Big R now. So, the question John asked me earlier that I didn't finish answering, is SQL the gateway drug and is it the most important language: I said it's a language, an important one, and there are other important languages. So we have Big R; we're able to run R on Hadoop, not just with libraries, but with custom-built code, which is extremely novel. At scale. At scale. Because R was not designed for that kind of distributed environment. Exactly, that's right. Now, the way Big R works is it breaks up the workload by partitioning data. That will not work for many statistical operations, of course. A straightforward thing like a mean or an average, you can do that. But for large-scale machine learning, it falls apart. You will not get meaningful results.
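The partitioning point above can be demonstrated in a few lines: a mean combines exactly from per-partition partial results, but a naive "median of partition medians" is not, in general, the true median. This illustrates the general statistical principle, not Big R's actual internals; the data is invented.

```python
from statistics import median

# Invented dataset split across three "partitions" of uneven size.
partitions = [[1, 2, 3, 4, 5], [6, 7], [8, 9]]
data = [x for p in partitions for x in p]

# Mean: each partition ships a small (sum, count) partial result,
# and combining them gives the exact global mean.
partials = [(sum(p), len(p)) for p in partitions]
dist_mean = sum(s for s, _ in partials) / sum(n for _, n in partials)

# Median: combining per-partition medians does NOT give the true median.
naive_median = median(median(p) for p in partitions)   # 6.5
true_median = median(data)                             # 5
```

Operations with small combinable partials (sums, counts, mins, maxes) parallelize trivially; order statistics and most machine-learning algorithms do not, which is why a compiler that knows how to decompose each algorithm, as described next, is needed rather than blind data partitioning.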
So what our research group has been doing, which is unique across this industry, is they have extended this: they've built their own set of primitives and a compiler to take an R-like statistical language and basically treat it like a declarative language, very similar to SQL, where you don't care how the work is done; you just define what the operations need to be, what the data needs to turn into. They've done that for statistics, and they've built this compiler so that it can take advantage of parallel architectures. So where possible, these algorithms will be parallelized. And that's about as non-trivial a project in computer science as can happen. The other is entity analytics. That's another project that came out of Research. And again, understanding the context and meaning of what people say or type on their computers is another nearly impossible thing to do; context matters, audience matters, all those things. So we have this entity analytics framework, built on top of our text analytics, which also came out of Research but has been in our product for a long time, to be able to do really interesting things like matching identities, that sort of thing. All right, we'll look for that for sure. Thanks for coming on theCUBE, really appreciate it. Thanks for watching, and thanks for the nice comments you made before we came on camera. Your book here, Hadoop For Dummies: I tweeted it out on the CrowdChat, go check it out from Wiley, search on Amazon. Hadoop For Dummies gets you up to speed, really from Hadoop 2 forward. It's cutting-edge material on getting your arms around the value proposition of Hadoop quickly. It's really not for dummies; if you can actually read the book, you're no dummy. It's a good primer on what's going on out there. Thanks for coming on theCUBE, really appreciate it.
We'll be right back after this short break here at Hadoop Summit. Thank you very much.