Live from the San Jose Convention Center, extracting the signal from the noise. It's theCUBE, covering Hadoop Summit 2015, brought to you by headline sponsor Hortonworks, and by EMC, Pivotal, IBM, Pentaho, Teradata, Syncsort, and by Attunity. And now your hosts, John Furrier and George Gilbert.

Okay, welcome back, everyone. We are live here in Silicon Valley at Hadoop Summit 2015. This is theCUBE, our flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier, the founder of SiliconANGLE, with my co-host George Gilbert, Big Data Analyst at Wikibon.com, and Anjul Bhambhri, VP of Big Data Analytics at IBM. Welcome back to theCUBE. Great to see you.

Thank you, John.

So, this ecosystem is Big Data. Hadoop's a big part of it, but Big Data is a bigger conversation, and yet Hadoop is growing. Gartner's reporting numbers that over 50% of enterprises, which I think might be a little bit light, might be more, are engaging with Hadoop. What's new with IBM, what's happening with the ecosystem of Hadoop, and how are you guys playing with it?

Sure. You know, we've been on this journey where we embraced Hadoop almost five years ago, and have done work in the Hadoop core, as well as built capabilities on top of Hadoop, like SQL on Hadoop, bringing things like text analytics and machine learning, embracing R, and enabling R programmers and users to use the scalability of Hadoop and build algorithms that can scale for Big Data. So we've definitely added value on top, and recently there was the formation of the Open Data Platform, where over a dozen industry vendors in the space joined to really bring more standardization around Hadoop. I mean, one thing that we all acknowledge is that Hadoop is not one project, but multiple projects, like 20 projects. And to get standardization around those and to get interoperability, something like ODP was needed.
So we are definitely very excited about that being the next step, which I absolutely feel will drive even more adoption and grow the Hadoop market faster.

We were saying in our intro today that having ODP out there is kind of like Google search. There's the organic side, which is open source, pure Hadoop open source. And then there's also, on the right-hand side, the ads, which are stable. You know what you're getting when you click on ads at Google. So that was kind of my weird metaphor, but with a standard ODP, you guys can now support a hardened version around that at IBM and elsewhere, while not compromising the innovation in open source. Is that kind of the strategy? Because customers aren't moving as fast as open source. Is that kind of the thesis behind it?

Absolutely, John. The essence of ODP is all around compatibility and collaboration, right? And it's all while working in the context of Apache, so it's not like we deviate from that. What this allows us to do is two-fold. One is it really allows all of us to pool our resources and harden the Hadoop core, right? Because instead of each vendor working on different versions of these different projects, making fixes, and then losing time while those fixes get into the code base and everybody else picks them up, if we standardize around that, if we are using the same versions and testing the same versions, that makes the whole Hadoop core much more stable and reliable, and our resources get pooled. And at the same time, it helps us to innovate on top of the Hadoop core, right? Which is around text, around machine learning, around SQL engines. Those are sort of the horizontal engines, if you will, which are built on top of the Hadoop core. So that innovation continues. And the same thing goes for our ISVs who are building vertical applications leveraging Hadoop.
It gives them an opportunity now that, instead of testing their applications against seven or 10 different versions of Hadoop distributions, they build once, and they are focused on innovating on those solutions and not spending time testing against every version of Hadoop. So both the horizontal engines and the vertical applications get focused innovation, and go faster. And there's a broader set of customers that is now able to use these, right? Because it's almost like you don't want customers to get locked in because they went with one vendor's version of Hadoop. That's unfortunate, right? In some sense, we are saying the software is open source, but if you lock these customers in in other ways, I think that's hurting the adoption and the growth of Hadoop.

Would it be fair to say that Apache solved the problem of the upstream standardization of individual projects, but that Hadoop has grown so fast that it's delivered innovation, while the governance model of Apache doesn't lend itself to standardizing the breadth of that innovation? Something like what, 17 projects in a typical distro? So ODP was needed downstream to standardize that core. Tell us exactly what's in the core, then how and when the vendors will align around that core, and when that starts to expand.

Sure, that's an excellent point. Like you said, if it was only one project, and that's all there was, then people could just download it from Apache itself and be done. The very fact that there are companies that have formed just around Hadoop distributions to start with proves the point that that kind of discipline, if you will, was needed. Now ODP takes it to the next level. That was a good step, but you don't want fragmentation and fracturing happening because the companies that were trying to bring sanity to this are now leading to fracturing, right?
So while ODP right now comprises some core Hadoop projects, like HDFS, like MapReduce and YARN, which really enable everybody to write distributed applications, Ambari is another project that was added to it, so it's really these four projects. And something like Ambari was needed so that there is a common way for all these different projects, as well as the value-adds on top, to be installed and configured, and to do the monitoring and alerting in a standard way. Because if you have Hadoop projects or services running in the cluster, then you have the value-adds running in the cluster, then you have ISV applications running in the cluster, and just to administer all of this you're dealing with five different consoles, that's a nightmare, right? So the core started with these four projects, but that's just the first step, and that'll help bring interoperability to an extent. But I think it would be incorrect to say that these are the only four projects that are needed to bring complete interoperability.

So, what comes next?

We have to look at things like HBase, Hive, Sqoop, right? What is going to be the common way to ingest data? As information is being stored in HBase and Hive, and there are value-adds that are leveraging those components, their standardization is needed as well. So those are some things being discussed in the ODP technical working groups, because we have to work all together in the community to agree on those things and then bring them to ODP.

So talk about what's next beyond Hadoop. I had a chat at IBM InterConnect and we were talking about Hadoop, and Hadoop is a small part of the overall IBM vision of big data. Customers also have the same perspective, where they have mindshare with Hadoop, a great place to store stuff, but acting on the data is a big discussion.
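The MapReduce model that the ODP core (HDFS, MapReduce, YARN) enables can be sketched in a few lines of plain Python. This is a toy, single-process word count illustrating the map, shuffle, and reduce phases; it is not tied to any real Hadoop API, and the function names are just illustrative:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle: sort by key so equal words are adjacent, then reduce each
    # group by summing its counts -- the same shape as a MapReduce job,
    # minus the distribution across a cluster.
    counts = {}
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        counts[word] = sum(c for _, c in group)
    return counts

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(map_phase(lines))
print(counts)  # e.g. counts["the"] == 3, counts["fox"] == 2
```

On a real cluster, YARN would schedule the mappers and reducers across nodes and HDFS would hold the input splits and output; the per-record logic stays this simple.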
What technology are you guys selling on top of Hadoop? What can you share in terms of the use cases you're working on with customers? What products are working for you guys? Can you share a little bit of insight into what's happening within IBM with respect to customer deployments?

Yeah, sure. So you're absolutely right, John, that Hadoop is obviously one of the frameworks, one of the processing engines, from a MapReduce standpoint, that we are leveraging. But in terms of what's needed now to really bring value from data and turn it into intelligent insights: how do you prep this data? How do you cleanse this data? How do you wrangle this data? We've all heard from every customer that they're spending 70 to 80% of their time shaping the data, right? Getting it really ready so that they can get value out of it. So there is work happening in IBM, as well as outside, around data wrangling and data shaping, and being able to do it both programmatically and using things like SQL for transformation, using things like text analytics and machine learning. And then you need absolutely very powerful visualization, and we are embracing D3 as an extensible framework from a visualization standpoint, just to make it easier and easier to shape that data. Then, of course, from an IBM standpoint, we have our predictive and prescriptive analytics portfolio around SPSS. So certainly, in terms of where we are going, we are going to be leveraging projects from Hadoop as well as Spark to be able to scale out the predictive and prescriptive models and algorithms.
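The "70 to 80% of the time shaping the data" point above is the kind of work a wrangling step does before any analytics run. A minimal sketch in plain Python, with made-up records and a hypothetical `wrangle` helper (not any IBM or Hadoop API): trim whitespace, normalize names, coerce numeric strings, and drop rows that can't be repaired:

```python
def wrangle(record):
    # Toy shaping step: normalize the name, coerce the amount to a float,
    # and return None for rows that are missing or unparseable.
    name = record.get("name", "").strip().title()
    if not name:
        return None  # drop rows with no usable key
    try:
        # "1,200.50" -> 1200.5 (strip thousands separators before parsing)
        amount = float(str(record.get("amount", "")).replace(",", ""))
    except ValueError:
        return None  # drop rows whose amount isn't numeric
    return {"name": name, "amount": amount}

raw = [
    {"name": "  alice ", "amount": "1,200.50"},  # messy but recoverable
    {"name": "",         "amount": "10"},        # missing key -> dropped
    {"name": "bob",      "amount": "oops"},      # bad number -> dropped
]
clean = [r for r in (wrangle(x) for x in raw) if r is not None]
```

The same per-record function could be pushed down as a mapper over a much larger dataset; the point is that only one of the three raw rows survives untouched, which is why shaping dominates project time.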
How do you explain to customers the comment you made earlier? Because I liked how you described the horizontal engines and then the verticals, because vertical stacks have been around for decades, right? And what's constantly scalable is a cloud DevOps concept, and we've been speculating on theCUBE that analytics and big data is the killer app that DevOps has been waiting for. Because now, with the scale piece, you have new kinds of innovations, net new capabilities that are emerging, not just the blocking and tackling stuff you see with offload and analytics, but real time, to get to that cognitive computing or, as George coined the term, systems of intelligence.

Well, borrowed from Geoffrey Moore, but we want to dive into actually how they're built, with you and others. Spark and Hadoop take very different philosophies for how to build systems of intelligence. When you look at a customer problem, you'll see Hadoop on one side, Spark on the other. How do you make the choice of which to use? Spark is obviously less mature, but much simpler; Hadoop is getting more hardened, but there's that complexity factor.

So let's look at Spark, right? There are obviously projects from Apache Hadoop that are still very relevant, even when you are building applications leveraging the Spark core processing engine. So you still need something like Ambari. There is data that is going to be stored somewhere, so HDFS is certainly still very relevant, right? YARN is relevant because you're still running these applications in a clustered environment. Now, when it comes to the benefits of Spark, where would you use one over the other? When you say one over the other, I would frame it as Hadoop MapReduce versus the Spark core processing engine.
So of course we know Hadoop MapReduce is very batch-oriented in nature, so applications for which that kind of latency is acceptable can continue to use Hadoop MapReduce. But when you look at Spark, there are benefits in terms of, obviously, performance, and it is also functionally very rich. So the kind of applications it is enabling are interactive analytic applications, where you're still dealing with big data, but latency is extremely important. Low latency is key. So when those kinds of requirements have to be met, Spark certainly offers advantages in terms of performance. There are details we could get into, but running in memory makes it much, much faster. For the interim results that are being stored, Spark is using local storage. And from an execution standpoint, you're not starting and stopping the JVM every time, right? You have the JVM running on every node, so the tasks that have to be run, they run as threads in that JVM, whereas with Hadoop MapReduce you're starting and stopping JVMs. And the scheduling in Spark is much better, so that certainly improves performance. Even if you take the in-memory piece away, you'll get at least three times the performance just based on these other benefits.

Yeah. Anjul, I've got to ask you about what's going on here at the show for IBM and the conversations that you're involved in. What are the top three conversations you're having here at the show around IBM with customers and partners and people on the show floor?

Sure. So there are a lot of questions coming up around ODP, right? People really care about this interoperability.
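The in-memory point above can be sketched without a cluster. This toy, plain-Python analogy (no real Spark or Hadoop API involved) counts how many times each record gets re-parsed: the MapReduce-style path re-reads the input for every pass, while the Spark-style path parses once, keeps the results in memory, and reuses them across actions, the way `rdd.cache()` does:

```python
calls = {"n": 0}

def expensive_parse(line):
    # Stand-in for costly per-record work (deserialization, parsing, I/O).
    calls["n"] += 1
    return int(line)

data = ["1", "2", "3"]

# MapReduce-style: every pass over the data re-parses every record,
# analogous to each job re-reading its input from HDFS.
total = sum(expensive_parse(x) for x in data)
maximum = max(expensive_parse(x) for x in data)
passes_without_cache = calls["n"]   # two passes x three records = 6 parses

# Spark-style: parse once, hold the results in memory, reuse them
# across multiple actions (like caching an RDD or DataFrame).
calls["n"] = 0
cached = [expensive_parse(x) for x in data]
total, maximum = sum(cached), max(cached)
passes_with_cache = calls["n"]      # three parses, however many actions follow
```

The JVM-reuse benefit mentioned above is a separate saving on top of this: Spark runs tasks as threads in an already-running executor JVM instead of paying JVM startup cost per task.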
They are very wary of vendor lock-in, and the promise that ODP has shown is that it's going to offer them a much richer technology palette to work with, where they could use, say, the Hadoop core from vendor A, a SQL engine from vendor B, a data wrangling tool from vendor C. Getting the best of breed from different vendors is something that they are excited about. So that's a topic that has been coming up. And then if you talk to the partners, they want to know: with this, can they build once and run anywhere and everywhere?

Customers want trust.

Yeah.

At large enterprises, they don't want that scorpion bite, you know, the joke where the scorpion crosses the river. They want predictability.

Yes.

They want some standardization. Are you seeing that too? And what comments and color can you share on that?

Yeah. I met some of our customers from their IT groups, and they feel that ODP is going to bring some sanity into their lives, where there would be one way, like we were talking about with Ambari, right, to install and configure and monitor the cluster. It has been a nightmare for them to deal with Hadoop administration, if you will. And one of them was sharing that, you know, I can finally go on my vacation, which I used to keep planning, and every time something would go down in my cluster and I would hear, this is mission critical, you need to be here to fix this. So I don't know whether these people can go on vacation tomorrow, but our goal is to bring sanity to the lives of IT operations folks.

Yeah, and also support too.
You guys are delivering other products, and you need a stable core with ODP. That makes a lot of sense.

Yeah, and it brings a whole degree of cohesion to their IT infrastructure. So IT is happy, and ISVs feel that their market is being broadened.

Well, they want delivery too. They've got to deliver a solution. They don't want to have more costs. They want gross profit.

Yeah, yeah. They want happy customers that write big checks, right? And like you were saying, cloud as a delivery model is something that a lot of customers are also looking at. At IBM, of course, for both Hadoop and Spark, we have our services available on the cloud.

Do you wrap those in? I'm sorry to interrupt, but this is a key point. Do you wrap those in different tooling, or do you use Ambari?

We are using Ambari right now.

On the cloud?

Yeah.

When you say right now, does that mean there's...

Before ODP, we were using a different set of technologies, but now we are standardizing on Ambari.

Okay, great. Final question for you. I know we're getting the hook here, tight on time. What is the vibe of the show? What is the main thing going on here in Silicon Valley at this show? For the folks that couldn't make it, what's the key message you'd send to your friends out there?

Yeah. At this event, I have seen that the number of customer sessions has grown much more than in previous years. I would say it's almost a 50-50 split, whereas going back even a couple of years at the Hadoop Summit, it used to be much more just about the technology and less about customer adoption. So it's very nice to see that now more people are sharing their use cases. They are talking about the value that they're getting from Hadoop for the business, right? And that's, I think...

All right. Anjul Bhambhri, VP of Big Data Analytics at IBM.
A step in the right direction. Thank you very much for your insights. Very cognitive insights here inside theCUBE, cognitive computing, sharing the data with you. This is theCUBE. We'll be right back after this short break.