 From New York, it's theCUBE. Covering Big Data New York City 2016. Brought to you by headline sponsors, Cisco, IBM, NVIDIA, and our ecosystem sponsors. Now, here are your hosts, Dave Vellante and Peter Burris. Welcome back to the Big Apple, everybody. This is theCUBE, the worldwide leader in live tech coverage. Peter Burris and I are thrilled to be here in New York City. John Furrier, of course, is at Splunk .conf, knocking the town there. Bernie Schieffer is here, he's an IBM Fellow. Bernie, great to see you. Nice to be here again. Were you on the rooftop last night? I sure was. Beautiful view. It was an excellent announcement by you guys, a fantastic event. IBM obviously knows how to throw a party, but there was a lot of meat on the bone as well. I had a lot of people ask me, is this really an IBM event? So I thought that was a compliment. All right. Well, it's a little funky down here, right? There's not like big hotels and so, great job. So anyway, congratulations on that. But we're here. And a number of extremely happy developers. Yeah, yeah, right. IBM gave away some great prizes to you guys. $100,000. Yeah. The first prize was $50,000, and two guys split it, and the second prize was $25,000, and a single individual won that. So you had three people walk away with $25,000 each, and then some other nice prizes as well. It's going to be one of the most talked about things on the show floor today. Yes, so. And Rob Thomas said he was going to do it again. So that makes it. Yeah, it was good to see Rob up there handing out checks. We like this. Good role for him. And so, we're here to talk a little bit about ODPI and Big SQL. So what's your role at IBM? And let's get into it. So I'm a longtime IBMer. I have been working in the data field for my entire career, over 31 years. But for the last four or five years, I've moved strongly into the world of big data, with the IBM distribution of Hadoop called BigInsights. 
And a couple of years ago, we created a new offering in BigInsights called Big SQL, the most powerful SQL-on-Hadoop engine that we have. Okay, and you're building that into a new Hadoop distro with Hortonworks, is that right? And you're using ODPI as the framework? Talk a little bit more about that. So we've had Big SQL for a number of years and it was distributed via our own IBM distribution of Hadoop, which continues. But the market in the distribution space is fragmented. Clients have chosen typically one or two distributions as their primary distributions. And unfortunately, not everyone has chosen BigInsights. So, just like in the days of relational databases, where you supported different operating systems, HP-UX, Solaris, AIX, Linux, Windows, and so on, you can make software portable to different platforms. And to me, Hadoop is very much like a platform, a big data operating system, if you can call it that. And so with work, you can make anything run just about anywhere. It is after all just software. But Linux standardization helped make it easier to port and run on different levels of Linux. So ODPI is trying to do the same thing for the Hadoop ecosystem. So what is Hadoop really? The classic definition of Hadoop, which people still bring up, is that it's HDFS, the persistent file system, plus the MapReduce paradigm for processing data. But to me, today, it's much more than that. Why? Because people have replaced components of that. They don't always use HDFS, they use GPFS, or there are other open source file systems, Alluxio and so on, coming along. And really, MapReduce in the classic sense is maybe more of a legacy system today. Spark is the hot topic. I mean, you heard that last night at the Data First event. So, but does that mean Hadoop is dead? Absolutely not. I mean, it couldn't be more vibrant. Look at this conference. I mean, it's exploding. So what is it today? 
To me, I think Hadoop should be redefined as an ecosystem of collaborative tools. Very much like Unix became an ecosystem of tools. And so we need to be able to interoperate software that sits on top of that platform. So the more similar different distributions are, the easier it is for people to build middleware or applications on top of Hadoop that can run on a choice of Hadoop distributions. Okay. And more rapidly, you can create experience, understanding, and knowledge so we can start focusing on use cases and the domain expertise associated with use cases and the real insights that these tools are supposed to deliver. Which is the real value everyone wants. The rest is sort of plumbing, right? And you want all the tools to run on a variety of distributions. Well, I got a lot of questions. Actually, we want to come back to IBM's distribution, Hadoop distribution. I mean, everybody had a Hadoop distro back in the day. I mean, it was Fujitsu and WANdisco and, you know, SiliconANGLE, right? What's going on? I didn't realize. No, we used to joke about it because everybody was announcing them. And so, you know, IBM maintained its Hadoop distro. You got some customers that, you know, you're supporting, obviously. But it's not make or break for IBM. And if you're not the worldwide leader in Hadoop distros, it doesn't really make a bit of difference to your business, right? You're doing it because you... Well, maybe a bit of difference, but... Maybe a tiny bit, right? But I mean, it's got Watson and analytics and cognitive and AI and machine learning and all this new stuff. And so, you maintain that because you've got a customer base, right? And there are clients who really value the, you know, superb level of support that IBM, you know, offers for all of its products around the world, 24 by 7. So there's a comfort zone in saying, when you buy a Hadoop distribution from IBM, it's going to be around and it's going to be supported and it's going to be supported well. 
Right, and you can make a case that this is more integrated and we're going to support it and blah, blah, blah. Okay, great. But it's not like the be-all and end-all of Hadoop distros. Okay, so we just wanted to sort of establish that. So, as you pointed out, you got to support multiple, you know, distros and, you know, partners. ODPI is, you guys are involved, obviously, Hortonworks and others. Talk a little bit more about... Many others. Talk about ODPI, why did ODPI come about? There were many naysayers when it first came about. You know, particularly the two other major, you know, distro vendors. What is ODPI all about? What has it done for businesses? So, I view it as having two levels. It's brought groups together that at some level compete in the marketplace. Say Hortonworks and IBM BigInsights. And at the same time, we also have the desire to grow the market, to enable clients to be successful. And as you said, to generate those insights. The holy grail of all this big data activity is not the data. That's just sitting there doing nothing. It's to get insight from that data. So, again, an analogy I'll draw is that, you know, in the database space, IBM has competed and continues to compete vigorously with the Oracles and the Microsoft SQL Servers of the world. And yet, at the level of standardizing SQL, we all work together. So, you know, there's no disconnect between competing at the level of your product offerings and collaborating at standardizing SQL for the benefit of clients and the industry as a whole. And so, ODPI, to me, is an exact analogy of that, in which you want to bring different companies together that compete at one level, but they collaborate to make sure that the plumbing is sufficiently common and standard. So that the layer of middleware and applications on top has an easier time, can get more traction, get more clients, and deliver more value. 
And so, there was a whole bunch of companies, and I would draw them into tiers. There's the foundational ones. There's four or five that produce Hadoop distributions. The biggest, the most famous of those are Hortonworks and IBM. Then there are people who sit on top and there's even consulting companies as well. So, there are different layers of companies that use Hadoop. And customers. And customers as well. So, the four core components were HDFS, MapReduce, Ambari, and YARN. And then Spark comes into play. Obviously, IBM's putting huge emphasis on Spark. You mentioned, you know, that MapReduce, in many respects, is legacy. So, how does ODPI evolve to include new innovations? Well, you have to start somewhere and you have to start with what a group of people think they can standardize. Imagine if you tried to standardize machine learning. Well, that'd be tricky, right? Because, you know, is it Spark? Is it TensorFlow? I mean, there's this profusion of techniques and companies. So, that means that that part of the ecosystem is not ready to be standardized. So, they had to pick the things that were most ready to be standardized, that were already quite similar. You know, that maybe they were like 98% the same. So, that last 2% is a solvable problem. And that's what they did. So, HDFS is relatively mature. You know, HDFS is not evolving so quickly. And so, it was easier to converge. Same with, you know, MapReduce and also YARN. Ambari is, you know, the next wave. So, that's kind of the one that's actively being worked on right now. And Ambari has, you know, installation, configuration, and monitoring. And the first piece of Ambari that they're working on is actually to, you know, make the deployment the same. In the same way that, think back to Windows. You know, there used to be a gazillion different Windows installation methods, right? 
And then there was sort of a standard Windows installer and most companies converged on that. And that made for a similar, you know, install experience, then similar entries in the registry and so on. After a couple of lawsuits. And that's one of the things where you're trying to get collaboration, cooperation early on. So that you reduce the uncertainty about how some of these things are going to work. And collaboration where you're not really differentiating. You know, people are not going to buy your distribution because, you know, you have an extra tweak on your Ambari installer. I mean, that's not really where you're creating value. You're just possibly creating pain if some other piece of software that you really want can't be installed on the distribution that you've chosen. Okay, and then I want to unpack the UNIX analogy versus the SQL analogy. And I'm thinking the SQL analogy maybe is better, but I want to explore that a little bit. So, you know, the UNIX analogy, you know, UNIX ended up not working, really, as well as we had hoped. You know, the vision of having sort of binary compatibility. Well, you remember ANDF. Yeah, well, exactly. ANDF didn't quite play out. But we're talking about a level up above. Well, but so that's what I'm trying to understand: was Linux that level above that? And is SQL maybe the better analogy, which you used as well, for what's going on with... So no analogy is perfect. You know, I think if Linus hadn't come along and invented Linux, who knows what would have happened in the world of UNIX? You know, there were... Sun might still be here. Possibly. Right, so I think SQL is a good analogy because of the active participation of all the major SQL vendors, even the ones that are, you know, definitely competing head on, you know, IBM and Oracle being a good example. All right, so Big SQL and SQL generally in big data, right? 
It's like the killer app for Hadoop. Ironically, right? Because, you know, when I was first introduced to Hadoop, it was kind of like the NoSQL space. It kind of turned into the NewSQL space. So I... That's a way of putting it. Oh, it really is. And I think it's not that SQL is perfect, but, you know, the thing is that it is powerful. There are abundant skills and there's lots of tools. And those are things that really help people get going. And I think it's that second point that's so dominant, that there was an enormous amount of experience around Unix that wanted to focus more time and attention on delivering value well above it. And so when you came up with a common way of looking at Unix that worked very well, obviously Linux, then people gravitated towards it. And it was the experience that made that so successful. And the same thing in SQL. There's an enormous amount of experience around SQL that's manifest in tools, because the tools are pedagogical in that regard. And so at the end of the day, it's the community, the customer base that's driving this new notion of the commons around skills and requiring this kind of new approach to focusing on the underlying tool set so that that skill can be applied to bigger business problems. Is that accurate? I would agree with that. You know, it raises the interface level to one that is known, proven to work. Familiar, it creates a new commons. Right. Okay, so what else can you tell us about what's going on at ODPI? Like we were talking about Spark a little bit. Does Spark fit in here? I mean, IBM's obviously making a huge investment in Spark. Are you pushing for Spark? Spark is my equal partner in ODPI. So, you know, Spark is an Apache project. So, I mean, a first-class citizen, I meant to say. Right, so it's already being distributed by most, if not all, of the Hadoop distributions. I know MapR has it, Cloudera has it, Hortonworks has it, IOP and BigInsights have it. So, you know. 
That little Databricks company, right? Yeah, yeah, yeah. Well, they don't have their own distribution. They work on Spark and Spark fits into the Hadoop ecosystem, but of course, Spark also runs outside of a Hadoop ecosystem, which makes it interesting. So, but to come to your question of Spark for ODPI, you know, I'm gazing into my crystal ball. I'm going to say I would love to see Spark be a next project for ODPI to tackle. I think it makes a lot of sense, especially since, as we already talked about, MapReduce is kind of this legacy variant of the Hadoop ecosystem. You know, kind of like Mahout is a little bit legacy as well. So, it has a natural successor, more memory centric, you know, compute intensive, you know, has a lot of backing from many people, very active Apache project, seems like a really good candidate as a next wave. Now, you know, I'm not that deep into the org, so I don't know exactly where it is on their roadmap. I think it would make a lot of sense. All right, good. All right, Bernie, we'll give you the last word on Big Data Week, Big Data NYC, Hadoop, Strata Hadoop. Well, great to be here, great to see the energy. I'm, you know, I'm pleased to be able to present tomorrow about the availability of Big SQL on Hortonworks, something that many clients have asked for. And, you know, I've had other clients say, and, you know, when are Cloudera and MapR going to join ODPI, so we can have Big SQL on Cloudera and MapR as well. Ha ha ha, when ODPI gives up on Ambari, that's when... Ha ha ha, Bernie, thanks very much for coming on theCUBE. Really appreciate it. Thanks for the invitation, great to meet you. All right, keep right there, buddy. We'll be right back with our next guest right after this short break.