Live from the San Jose Convention Center. Extracting the signal from the noise. It's the Cube, covering Hadoop Summit 2015. Brought to you by headline sponsor Hortonworks, and by EMC, Pivotal, IBM, Pentaho, Teradata, Syncsort, and Attunity. And now your hosts, John Furrier and George Gilbert.
Hello everyone. Welcome back to the Cube, live in Silicon Valley at Hadoop Summit 2015. This is the Cube, our flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier, with my co-host George Gilbert, our big data analyst at wikibon.com. Our next guest is Adriana Zubiri, Program Director, Big SQL and virtualization development at IBM. That is a big title. Calling card.
Yeah. A lot of people.
So a lot of responsibility. Welcome to the Cube.
Thank you.
So IBM is making a lot of moves. Obviously, we've been covering the IBM events with the Cube, so we're briefed on everything you guys are doing. Certainly a great story: cloud, mobile, social, and big data is a big part of it. We've done zillions of CrowdChats with the team. IBM and big data are kind of synonymous. A lot of players are coming in: you've got Oracle, EMC, HP. Big data is in every enterprise. I mean, Gartner's report about Hadoop: 50% going to be buying and evaluating production Hadoop. Still a big number. So there's a lot of push on Hadoop. You guys have a huge investment going on with Spark. We'll hear more about that on Monday. Bloomberg released a little bit of a teaser on that, kind of breaking the news. But what's going on? There's a lot going on in analytics. It's not just unstructured data. There's still SQL out there. There's still the engine, but there's open source here. Presto from Teradata is out there. Open source is booming. What's going on with IBM? Give us a quick update on IBM analytics, SQL on Hadoop, et cetera.
Well, actually, I think that in the world of big data, one of the reasons we had maybe slow adoption at the beginning, and now it's exploding, is that it's hard for people to change their mindset when you have a completely new technology. Companies need to invest in new skills. They have a lot already riding on SQL in their own companies with their warehouses. I don't think that warehouses are going to go anywhere anytime soon. Obviously, there are things that Hadoop is great for, but there are also all these investments they have. So I think that SQL gives our customers a very easy way to reuse the existing skills and the existing applications, and to exploit this new world of big data in many, many ways. I think that's why, last year, Merv Adrian, here at the summit, commented that 2015 is the year of SQL on Hadoop. There are so many vendors out there today with a SQL solution on Hadoop. As for what we have done in IBM with Big SQL: we have been investing in SQL technology for decades. So we know how to do SQL. We invented it. Actually, I like to brag about that once in a while. So what could we do for Hadoop? That's what we did with Big SQL two years ago. We took all the technology we had and we taught it Hadoop.
So as you were saying, there are more than a dozen solutions or products for SQL on Hadoop. What made it so easy for that sort of technology to diffuse to so many organizations? And then what's unique about IBM's implementation?
I think what's unique with us is that we took an approach that was a little bit different from the other major database vendors, because we think that Hadoop is an open ecosystem where we need to share. So with Big SQL, even though we are actually giving a very mature technology in terms of the engine, we don't have any proprietary data formats, right? We're completely open.
Like Parquet.
We actually support all data formats that exist today.
We don't have one that is proprietary to IBM. So we play nice in that sense. We support Parquet, ORC, any other data format out there today. At the same time, we actually have a very strong optimizer and all the technologies you can expect from a traditional relational engine. We have a very powerful SQL itself; the expressiveness of the SQL is one of the keys in this world.
And the broad coverage of the whole language.
Exactly. We have the same coverage that you expect in a regular traditional data warehouse. And that is important not only for the expressiveness of the SQL you can have, but for the simple fact that you can migrate your applications with very, very low effort. You don't have to modify them to adapt to the limitations you have in your SQL. You can use existing third-party applications and now run them on Hadoop with zero effort. So that's what really sets us apart.
So we love SQL on Hadoop because it's a dynamic environment. We've been covering companies like Hadapt, which got acquired by Teradata. It was a big thing. Every year there's a shiny new toy. It's streaming this year, and all this is going on, but they didn't make it as a standalone company. They were acquired, Hadapt in particular, and others. Why is it that SQL on Hadoop can't fund a startup to be successful? Is it because it's just a feature, or is it because it's part of a bigger picture? Is it an existing market space? Is it hard? All of the above? Why is it difficult to launch a company to corner the SQL on Hadoop market?
Well, I think that SQL, as a language, is very simple in the sense that you can write a parser and have a SQL engine. But what you really need to make it successful is a lot of technology behind it. The optimizer that we built, the cost-based optimizer, the query rewrites we have, all the smarts behind having a SQL engine, security, auditing: it's not something you can write in a couple of months.
It's something that takes years, and it takes a lot of technology to make it happen.
Oracle reminds us that they've been working on it for 37 years and counting. But John raises a point, which is that we have maybe two dozen products out there. If they're not standalone, what sort of platform should they be part of to make a use case that's an analytic data pipeline? Should it be an integrated SQL DBMS? Should it be deconstructed into something that's a query engine, machine learning, a streaming engine? All the things Oracle, for instance, baked together. Are we unbundling that? What is this going to look like?
Well, I think that you obviously need an ecosystem of things. You cannot have a hammer and use only one tool for everything. You need different tools for different things. And SQL is one of those tools. But you need many others. You need streaming, and now everybody's talking about Spark, which is a great MPP engine that is very general. But there are things in the SQL engine that I think customers need. For example, one of the things that I believe is important to have in SQL on Hadoop is federation. Customers want to issue queries, and they want to issue queries against their Hadoop. But they also want to bring in their data from their existing data warehouses, Oracle data. That's, for example, one of the things that we have with Big SQL that is unique in this space. And I think it is one big differentiator. So there are enterprise capabilities. You know, Hadoop is immature in many, many areas. And we brought a mature solution for SQL on Hadoop.
So about Spark, what's the impact of Spark? Because what we're now seeing is SQL on Hadoop, you've got Spark, you have all kinds of other potential software innovations happening in and around the data. So the tooling is there. I agree, there are multiple tools in the tool chest. How do customers deal with this? Because we talk about SQL on Hadoop.
Are customers speaking the same language, "give me SQL on Hadoop"? And how do you guys manage the engineering projects around that? So from a customer standpoint, what are they looking for? What's the language of the customer? And how does that translate into engineering?
Well, we work with a lot of customers to try to understand what they're trying to do around analytics, right? And different customers have different views on how to do analytics. There are customers that like to do analytics only through SQL. There are other customers that say, no, I'm going to move into something else, like R or Spark. So obviously, we want to try to cater to all the groups as much as we can. One of the things we're doing with Big SQL right now is trying to integrate with Spark. We don't need to compete with Spark, right? What we want is to be able, for example, in any data flow that you have, to call Big SQL instead of calling Spark SQL, and with that have all the extra functionality and power that you can have with Big SQL. Or from Big SQL, you want to be able to call any Spark application, get the data back, and continue processing in SQL. So they're complementary. But again, it depends on what customers are doing and how they want to invest themselves.
How about federation? That's come up in conversations in the past: federation of access, federation of data. How do you guys view that as a guiding principle? What's the mindset at IBM? We look at the data out there and the access to the data, software applications. And now with the tsunami of apps coming, workloads having technology under the covers like virtualization and other things, there's a data issue out there.
Well, I think that having the opinion that all the data will move into one place is very naive. I think that when you talk to customers, especially those with very large data warehouses, nobody wants to go to one place.
Many customers have multiple data warehouses from different vendors. That's the reality, and it's not going to go away. So we needed a way to give customers the ability to access all that data. We don't want them to have to move the data. That's where virtualization comes into play. And if you have that capability coming from the big data world into the rest, I think it's very powerful, because when you actually do that, you have to have some intelligence. You can't just send the query and bring data back. You need to have some intelligence about how to create that plan while trying to minimize data movement. And I think there's a lot of technology we have had in IBM for years that actually does that.
I guess I'm understanding your title differently now, as opposed to virtualization like a hypervisor, you know, abstracting the OS or the server. It sounds like you mean data virtualization. And that's something that, with the other SQL on Hadoop engines, with the possible exception of Presto, we haven't really heard much about. Tell us some more.
That sounds great. And it's a great trend too. I mean, some cutting-edge startups like Primary Data recently got funded, the ex-Fusion-io guys, you know, David Flynn. Those guys are smart. Yeah, virtualizing data is interesting, right? Because now you can do things with data completely agnostic of where it travels.
Correct. And in particular around Hadoop, some people want to use Hadoop as a sandbox. They want to play with the data and try to understand it. And they don't want to have to move the data around from their warehouses in order to try to understand it. Now, maybe once they play enough, they decide to actually move the data. But while you're actually exploring, you don't want to move that data around. You just want to do the query and let the data live where it is.
So from Big SQL today, you can access, in the same query, data from Teradata, Oracle, Netezza, DB2, and DB2 for z/OS.
But this is, I guess the techie term is, non-trivial.
No, it's not.
IBM started earlier, but if Oracle worked 37 years on its query optimizer, that was when it knew where the data was within Oracle itself. Now, if you have to go figure out where all the data is outside IBM or the IBM engine, for one, that sounds like a platform for a broad scale of analytics. But two, it sounds like a very large effort.
We have had technology to do this under Federation Server for a very long time. And we have a very smart optimizer that knows how to decompose queries, send only the pieces of queries that make sense to the other data sources, exploit those servers to do part of the work, and minimize the data travel.
So it gets incorporated. So that's technology you've had and you're bringing it over to...
Correct. We have been reusing a lot of the technology that we have developed over the decades and are now applying it to Hadoop. That's why we bring very mature technology into a very new market, and it makes us very different.
So will Big SQL be the foundation for your analytic apps, or will there be other engines?
I think that there is no one answer to this. Again, it really depends on how customers want to deploy in their shops. There will certainly be a big number of customers that still want to keep SQL at the center of their universe, and we are there for them. But there will be customers that want to move into the new thing, and we will be there for them too. That's the great thing about IBM, right? We can have, I wouldn't say all, but we are a large company. We come from a big ecosystem.
So big data has become fashionable and well explained to some extent over the last few years. Fast data is becoming something that we need to consider in terms of time from ingest to action.
SQL isn't always assumed to be the way to do that. How should we think about fitting that into an analytic application?
What do you mean by fast data?
Fast data is about the time between when you ingest or capture the data and when you can use it to drive a decision.
We're talking about streaming data, streaming analytics. We do have InfoSphere Streams; we have a solution for that within IBM.
When you say SQL doesn't have to be the only answer, that's an example of...
That's an example. Even within that, when we talk about SQL on Hadoop and talk about performance: last year, as I mentioned, there were something like 17 SQL engines on Hadoop claiming to be the fastest when I came to the Hadoop Summit. So when I went back to IBM, I said, you know, we have to do something about this. This is a wild west of everybody claiming to be...
We've got to beat their benchmarks.
Yeah, so we actually published the first independently audited benchmark on SQL on Hadoop, which is publicly available. It was audited by an independent auditor, and we showed that we are faster, and nobody has challenged that since we published it last October.
A lot of startups are groping for that benchmark just to have a flag to plant. Obviously, IBM's got big R&D, but I've got to ask you, to kind of end the segment: what's the coolest thing you're working on right now in terms of technology? I mean, there's a lot of stuff. SQL on Hadoop, for the enterprise, it's like waking up, getting dressed, brushing your teeth. It's out there, it's normal, operationally stable. People use SQL; it's pretty standard. But now Hadoop gives more folks. I get that. There's going to be a lot of stuff orbiting around this, like Spark, which you mentioned. So what are the cool things that you're working on right now that you could share?
I think there are two parts to cool.
One is cool from the marketing perspective and also from the application perspective: certainly our integration with Spark is one of the coolest things we're working on, right? I think that's interesting. But from the geeky side, it's very interesting to try to solve the challenges of having a SQL engine on top of something that was not meant to have a SQL engine on top, right? You have to deal with, you know, how do you do co-located joins, or how do you do certain things that are not meant to be?
This is the virtualization part.
No, this is actually how to run SQL on HDFS, on a file system on Hadoop that was not meant to, you know, have the ownership. We don't have the ownership of the data that you have in a traditional data warehouse.
Well, Spark does the same thing, where it stays away from storage.
Yeah, but in data warehouses, a lot relies on how you optimize your queries: knowing how you distribute your data, where your data is, how you can co-locate your joins. So there is a lot of technology underneath. So that's the geeky part, maybe. But I think your question was more...
We like the geeky part. Bring the geek on. We get our geek on. We're getting a hook here. Okay, so a quick summary as we end the segment. For the folks out there, what's the show about this year? You mentioned that in the past couple of years people were flexing their muscles; SQL on Hadoop startups have kind of leveled out, consolidated. It's operationalized, as you mentioned. What's this year's theme? Share with the folks watching. What's going on at Hadoop Summit? What does it mean for the big data ecosystem? What's the vibe? Where's it going? What's your forecast?
I think that the big difference, at least personally for me, between last year and this year is that last year we talked a lot about the technology and how we're building this technology.
I think this year is more about what people are doing with it. We're finally having, you know, customers in production. We have people doing real things. We have large customers exploiting it and seeing that they can answer questions that they could not even get answers to before. Today, for example, I have a presentation I'm going to do here at the summit about what Seagate is doing with our SQL on Hadoop solution. I think it's very interesting to see how they are changing their businesses based on the new things they can do.
Awesome. IBM here inside the Cube sharing the insights. Obviously analytics, and if you're hungry and you don't know what to make, always call Watson up; they have the big data, they can tell you what recipes to make. Obviously a little play on IBM's Watson, which has been a great example of how analytics fits the real world. The stuff under the covers, SQL, Spark, a lot of greatness coming. We'll see more next week. We'll see you in San Francisco with the Cube. We'll be right back, live in Silicon Valley here at Hadoop Summit 2015. I'm John Furrier, with George Gilbert, our big data analyst at Wikibon. Go to crowdchat.net slash Hadoop Summit and join the conversation. We'll be right back after this short break.