from New York, extracting the signal from the noise. It's theCUBE, covering Spark Summit East, brought to you by Spark Summit. Now your hosts, Dave Vellante and George Gilbert. Spark Summit East, this is theCUBE. theCUBE goes out to the events, we extract the signal from the noise. This is day one for Spark Summit East. We're running a crowd chat, crowdchat.net/sparksummit, John Furrier's in there, going crazy with Bert Lattimore. We are here, wall-to-wall coverage. Anjul Bhambhri is here, she's the Vice President of Engineering at IBM, focused on big data. Anjul, great to see you again. Thanks for coming back in theCUBE. Thank you for having me. So, IBM, you guys are doing a lot with Spark. Some investments that you've made, so give us the update, what's happening? Sure. I mean, you know that since 2010 we've really been focused on how we help our customers with big data and analytics. We embraced Apache Hadoop, and then, several years ago, we felt that Spark is really a game-changing technology from an analytics standpoint. So we've embraced Spark, we are re-platforming our products on Spark, we opened a Spark Technology Center in San Francisco, so you cannot miss it. We actually were there, did a short interview with Rob Thomas. Yeah, yeah, so that's on Market Street, and that's where we have set up a team which has committers, and they are working in the open source, all the way from fixing bugs to contributing to Spark Core and Spark SQL, so there is a lot happening there. Last year, we open sourced SystemML, where we really want the machine learning community to embrace it, extend it further, and integrate it with MLlib. It's like SQL for machine learning, that's how I would like to describe it, and it has a very advanced machine learning algorithm optimizer. And of course, we are helping our customers understand Spark and where to leverage it.
We also ship it as part of our products, both on-prem and in the cloud. We ship it as part of BigInsights, which is our Hadoop offering, so that customers, as they need to transition from MapReduce to Spark for more real-time work, can do that, and we have Spark as a service in the cloud, so that's... So SQL is like the killer app. Big demand for Spark SQL, is that right? That's true, that's right. And now you're saying that SystemML is like SQL for machine learning. So SQL, I mean, I guess I know why, but tell our audience, why SQL? Why is it so attractive to developers? So when you look at what people are trying to do with technologies like Spark, some of the use cases that folks start with still revolve around the warehouse, and like Matei was saying this morning, I don't know whether it was Matei or Ali, just-in-time warehousing is a very popular use case. The moment you come to warehousing, yes, everybody wants it to be more and more real-time, but doing ETL into the warehouse is where Spark SQL and Spark Streaming can take things to a whole new level. And there are many, many SQL developers out there, and once your data is in the warehouse, SQL is still the best way for you to query it, so SQL is still very popular. What we are doing is that, at least from an IBM customer standpoint, we have provided connectors to DB2 and to Netezza, so that predicate pushdown can happen to DB2 and Netezza, and we are also supporting the SQL dialects of DB2 and Netezza in Spark SQL, so that people can seamlessly use it without worrying about where their data is. We had Matei on earlier, and because certain of our management structure was temporarily unavailable, we really geeked out with him. And he was telling us about the way they do SQL, which is, they don't have the storage layer, and so they don't have all the information about the data that makes a query engine go so fast.
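The predicate pushdown idea mentioned here can be sketched in a few lines. This is a toy illustration in plain Python, not IBM's connector code or the Spark API; all the names below are hypothetical stand-ins for what a real connector does through Spark's JDBC data source.

```python
# Toy illustration of predicate pushdown: instead of pulling every row out
# of the source system (DB2, Netezza) and filtering afterwards, the filter
# is handed to the source so only matching rows ever cross the wire.

ROWS = [
    {"id": 1, "region": "east", "amount": 250},
    {"id": 2, "region": "west", "amount": 900},
    {"id": 3, "region": "east", "amount": 75},
]

def scan_without_pushdown(predicate):
    """Naive approach: transfer the whole table, filter on the compute side."""
    transferred = list(ROWS)                 # everything crosses the wire
    return [r for r in transferred if predicate(r)], len(transferred)

def scan_with_pushdown(predicate):
    """Pushdown: the source applies the predicate before sending rows."""
    transferred = [r for r in ROWS if predicate(r)]  # filtered at the source
    return transferred, len(transferred)

pred = lambda r: r["region"] == "east"
naive_rows, moved_naive = scan_without_pushdown(pred)
pushed_rows, moved_pushed = scan_with_pushdown(pred)
assert naive_rows == pushed_rows       # same answer either way
assert moved_pushed < moved_naive      # but far less data moved
```

The answer is identical either way; the win is that the source system does the filtering, which is why supporting each database's SQL dialect matters.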
So, right, but he said something interesting, which was, they don't do the ETL to get stuff into the engine, so that whole process of loading the data is actually when they learn about the data. So if you add up all the time it takes to get stuff into a warehouse, if you time the whole thing, getting stuff in and doing the analysis, they're very competitive with the dedicated bring-the-stuff-in-first, then-ask-the-questions approach. And it sounds like, if you educate your customers about that with Netezza and DB2, it's like a just-in-time warehouse, but with all the sources. That sounds very powerful, and something that I don't think we understood before. That's true. The traditional way used to be that you have all this data, you first ETL it into a warehouse, and that itself, writing those ETL jobs and doing the transformation, so it used to be extract, then you transform, and then you load it into the warehouse. For some of our telecommunications customers, some of those ETL jobs used to take three weeks, four weeks. I mean, that's how the world was. And after that you've populated the warehouse, and then you start doing your querying and reporting and all the analysis. But these days, it's all about real time. So that whole ETL process, I wouldn't say it's eliminated, but it's become more like ELT: you extract, you load, and then the transformation happens based on what it is that you're trying to do. So it's schema on read as opposed to schema on write. And, you know, we just announced something called Quarks. I don't know if you've heard of that. This is something that we just open sourced, and I believe it was announced yesterday. It's a whole platform to help build end-to-end IoT applications. Now, when you look at what is happening in IoT, the real-time aspect of everything is super critical, right?
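The ELT and schema-on-read point can be made concrete with a small sketch. This is plain Python, not Spark: the raw events land as-is (extract and load), and structure is only imposed at query time, for just the fields a given question needs.

```python
import json

# Extract + load: raw events land untransformed, messy types and all.
raw_lines = [
    '{"device": "t-42", "speed_kph": 61, "ts": "2016-02-16T09:00:00"}',
    '{"device": "t-42", "speed_kph": "63", "ts": "2016-02-16T09:00:05"}',
]

def read_speeds(lines):
    """Schema on read: parsing and coercion happen per query, not up front."""
    for line in lines:
        rec = json.loads(line)           # the "read" imposes structure
        yield float(rec["speed_kph"])    # transform only what this query needs

speeds = list(read_speeds(raw_lines))
assert speeds == [61.0, 63.0]
```

Under schema-on-write, the second record's string-typed speed would have had to be cleaned during a load job before anyone could query anything; here the transformation is deferred to the query that actually cares about it.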
Like, I'm going to connect it to ETL. Say you have a heavy truck which is beyond its weight limit, and it's about to cross a bridge, and it's really windy, and you don't want that truck to be on the bridge because it may topple over. And now all of these vehicles are transmitting data. So when you figure out that this truck should not be on the bridge, if it takes you three weeks, or even an hour, to ETL that data, query it, and say, you know, you shouldn't really be on the bridge, it's too late. Yeah. That's a simple example, but that's where you want to be getting that data, doing the analysis in real time, and keeping that truck from even getting on the bridge. And this is where classical ETL is not going to do it. But that also requires, I think, what our CTO and co-founder calls the edge intelligence. So in that example, some of that weather data may be coming from an outside feed. Where is that intelligence running to tell the truck, don't get on that bridge? So definitely, with Quarks we have a framework, it's very lightweight, which is embedded at the edge node. And you're absolutely right that things have to be figured out there. There's no time to populate a warehouse and then do these kinds of analyses and figure that out. Right, okay. Yeah, so we've been paying a lot of attention to the whole edge computing piece, and there's even a step before that for IoT, which is you'd better have connectivity at the asset. If the windmill doesn't have internet, it really sort of defeats the purpose. I want to explore this notion we were talking about off camera of the analytics OS. What is the analytics OS? What is your team doing there?
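The truck-and-bridge scenario amounts to a tiny rule evaluated next to the sensor, with no round-trip to a central warehouse. The sketch below is conceptual plain Python, not the Quarks API, and the thresholds and field names are invented for illustration.

```python
# Edge-intelligence sketch: the decision runs on the edge node, against
# the live reading, before the truck ever reaches the bridge.

WEIGHT_LIMIT_KG = 36_000   # hypothetical bridge limit
WIND_LIMIT_KPH = 70        # hypothetical wind threshold

def should_block_bridge(reading):
    """Alert only when the truck is overweight AND the wind is dangerous."""
    return (reading["truck_weight_kg"] > WEIGHT_LIMIT_KG
            and reading["wind_kph"] > WIND_LIMIT_KPH)

stream = [
    {"truck_weight_kg": 30_000, "wind_kph": 80},  # windy, but truck is light
    {"truck_weight_kg": 40_000, "wind_kph": 85},  # overweight and windy: block
    {"truck_weight_kg": 40_000, "wind_kph": 20},  # overweight, calm day
]

alerts = [r for r in stream if should_block_bridge(r)]
assert alerts == [{"truck_weight_kg": 40_000, "wind_kph": 85}]
```

The point of the example is latency: a filter this cheap can run on the device itself, whereas routing the same reading through an hourly ETL job would deliver the warning after the truck is already on the bridge.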
Yeah, so, from what you asked, what are we doing with Spark? Other than the things that I mentioned, we really consider Spark to be a game-changing technology from an analytics standpoint, and we consider it to be the analytics operating system. If you look at all the different capabilities in the Spark platform, those are the foundational capabilities that an analytics application will need. There's all kinds of structured data that you may have to query, so you have Spark SQL for that. There could be unstructured or semi-structured data, so one can drop down to Spark Core and write your own logic to deal with that. There is data coming in really fast, where you need Spark Streaming. If you're building, say, predictive models, you can make use of MLlib for machine learning. And all of these components in the Spark stack work seamlessly together, right? So it's not that you have one silo to deal with SQL, another silo to deal with predictive analytics, and another silo product to deal with streaming. That used to be the world before: if you had to build an analytic application dealing with structured, semi-structured, unstructured, and streaming data, you would need half a dozen products to bring it all together, each with its own install and configuration, its own language, and its own set of application development tools. With Spark, that is not the case. So once the data is in, analytics is all about being able to ask all kinds of questions. You can ask what has happened, so you use Spark SQL. What is going to happen? So you build predictive models. What's the next set of best actions that you should take? When you can answer all those questions with everything on just one foundation, that's an analytics operating system.
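The "one foundation" claim is the key technical point: the descriptive question and the predictive question are answered in the same program, against the same data. The toy below stands in for that idea in plain Python; it is not Spark code, and the naive trend projection is only a stand-in for what an MLlib regression would do.

```python
# One program, one "foundation": a SQL-style aggregation ("what happened")
# followed by a naive one-variable forecast ("what will happen").

sales = [("mon", 100), ("tue", 120), ("wed", 140), ("thu", 160)]

# "What has happened": aggregate, as a Spark SQL query would.
total = sum(v for _, v in sales)
assert total == 520

# "What is going to happen": fit the average daily slope through the
# values and project one day forward, standing in for a model fit.
values = [v for _, v in sales]
slope = (values[-1] - values[0]) / (len(values) - 1)   # 20.0 per day
forecast = values[-1] + slope
assert forecast == 180.0
```

In the siloed world being described, those two questions would have lived in two separate products, each with its own copy of the data.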
And it's an interesting move for IBM, right? Because you had all these bespoke products, and now you say, well, we're going to invest. It's kind of like what we did with Linux. We're going to put our muscle behind this one. How did that decision get made? Was it just, oh, everybody said, oh yeah, great? I would expect some of the product managers who own those products were like, whoa, wait a minute. We should bring those product managers on theCUBE so they can answer those questions. No, but jokes aside, it required some leadership to say, okay, we're going to do this, and herd the cats, so to speak. Yeah, yeah. And obviously there are a lot of very good assets in those products. Over the last couple of years, we have been replatforming some of these core products and engines onto Spark. I think you may have heard this from others too, that we did that with our ETL engine. Is this DataWorks? This is DataWorks, where, given the tools that Spark provides for building these kinds of solutions, and the speed, and then Scala being the language in which Spark is implemented, we were able to replatform DataWorks in about one year, which is almost mission impossible without the power of what Spark provides. Same thing we have done with SPSS. Like last year, we released... So obviously, first what was happening was that all the capabilities that Spark provides, each of these products was building on its own, right? Now you have to do surgery on those products and, I'm maybe oversimplifying it, but you have to get rid of those layers and really replatform on Spark. But the capabilities that are there, you don't want to throw those away.
So with SPSS, what we've done, from a predictive analytics standpoint, is we are pushing down to Spark Core and combining the power of MLlib and SystemML, so that for predictive model execution, and this is all still early, we've seen speedups of three to six X, and now the sizes of the data are not in tens of terabytes, they are in hundreds of terabytes. Sorry, what was the performance improvement? You said... About three to six X, yeah. This is what we have seen just in our engagements in the last six to eight months. Three to six X performance, and not tens of terabytes, but... More like 100 to 300 terabytes is where we have seen customers start using it, and the data sizes are only going to grow, right? So what may be 300 today is going to get to 500 tomorrow. So take SAS, which has not been rewritten, but from what I understand has this sort of engine where SAS generates some sort of translation layer, then spits it out into Hadoop. What are their constraints? Do they have to go native to be able to match what you're doing with SPSS? So the architecture that SAS has had is similar to what Spark now provides, but Spark is open source. There's a huge community behind it, right? If you heard Matei, there are like 1,000 contributors on Spark. That is phenomenal. Some technology that is advancing at that rate, with smart people like Matei behind it, I would personally rather be on that than on something proprietary. And IBM, I mean, you guys have a lot to do with that number increasing. We're very short of time, but please. A quick question. You've got the analytics OS and the wonderfully unified engine in Spark, but there are some things the Spark guys just didn't address. Manageability, storage, you know, persistence.
What's IBM's approach for those missing bits? So what we do is we distribute. We have BigInsights, right? And that is where we have the Apache Hadoop stack. For the things where data has to be persisted, we are offering customers, at least on-prem, that they can do that on HDFS. They can use Ambari as a way to install and configure everything, and YARN from a resource management standpoint. That's what we are seeing the on-prem customers doing. And then, of course, on the cloud, almost every cloud provider is also providing some kind of object store. We have Spark as a service on the cloud, and IBM now, with the acquisition of Cleversafe, is offering both HDFS as well as Swift. We are also decoupling compute and storage, right? That's really the trend, so that customers can pick: what's their compute engine going to be? What's their storage going to be? Which resource manager do they want to use? In terms of install, configuration, and manageability, we offer both Ambari as well as Platform Symphony. So those are the choices that we are giving to customers. Excellent. All right, so we'll give you the last word, Anjul. Then we've got to jump, but what should we be paying attention to in 2016 from IBM? What are some of the milestones that you want to hit? What should observers be watching? So when you look at this whole analytics space, there are obviously the core technologies like Spark making a lot of things much faster, so that you can be making decisions in real time. But the skills gap is still huge, right? Every customer that you talk to.
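The compute/storage decoupling being described boils down to writing compute against a storage interface rather than a specific backend. The sketch below is plain Python with in-memory stand-ins, not real HDFS or Swift clients; the class and method names are invented for illustration.

```python
# Decoupled compute and storage: the same compute function runs unchanged
# against two interchangeable backends, the way one Spark job can read
# from HDFS on-prem or an object store like Swift in the cloud.

class HDFSLikeStore:
    """Stand-in for a filesystem-style backend."""
    def __init__(self, files): self._files = files
    def read(self, path): return self._files[path]

class ObjectStoreLike:
    """Stand-in for an object-store backend keyed by object name."""
    def __init__(self, objects): self._objects = objects
    def read(self, key): return self._objects[key]

def word_count(store, location):
    """Compute is written once, against the read() interface only."""
    return len(store.read(location).split())

hdfs = HDFSLikeStore({"/data/log.txt": "error warn error"})
swift = ObjectStoreLike({"logs/log.txt": "error warn error"})
assert word_count(hdfs, "/data/log.txt") == 3
assert word_count(swift, "logs/log.txt") == 3
```

Because the compute code depends only on the interface, swapping the storage backend (or the resource manager underneath) is a deployment choice rather than a rewrite, which is the choice being offered to customers here.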
And I think it was in the keynote as well today that people are still finding it hard to get their arms around all of these technologies. They are very powerful, but there is complexity. And for that, the tooling that is needed so that you can really build these applications end to end is going to be very key. All the complexity has to be hidden. So we are focused on that. You can think of it as having, for the right persona, the right kind of tools or workbench available, so that they can build these applications in a much faster way. Data scientists and a lot of programmers are looking at notebooks. We saw a really good demonstration today in the keynote of how notebooks are being leveraged. But then there's the line of business, who may not go the notebook route. They need a way so that the discovery of what the data is, what the attributes of the data are, and how you get from that to insights happens almost by writing no code, right? That's important, so that the line of business has full control, that they can make the decisions. They want to be making decisions in real time on fresh data. That is very important. And to be able to do that, they have to get the tools in their own hands. So that's what we are focused on. If that happens in 2016, you guys are going to make a lot of money. There it is. All right, Anjul, thanks very much for coming on theCUBE. Great to see you again. Thank you so much. Keep right there, everybody. We'll be back with George's presentation coming up. He's going to release the industry's first ever Spark forecast. All right, keep right there. We're right back.