Live from Union Square in the heart of San Francisco, it's theCUBE, covering Spark Summit 2016, brought to you by Databricks and IBM. Now, here are your hosts, John Walls and George Gilbert.

Well, welcome back to San Francisco, along with George Gilbert, who is a senior analyst at Wikibon, in theCUBE. I'm John Walls. We're here on the Expo floor at Spark Summit 2016 in the Hilton in the heart of downtown San Francisco, joined by 3,500 attendees, a record crowd, and looking forward to talking with Dinesh Nirmal, who's the Vice President of Big Data, Next Gen Platforms and Analytics at IBM Analytics. Dinesh, thanks for joining us here on theCUBE. Good to see you.

Thank you, John.

Yeah, first off, just let me get your take on this, because once the doors opened today, it was like this madhouse rush, and the activity, the enthusiasm, hasn't subsided all day. What do you make of the show?

Amazing, I mean, amazing. Last year, you look at the keynote, and I feel like there's double the number of folks in the keynote this year, which just shows you the presence and the enthusiasm, the energy. It's just crazy.

So what does that say to you? You've been looking at Spark and working with it for a few years now. What does that say to you about the interest that's being generated, and not just casual interest, but obviously very serious commitment?

It tells me about adoption, right? Last year, when we invested in our Spark Technology Center and put 50-plus developers into it, we knew this was going to start taking off. And this year is a testament that it is taking off. There's more adoption, and our enterprise customers are seeing it. Just to give you one example, last year we hosted one of the biggest retailers in the world at our STC, and it was one terabyte of data. This year, they came back, and it's 17 terabytes of data that they are running on Spark.
So it's just amazing, and we feel that customers are really picking up on it. It's a scalable platform, it's open source, and you get a rich set of libraries in there. All of those things are attracting customers to come and run it. For analytics execution, this is the one. This is the platform.

So what is it? Let's talk about machine learning, because that seems like a huge growth opportunity, for sure, and an area where a lot of emerging technology is being applied. You all want to be, as I understand it, like the Swiss here, basically, right? You're totally open source, totally committed. You're making these voluntary contributions of your technology, trying to create this broad, and that's not even the right word, community. But you're sharing, you know, all the secret sauce.

Really good question. So let me take a step back. What is machine learning? To me, machine learning is: how do you train a model to get a predictive score, or a score that's favorable? That is machine learning. So we looked at it and said, okay, how can we make sure, for example, let's take R. Every single customer shop that I've gone to is an R shop. It is getting adopted at a very accelerated pace. That's why, I don't know if you read the announcement today, IBM is going to invest in R, and I'm going to be a board member at the R Consortium, because we see that adoption happening big time. So System ML is part of it. We contributed the code to the community because we can give a rich set of libraries that developers can use to build models and do machine learning.

Let's unpack that a little more.
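The working definition given here, train a model, then use it to produce a score, can be shown in miniature. This is a plain-Python sketch with made-up numbers, standing in for what R or Spark's machine learning libraries do at scale; the data and the closed-form least-squares fit are just for illustration.

```python
# Train a tiny model (one-feature least squares), then score with it.
# A stand-in for what R's lm() or Spark ML does on real data.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# "Training": fit the model on labeled examples (here, points on y = 2x + 1).
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

# "Scoring": apply the trained model to new input.
print(slope * 10 + intercept)  # → 21.0
```

The same train-then-score shape holds whether the model is a one-line regression or a deep pipeline running on a cluster.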
Now, machine learning, our viewers are familiar with the concept, but tell us: we know that Spark has Spark ML for the pipeline and MLlib for the machine learning libraries. Help us rationalize where System ML comes in, and then the other machine learning libraries that you have from different sources. What does that broad choice start to look like in a coherent way?

Right. So if you are a developer today, you can use Java, you can use R, you can use Python, Scala, any of those languages. And then you can use Spark ML or System ML, R libraries, all of those things, to build your models. We really believe machine learning, if you compare it to a baseball game, is probably in the first or second inning. There's a ways to go. So what we envision is: let's make it flexible for the developer. You want to develop in Python, Scala, or any language, we want to give you the choice. We don't want to silo you into one thing. You want to use Spark ML, you want to use System ML, you want to use R, you can do it. The way we envision it is, let's give the developer the choice of what language they want to use.

So if we were to draw a sort of table, you would have the choice of libraries on one dimension and the choice of pipelines on the other dimension. And just to differentiate: Spark ML assumes that the pipeline essentially has one model, that's a rough approximation, so either the problem you're trying to solve fits that model or it doesn't. Whereas the System ML that you've contributed can put a bunch of models into play, so if one doesn't get it completely right, another can finish getting it right, or a bunch of them can.

Right, so you get a set of rich libraries through all of this, whether it's System ML or any of these.
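The one-model-per-pipeline picture described above can be made concrete in a few lines. This is a plain-Python illustration of the pipeline shape, not actual Spark ML code; all class names and the trivial "mean model" are invented for the example.

```python
# Sketch of the pipeline idea: a chain of transformer stages that
# reshape the data, with a single model fit at the end - the rough
# "one model per pipeline" approximation from the conversation.

class Standardize:
    """Transformer stage: shift each feature to zero mean."""
    def fit(self, rows):
        n = len(rows)
        self.means = [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]
        return self
    def transform(self, rows):
        return [[x - m for x, m in zip(r, self.means)] for r in rows]

class MeanModel:
    """Model stage: predict the mean label (a stand-in for a real estimator)."""
    def fit(self, rows, labels):
        self.prediction = sum(labels) / len(labels)
        return self
    def predict(self, rows):
        return [self.prediction for _ in rows]

class Pipeline:
    """Run every transformer, then fit exactly one model on the result."""
    def __init__(self, stages, model):
        self.stages, self.model = stages, model
    def fit(self, rows, labels):
        for s in self.stages:
            rows = s.fit(rows).transform(rows)
        self.model.fit(rows, labels)
        return self
    def predict(self, rows):
        for s in self.stages:
            rows = s.transform(rows)
        return self.model.predict(rows)

pipe = Pipeline([Standardize()], MeanModel())
pipe.fit([[1.0], [2.0], [3.0]], [10.0, 20.0, 30.0])
print(pipe.predict([[2.0]]))  # → [20.0]
```

Swapping in a library that fields several models behind one interface, which is the contrast drawn with System ML here, changes the final stage but keeps this same fit/transform/predict shape.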
Our goal is this: if you're a statistician, what do you want to do? You want to do linear regression. You pick a library and say, okay, this is what I'm going to use, this is the language I'm going to use. Some of them want to code, some of them want a UI. So what we at IBM want to do is give the end user, the data scientist, a multitude of choices they can pick from, rather than saying, you want to develop, use Python, that's all we're going to do; use Scala, that's all we're going to do.

This may be arcane, but since we're talking about unification, does it matter if the Databricks folks or the Spark community creates a notebook that knows how to deal with the Spark-native pipeline, Spark ML? Where would System ML fit in that?

System ML, you can use it. We have contributed it back into the community, so it's available. If you look at our Spark Technology Center, the two areas we are focusing on are Spark SQL, because we have DB2 and we come with a good set of expertise in that area, and the second piece is ML. So Nick Pentreath, who we hired, who's one of our committers, is solely focused on the ML piece. What we are hoping is that as this area grows and explodes, people come to us for machine learning or Spark SQL expertise. IBM needs to be known for saying, okay, machine learning, here is the expertise. But then we need to look at it as a community and say, we are contributing it back. So we are heavily contributing, and Spark SQL was one of our biggest contributions last year.

So the SQL contributions you're making, how do those fold into the maturation of Spark SQL? Is it better coverage, so you can run more of the benchmarks, or is it improving the optimizer so things run really fast?

Exactly. So there are areas in the optimizer, right?
There's work that needs to be done. For example, the locking mechanism: how do you make sure it can lock once somebody's reading or writing into it, right? Those kinds of things.

I didn't know that was even one of the use cases. I thought the interface between Spark and databases was mostly read-only.

Right, it's read-only, but some customers, for example, have relational databases they want to lock, in that specific scenario or case. But mostly it's read-only, right?

This might sound arcane, but it's important, because right now it's sort of a business intelligence usage scenario. But if you're getting into locking, you're starting to get into operational intelligence, where you're going to do some amount of transacting with a little bit of analysis. Is that the direction?

Right, but in this particular case we are only talking about read-only. So that's where the optimizer comes into play; that's one. The other area is that we are running the TPC-DS benchmark to see how the queries convert, how they come back, what the performance times are. So there are contributions to be made there. Those are the areas we are focusing on from a Spark SQL and ML perspective.

And one last piece I do want to cover is our platform. When we started talking about how we at IBM build the platform, there are three things we looked at. One, how do we build a scalable, open-source-based architecture? We decided Spark is the one we want to go with. The second piece is, how do we differentiate? Because today, if you look at Qualcomm, it's there. So we said, okay, we need to bring the integration piece together. So if you're ingesting data, like the Weather Channel data, they can ingest about five gigabytes a second.
How do we ingest the data, shape the data, transform the data, cleanse the data, all those pieces? And then how do we do analytics on it: build the model, train the model, that's where the machine learning piece comes in, and then eventually visualize it. That whole piece, what you'd call the pipeline, or the whole platform, has to be well stitched together. That's what we're doing. And the last piece, when I talk about differentiation: today you go to AWS or SoftLayer or Azure, and you have to bring the data with you to do analytics. What we want to do is bring analytics to where your data is.

So you don't have to transfer the data, they just...

Exactly. You go to it. So I say, we'll bring the doctor to where the patient is, you know?

A house call.

Yeah, a house call. We bring analytics to where the data is, and that clearly differentiates us. Because there are customers who are probably on AWS, heavily invested in S3, for example, who might not move the data. Or there are customers on Bluemix or SoftLayer who don't want to move the data. How do we serve those customers? That's what the platform brings: three things. Built on open source to scale, integrated, and finally, differentiated by bringing the analytics to where the data is.

If I can add one thing to see if I can sum this up: we talked yesterday to the guy who manages all the data services.

Yeah.

He came in as part of Cloud, and now he runs all the...

Oh, Derek.

Derek Schoettle, yes. So he said the key is that now they're trying to essentially orchestrate all the data platforms into a fabric, and then the analytic feeds and data orchestrate across those.
Now, it sounds like you're investigating and preparing to go one level above that, which is modeling and training and deploying things that either work offline or learn in real time. That's the next layer up.

Right. So this morning I spent some time with a customer who wanted to do fraud detection. They want to train a model, but over time the data degrades, right? So how can we take the real-time scoring data and feed it back, so that the model keeps learning, in real time?

We were just talking about that before you came on, with one of the product managers at Databricks. Exactly that fraud app.

Right, exactly. Those are the kinds of things that will create the buzz on top of Spark: how to score in real time and feed the result back, the feedback loop, so that the model keeps learning. Even when the score is low, it learns. We are actually working with a banking customer in Europe to do exactly that, and by the September timeframe our plan is to get it up and running. This particular customer loses about 70 million euros a year to fraud. So think about it, that's just one customer. So I think there's tremendous potential, and running it on Spark gives us a platform, an execution piece, to run it on.

Like the notion we talked about a few moments ago: doctors don't make house calls anymore, but IBM Analytics will. So good luck with that. You have a lot of houses, a lot of doors to knock on, which is a good thing.

I know. Thanks for joining us.

Well, thank you, John. It was a pleasure.

Thank you. theCUBE's coverage here at Spark Summit 2016 continues in just a bit.
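The feedback loop described for the fraud use case, score each transaction in real time, then feed the confirmed label back so the model keeps learning as the data drifts, can be sketched in miniature. A plain-Python online logistic regression stands in here for a Spark streaming job; the class, the feature values, and the learning rate are all invented for the example.

```python
# Sketch of a real-time score-and-feedback loop: each event is scored
# first, and once its true label arrives, the model updates itself.
import math

class OnlineScorer:
    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features  # one weight per feature
        self.b = 0.0                 # bias term
        self.lr = lr                 # learning rate

    def score(self, x):
        """Fraud probability for one transaction's feature vector."""
        z = self.b + sum(w * xi for w, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def feedback(self, x, label):
        """The feedback loop: one gradient step once the label is known."""
        err = self.score(x) - label  # gradient of the log-loss
        self.w = [w - self.lr * err * xi for w, xi in zip(self.w, x)]
        self.b -= self.lr * err

model = OnlineScorer(n_features=2)
# Simulated stream of (features, confirmed-fraud?) events.
stream = [([5.0, 1.0], 1), ([0.1, 0.0], 0)] * 200
for x, label in stream:
    s = model.score(x)        # real-time score goes out...
    model.feedback(x, label)  # ...and the confirmed label feeds back

print(round(model.score([5.0, 1.0]), 3),
      round(model.score([0.1, 0.0]), 3))
```

The point is the loop structure, not the model: because every scored event eventually returns as a training example, the weights track the data even as its distribution shifts, which is the "model keeps learning itself" behavior described in the conversation.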