From the noise, it's theCUBE covering Spark Summit East. Brought to you by Spark Summit. Now your hosts, Jeff Frick and George Gilbert. Hey, welcome back everybody. Jeff Frick here with theCUBE. We are live in Midtown Manhattan at the Spark Summit East show. I mean, 2,000, 3,000 people running around talking Spark, talking big data. It's early days for this conference, so it's a really interactive, engaging group and it's a really fun time to be part of it. We're happy to be here. We're joined by our next guest, Dirk DeRus, who leads Data Lake product strategy at IBM. Welcome, Dirk. Thank you for having me. Absolutely. So what do you think of the vibe of the show? Have you been to this before? I have. It's definitely grown. One thing that strikes me about this conference, compared to a lot of the other conferences I go to, and I know you go to many of them as well, is how technical it is. A traditional IBM conference is very much focused on the more business-oriented kind of user, or the CTO, that level of conversation. Whereas when you come here, it's different. Not that it's bad or good, it is what it is. Here you have people that are developers figuring out Scala, or your statistician type of people, what people call data scientists today, that are very much into programming R or even statistical Python work, or people that are just thinking about how do I distribute workload more efficiently than in old MapReduce, for instance. And so a lot of those business stories that we see at some other conferences, you don't see those as much here, because the audience is obviously very different. Yeah, it's early days and it's really a lot of sharing of information. Not only is the technology in the weeds, but the people are executing at that level, and I think we had a prior guest say, you know, you don't often see lines of code in a keynote presentation. No, especially the lead-off presentation. Here you go.
Well, so tell us, in our research, we talk about systems of intelligence, and we talk about it as a journey, not as a destination. And that sort of first stage is doing something better, faster, cheaper than you used to do. The data lake, as opposed to the data warehouse. And it's not the exact same thing, it's sort of the data warehouse, plus-plus, minus-minus. You add structure later so you can do more iterative analysis, you know, project the compute onto the data, a whole bunch of things. How have you seen customers use that as a jumping-off point for their journey, say, those who started several years ago versus those starting now? Yeah, for a lot of the customers that I've worked with, it's been a frustrating journey. So when we started talking about Hadoop, you know, five, six, seven years ago, we've continued to have those kinds of conversations about how to scale out, both in volume of data, and then parallelize work and do all these magical things and have these insights come flying at you. That hasn't happened for nearly anyone. And a lot of that... For the original promise. Yeah, yeah. And even now, I mean, there are success stories, but by and large, they're fairly limited. When I consider success, it comes back to the point I was making earlier about that business user that's comfortable with Excel spreadsheets, or a Tableau or Cognos type of reporting tool. How are they able to take advantage of that scale, right? Most of those people are still at the point where they're doing a lot of things manually. And when it comes to the scale conversations, if I'm taking data from social or data outside my warehouse and integrating that into a warehouse, or even doing a full-history analysis where you're looking at multiple years of data, that's still not happening for that business user.
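The "add structure later" idea being described, often called schema-on-read, can be sketched in a few lines of plain Python. The field names here are purely illustrative, not from any particular product:

```python
import json

# Raw events land in the lake as-is; no schema is enforced on write.
raw_events = [
    '{"user": "alice", "amount": 12.5, "channel": "web"}',
    '{"user": "bob", "amount": 7.0}',  # records may be missing fields
]

def project(event_json, fields):
    """Apply a schema at read time: keep only the fields this
    analysis needs, tolerating records that lack some of them."""
    record = json.loads(event_json)
    return {f: record.get(f) for f in fields}

# Two analyses can project different schemas onto the same raw data,
# which is what enables the iterative, exploratory style mentioned above.
spend_view = [project(e, ["user", "amount"]) for e in raw_events]
channel_view = [project(e, ["user", "channel"]) for e in raw_events]
```

The point of the sketch is that the warehouse's schema decision moves from write time to read time, so each new question can impose its own structure without re-ingesting the data.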
So some of the original promises that have been, you know, talked about on a lot of PowerPoint slides and such, that still hasn't quite come to fruition. And that's why, for me, Spark is interesting. Coming here, as a technical person, I appreciate the keynote that Matei gave yesterday, you know, what is Spark 2.0, all that kind of stuff, and some of the new things coming. But from the perspective of a business user, I see Spark as plumbing. That's all it is. But it's special plumbing, because eventually, once the rest of the software providers, us included, have figured out how to take advantage of that plumbing, we can do some pretty incredible things for the business user. So when I mentioned two-speed IT, it's the notion of being able to take, you know, many different kinds of data, shop for it, and then build data products in a more iterative way. That's where I see this ultimately going, this notion of users who search and shop. They might do some limited modeling or even playing around. And none of those users should even need to know what Spark is. So where do you think the disconnect came between the PowerPoint slides promising the insight jumping out of the lake and the reality? Was it an overhyped vision? Were the tools just not ready? Were the expectations wrong? And are we finally going to get to the point, with Spark, where we bring the vision and reality together? Yeah, I think the disconnect largely comes, I mean, it's like any hype cycle, where people get very excited about something that is now possible that wasn't before. And so people maybe get ahead of themselves a little bit, thinking, well, if I can suddenly do this with my transactional data, that kind of thing, then we could do all kinds of things.
But then the questions that often get ignored in those early discussions are about governance, really. The various aspects of governance, especially when it comes to regulatory compliance. We deal with these companies all the time, your banks, airlines, retailers, and so on, at a national or international scope. And they have very serious requirements to satisfy the regulators, privacy requirements and all the rest of it. And the fact is that a lot of these projects come out of open source, or are even emerging projects from non-open-source companies, and that governance philosophy isn't baked in at the heart. To be clear about governance, it's not just the lineage of the data, it's every transformation that happened on the data and the code that did it, so you can verify or audit it. That's right, and you can audit it. So lineage is one element, security is one element. So imagine a problem where you have, like me, the customer: I go to a bank and I have my normal transactions and all that sort of thing, but I might also have a mortgage with the bank. I might also have some kind of other social engagement. I might call their call center. So information about me, the consumer, is spread all over the different warehouses in that bank. And there are some regulatory requirements that allow people access to a lot of those individual pieces, but do not allow them to synthesize those pieces to build a full composite picture, for various privacy reasons. And technically that's a really, really hard problem to solve. When you're looking at the kind of plumbing conversations this conference is focused on, the Spark engine, MLlib, and all that sort of thing, those engines, they're not about governance. They're about how to get work done. Yes, Spark doesn't help with that. No.
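The audit requirement described here, recording every transformation along with the code that performed it, can be sketched minimally. This is a toy illustration of the idea, not any particular governance product; all names are invented:

```python
import hashlib

lineage_log = []  # in a real system this would be a tamper-evident store

def tracked(transform):
    """Wrap a transformation so each application is logged with a
    fingerprint of the code that ran, for later auditing."""
    # Hash the compiled bytecode to identify the exact code version.
    code_hash = hashlib.sha256(transform.__code__.co_code).hexdigest()[:12]

    def wrapper(rows):
        result = transform(rows)
        lineage_log.append({
            "step": transform.__name__,
            "code_hash": code_hash,
            "rows_in": len(rows),
            "rows_out": len(result),
        })
        return result
    return wrapper

@tracked
def drop_negative_amounts(rows):
    """Example transformation: filter out invalid transactions."""
    return [r for r in rows if r["amount"] >= 0]

rows = [{"amount": 10}, {"amount": -3}, {"amount": 5}]
clean = drop_negative_amounts(rows)
```

An auditor reading `lineage_log` can see which step ran, which version of the code ran it, and how the row counts changed, which is the lineage-plus-audit story the conversation is pointing at.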
So tell us, perhaps, the scope of the problems. What did customers who were successful with their early data lake journeys do? How did they approach the problem? Yeah, so where we have seen success stories is where there was a very specific problem. Maybe it relates to scale, maybe it relates to one set of data, or maybe two sets of data that are merged together, and people try to do some work at scale. And when there's a business problem tied to that, that's where we've seen success, because then the investment is worthwhile, and some of the engineering pain that's required to do things that are new, or to learn new languages like Scala, for instance, or Java MapReduce, actually makes sense. There's a return, and people can actually see how this fits into their greater business journey. And so that's been successful. But in terms of what are the keys for greater success, what's going to make this stuff take off and actually deliver some of that transformative change that's been promised? Well, that starts and ends with a governance conversation. For instance, can the constituents of the data, the people who require reports being generated and questions being answered, be audited in a way that makes sense? Are they able to look at that lineage story that you just mentioned? And when they're finding the ROI, where are some of the places they're finding it? We just hear repeatedly, over and over, the anti-churn message, right? It's like, all right already. And the false positives. It's actually the false positives that piss people off on the credit card transaction, because I'm trying to buy something and you're not letting me do it, because you think it's a foreign or a fraudulent charge.
What are some of the other places, when you talk about ROI, that people can find easy returns on these investments? Yeah, so the examples you gave are out there. One we often come back to is the wind farm company that we work with in Denmark, Vestas. They had a problem where they have all these wind farms, and it costs a ton of money to install an individual windmill. And usually those windmills are installed in clusters, optimizing for wind patterns and all that sort of thing. And they have millions of sensors around the world, so it's almost an early IoT kind of use case. But they need to accurately place the wind farms in optimal spots, because again, you make that decision once, right? And then for the life of the windmill, which is about 20 to 25 years, if you place it even, you know, five meters off, that has millions of dollars of impact in potentially lost income from the electricity that would have been generated. And so what Vestas was able to do with the Hadoop software that we sold them and helped them with is basically grid down from, you know, a square-kilometer basis to about 10 to 20 square meters, right? So they could have a far more precise indication of where their windmills need to go. On the placement of the windmill. Exactly. Before you even get into optimizing the thing once it's up and running. Precisely. And that optimization, that's another question, like how do we do preventive maintenance on these windmills? I mean, I remember as a kid driving over the hills going to California, and you see acres and acres of windmills, and half of them aren't running. Aren't turning, yeah. Exactly. The other example that comes up sometimes is when you can start making economic decisions around, you know, do you over-throttle the thing?
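The grid refinement described above, moving from square-kilometer cells to cells tens of meters across, is at heart a spatial binning problem over sensor readings. A toy sketch, with invented coordinates and cell sizes (not Vestas's actual method):

```python
from collections import defaultdict

def grid_cell(x_m, y_m, cell_size_m):
    """Map a position in meters to the (col, row) index of its grid cell."""
    return (int(x_m // cell_size_m), int(y_m // cell_size_m))

def mean_wind_by_cell(readings, cell_size_m):
    """Average wind speed per grid cell. A smaller cell_size_m gives a
    finer-grained picture of where a turbine should be placed."""
    sums = defaultdict(lambda: [0.0, 0])  # cell -> [sum, count]
    for x, y, wind in readings:
        key = grid_cell(x, y, cell_size_m)
        sums[key][0] += wind
        sums[key][1] += 1
    return {cell: total / n for cell, (total, n) in sums.items()}

# Three sensor readings: (x meters, y meters, wind speed m/s)
readings = [(5, 5, 8.0), (15, 5, 9.0), (995, 5, 4.0)]
coarse = mean_wind_by_cell(readings, cell_size_m=1000)  # ~1 km cells
fine = mean_wind_by_cell(readings, cell_size_m=10)      # ~10 m cells
```

At kilometer resolution the three readings collapse into one averaged cell; at 10-meter resolution the strong and weak spots separate, which is the "more precise indication of where the windmills need to go" in the story.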
Because right now the price of energy is high, even though it's going to impact your maintenance schedule. You can now make an economic decision, versus a purely conservative decision of, no, don't overrun it, it's going to cause us to have to bring the thing down earlier. So, and you said they have millions of sensors? Yeah, around the world. They track barometric pressure, wind speed, wind direction, how many gusts of wind there are, because wind can be too intense for a windmill, to the point where it fails, effectively. So there are all these factors that go into the placement. And again, the more specifically you can grid that area down, the more significant the return. So that's a scale problem, because of all those sensors. That data coming in, A, storing it, and B, being able to analyze it and calculate things, that's a big data problem. So then, two customer journey questions. The ones who do have this successful experience, what's their next stop on the journey? And the ones who stall at the data lake, where do they go from there? Yeah, so as I mentioned before, it all stands or falls with, you know, what is the business problem that I'm trying to solve? I've gone into dozens, probably more, customer engagements where there's a team that's very excited about the next hyped technology. There's a bank, actually, just two days ago, I met with, where they want to use Spark for web applications, and they want to figure out, how do I fit the Spark kernel, the Spark engine, inside a web servlet? That doesn't make sense, right? Just for kicks, they want to try it out somewhere. They see Spark as multi-purpose, they want to take advantage of it in memory, but that's not the point. The point is distributed workload at larger scale, right?
And so this notion of jumping on the latest hype bandwagon is not always fruitful. So, step one is recognizing what's possible now that wasn't before with whatever the technology is, like Spark, for instance. And there are lots of things that are possible now with Spark that weren't before. Especially when you look at statistical processing, being able to run those kinds of algorithms, having the data held in memory and being able to distribute that, that's enormous. And it almost hasn't been reported on enough, I find, the potential that that has. And then, what are the problems that are solved by this thing that weren't solvable before? Right, and then what are the solutions, right? And IBM's made a huge investment in Spark, and continues to make a huge investment in Spark. So are you going into that bank and saying, that's an interesting use case, maybe not the best, but, oh, by the way, here's my portfolio of banking solutions that have integrated Spark or leverage Spark for a better ROI? Yeah, and so that, I mean, your response, and the question that you asked, how do people overcome this kind of miss, this unsatisfied hype, is kind of where the conversation with the bank went. We asked, so why do you want to use Spark for these web applications? And their response was, well, we're kind of tied to IT, we want to be free of that, back to the two-speed thing I mentioned initially. And we said, well, have you considered cloud?
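The distributed statistical processing being praised here follows one basic pattern: each partition of the data computes a small partial result in memory, and the partials are combined with an associative merge. The sketch below shows that pattern in plain Python for mean and variance; it is illustrative only, not Spark's actual API:

```python
def partition_stats(values):
    """Per-partition partial result: (count, sum, sum of squares).
    This is all a worker needs to send back, however big its slice is."""
    n = len(values)
    s = float(sum(values))
    sq = float(sum(v * v for v in values))
    return (n, s, sq)

def merge(a, b):
    """Combine two partials. Associative and commutative, so a cluster
    can merge results from partitions in any order."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_and_variance(partitions):
    n, s, sq = (0, 0.0, 0.0)
    for part in partitions:  # on a cluster, this loop runs in parallel
        n, s, sq = merge((n, s, sq), partition_stats(part))
    mean = s / n
    variance = sq / n - mean * mean
    return mean, variance

# Data split across three "partitions", as a cluster would hold it.
parts = [[1.0, 2.0], [3.0, 4.0], [5.0]]
mean, var = mean_and_variance(parts)
```

Because only the tiny `(count, sum, sum of squares)` triple crosses the network, the statistic scales with the number of partitions rather than the size of the data, which is the in-memory, distributed advantage the speaker is pointing at.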
I mean, we, and others as well, have Spark offerings running on the cloud, where you can use this technology in a setting that makes a lot more sense. You don't have to embed it in a servlet, where it's just running and wasting a lot of resource, trying to parallelize things that can't be parallelized because you're running in this tiny little container. Running it on our Bluemix cloud, for example, and actually having a meaningful data set behind it that can scale out, well, that might be a viable option. And with zero IT involvement, it's really all about having the developers work on that. So when you look at the core requirements, sometimes these things flesh themselves out quite nicely. Yeah, well, I was going to ask for the last word, but I don't know that you could do much better than that. So thanks for stopping by. Are you going to be at InterConnect? Oh no, we already had Scarlet. No, I'm not at InterConnect. You're off to Singapore. Yeah, a much longer flight than Vegas from here, but safe travels, and thanks for stopping by theCUBE. Thank you for having me. Absolutely. I'm Jeff Frick. We are live in Midtown Manhattan at Spark Summit East. Thanks for watching. We'll be back with our next segment after this short break. Thanks. All right.