Broadcasting from Galvanize, San Francisco, extracting the signal from the noise. It's theCUBE, covering the Apache Spark community event, brought to you by IBM. Now your hosts, John Furrier and George Gilbert.

Hey, welcome back, everybody. Jeff Frick here with theCUBE, with George Gilbert. We are live in San Francisco at the IBM-sponsored Apache Spark community event, running concurrent with Spark Summit, which is running right across the street. We're at Galvanize, which is doing a lot of education, a lot of work to increase the population of data scientists, which we know is a big gap in the needs out there, so good work for them. And we're excited for this next segment to be joined by Robert Parkin, Principal Scientist, IBM Commerce. Welcome.

Hi, thank you.

So, as we talked about a little bit off-camera, you're the guy that's building the motor underneath the covers.

One of them. One of several, actually, one of a large team.

Excellent. So talk a little bit about what you're doing, and talk about how this whole Spark thing is really a game changer.

Sure. So IBM Commerce really has three different parts. There's a part that does marketing, a part that does merchandising, and a part that does B2B commerce. And what my team does is really data science. So we're going under the covers of each of those applications, and we're adding analytics, both predictive and optimization analytics, to try to improve users' interactions as well as the decisions that get made inside those applications.

So there's a lot of conversation about iterative and interactive analytics, really enabling more of this kind of back and forth for people to explore hypotheses, get some feedback, dive down. Talk about that, and how that's so much different and more powerful than the way things were done in the past.

Well, a lot of things are moving more and more towards real time. What we've seen in the past is a lot of batch jobs, particularly in Hadoop infrastructures. And that's great if you're running over very large data sets and you can wait 20 minutes to three hours for a job to come back. But more and more, real-time information is becoming important. And that's really what Spark is designed to do. It's doing for memory what Hadoop did for disk. It's allowing you to have access to that large scale of analytics, but in a much more real-time fashion. So you're able to do things with the data when the person actually needs to make a decision, rather than having to wait.

So tie it back. Elaborate, if you can, on the marketing apps, or perhaps the marketing frameworks, not completely packaged, that IBM might provide, plus merchandising and B2B commerce. And now, with Spark as the more real-time engine, how does it change them?

Sure. So I'll talk a little bit about what we're doing in the merchandising space. We have a number of tools that we've developed in the past around pricing optimization, promotion optimization, and markdown optimization. What Spark is enabling us to do is really bring down the run times for the mathematical modeling that goes on behind the scenes. The way those applications work is that we get data in from the customer. It's transaction log data, so, line by line, what you'd see on a receipt coming out of a grocery store. And we take that data and we're creating predictive models for promotion activities, for what people will do at different price points.
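To make that concrete, here is a minimal sketch of the kind of price-response modeling described above, using PySpark MLlib over transaction-log data. The file path, column names, and the simple pooled log-log regression are illustrative assumptions, not the actual IBM Commerce pipeline.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("price-response-sketch").getOrCreate()

    # Hypothetical transaction log: one row per receipt line, with the
    # product bought, the unit price paid, and the quantity.
    tlog = spark.read.csv("hdfs:///retail/tlog.csv", header=True, inferSchema=True)

    # Aggregate to (product, price) observations and take logs, so the
    # fitted price coefficient can be read as an elasticity estimate.
    obs = (tlog.groupBy("product_id", "price")
               .agg(F.sum("quantity").alias("units"))
               .withColumn("log_price", F.log("price"))
               .withColumn("log_units", F.log("units")))

    assembler = VectorAssembler(inputCols=["log_price"], outputCol="features")
    model = LinearRegression(featuresCol="features", labelCol="log_units") \
        .fit(assembler.transform(obs))

    print("estimated price elasticity:", model.coefficients[0])

Running the whole fit in memory, rather than as a disk-bound batch job, is what makes the re-modeling cycle times he describes next possible.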
And then we're putting those into the applications and allowing the retailers to go through and create, essentially, goals for each category of products, and then choose the right set of prices, the right set of marketing and merchandising, based upon that goal and based upon what the consumers themselves actually want.

But when you talk about the time compression, you're talking about from what to what?

So with the merchandising applications, we're seeing 30 to 40x performance increases.

It's huge.

And it's because everything's happening in memory now, rather than back down on disk.

And then how does that actually get translated in the business use case when they can operate at that speed? Is it increasing the frequency with which they fine-tune their programs? Is it increasing the number of programs that they run? How is that actually realized for the key value delivery at the front end?

Sure. So number one, we can model things more frequently, so the predictive models that you use are going to be more up to date with the latest set of data. It also opens up a whole wide range of things that we can do with data exploration. Before, you used to have to have developers writing very complex queries just to get simple information out of a lot of these data stores. Now you can load all of that into Hadoop as a data store, put Spark on top of it, and, using Spark SQL and other technologies that are packaged with Spark, you can directly query the data.

Okay, so the 30 to 40x, whenever you see an order of magnitude or more, it means it's not just a quantitative change, there's a qualitative change.

Sure.

And it's always people, process, technology. So we've got the technical change down. What processes change, and then how do the people have to reorganize around that?

Sure. So data science, from its start, has always been a very iterative kind of process. You're always looking at the data, building a model, thinking of something new, and rebuilding the model again to try to get a performance improvement or a prediction improvement out of it. This has really shortened those life cycles. So you have to do a lot more in terms of process for the quality control for models coming out, because things are going live a lot faster than they used to. That's one thing. The other thing is just the real-time nature of it, it's bringing everything to the front. So if there are issues or problems in predictive models that you may have had before, they're much more readily apparent. So you have to automate a lot of the human processes that were going on in the past, in terms of looking at model forecasts and in terms of looking at optimization and prediction policies.

And then in the marketplace, do you see the motivators more as a carrot or a stick? Is it more, you'd better get with the program because your competitors are really upping their game using these types of technologies, or do people still have the opportunity to kind of get out front?

I think people do still have the opportunity to get out front. Spark has been around for a few years, but it is still relatively new in industry. But it's definitely gonna become a stick pretty quickly, I think. People that aren't gonna be able to compute at this level of capacity aren't gonna be able to provide their customers with the kind of services that their competitors can. So it's definitely gonna be a competitive advantage for those that get in early.
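Here is a minimal sketch of the direct-query pattern he describes: data already sitting in Hadoop, registered as a temporary view, and explored with plain SQL instead of a developer-written extraction job. The path and the schema are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("adhoc-query-sketch").getOrCreate()

    # Parquet files written into Hadoop by the existing batch pipeline.
    tx = spark.read.parquet("hdfs:///commerce/transactions")
    tx.createOrReplaceTempView("tx")

    # Analysts can now pull "simple information" directly with Spark SQL,
    # no custom extraction job required.
    spark.sql("""
        SELECT product_id, SUM(quantity * price) AS revenue
        FROM tx
        GROUP BY product_id
        ORDER BY revenue DESC
        LIMIT 10
    """).show()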
And for those who don't get in early, what are the characteristics of their environment that you would look for? Would it be, like, fragmented data? Would it be sort of inefficient processes even just to populate the data warehouse, things like that?

It really depends upon the industry that you're talking about. Fragmented data is still very much a real problem, particularly in the area that we work in, with marketing and merchandising. A lot of the data that you really want is in different data centers, much less different databases within a data center. We see a lot of that coming together, though, as people move more towards streams of information rather than databases of information. So one of the projects that we've been working on is actually connecting all of the IBM tools across merchandising and marketing so that they can all flow data to each other, live. And that's going to take advantage, hopefully, of this Spark technology on top of that, to be able to do real-time analytics as the data's actually flowing through the systems, as opposed to storing it before you're doing analytics on top.

Okay, go back again to that. Which of those, what are the repositories that you're now flowing instead of storing? And how do you either correlate them or make sure that they match up?

So there are different data streams that we get for different applications. For merchandising, I talked a little bit about that transaction log data that comes through. If you're looking at marketing applications, there'll be things like email campaigns and responses to email campaigns. Online, you're looking at things like DSP bid auctions, where people are bidding on ads on the web. All of that is data that you can now flow through an exchange that we're starting to create that crosses those applications. And then you can layer in Spark, and a few other technologies like Kafka on top of Spark, and start doing calculations on the fly as the data's actually moving through the exchange.

So again, talk about how that completely changes the game. I love the way you say that it's a stream versus a database, and the way that now you start to think about how you build your applications, and the actions you can take from your applications. And, as you said, to get predictive, but even another level, and prescriptive. I mean, how are they really taking advantage of that? And is it just a quantum leap in a way of thinking, when you've been really working in a batch, delayed mode, to actually even contemplate, what do I do with this stuff that's coming in real time?

So on the commerce side, everyone's been talking about personalization for a long time, right? People want a personalized experience, whether you're talking about marketing or merchandising, and that's becoming more and more the case. Because I can look in real time and see what you're doing across particular channels, so what you're doing on your mobile phone versus what you're doing on the web versus if you're calling into the call center, I can take that information and allow the retailer or the brand to tailor their interactions with you based upon what's happening in real time, as well as what's happened in the past. So if you call into a call center for a company, the call center person can instantly pull up the last five or six interactions that they've had with you and instantly have context, so that they can better serve you in that call or with the next purchase that you're going to make.
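Here is a minimal sketch of that Kafka-plus-Spark pattern: events flowing through the exchange are consumed and aggregated continuously, rather than stored first and analyzed later. The broker address, topic name, and event schema are hypothetical, and the Kafka source assumes the spark-sql-kafka connector package is available.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   DoubleType, TimestampType)

    spark = SparkSession.builder.appName("exchange-stream-sketch").getOrCreate()

    schema = StructType([
        StructField("customer_id", StringType()),
        StructField("channel", StringType()),   # mobile, web, call center...
        StructField("amount", DoubleType()),
        StructField("event_time", TimestampType()),
    ])

    # Consume JSON events as they move through the exchange.
    events = (spark.readStream
                   .format("kafka")
                   .option("kafka.bootstrap.servers", "broker:9092")
                   .option("subscribe", "commerce-events")
                   .load()
                   .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
                   .select("e.*"))

    # Per-channel spend over five-minute windows, updated on the fly.
    per_channel = (events
                   .withWatermark("event_time", "10 minutes")
                   .groupBy(F.window("event_time", "5 minutes"), "channel")
                   .agg(F.sum("amount").alias("spend")))

    (per_channel.writeStream
                .outputMode("update")
                .format("console")
                .start()
                .awaitTermination())

The cross-channel view he describes, mobile versus web versus call center, falls out of grouping the same unified stream by channel or by customer.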
Yeah, and the other one, I forget where we were discussing this, you know, if that's done well, it's magic, right? If it's done poorly, it's creepy. It's weird, right?

Exactly, yeah.

But this becomes, I mean, I gotta go back to this famous example, where when we went from steam power to electric power in factories, the productivity advances were so enormous, but it took decades for the stuff to get deployed. Because one economist went out and found that when it was steam power, the factories were built tall, because there was one rotational shaft and everyone had to draw their power off that, and you basically had to knock the factory down and put in the equivalent of, like, assembly lines or things like that. I mean, the things that hold our data right now, these old operational systems, it's not that easy to pry them loose and sort of remodel the data. So you're talking about, okay, yes, we can stream this stuff, and it used to be in databases, but tell us how hard it really is.

It's not easy at this point, but we're making steps in those directions. I don't think that data repositories are ever gonna go away. You're always gonna need some historical store of the data, but the more that you can do live, in real time, generally the more relevant the decisions you're gonna be making are for the customers that you're trying to serve. To your point about the factories, I mean, that's a very vertical structure that you talked about. What this technology is doing is actually spreading the data across different structures, right? So you can load the data into Hadoop if you wanna do things in batch, and then stream it through Spark, if you wanted to use Spark on top of that. So I see it more as breaking down those barriers than adding to the barriers that are already there.

Oh, you mean because you don't have to hold just one copy of the data.

Yeah, and it doesn't have to be structured, right? It doesn't have to be structured in a formal database format the way that it had to be in the past. You can just dump it into a flat file, or into Hadoop, or however you wanna store it, and then read that content into Spark, and it automatically knows what to do.

I'm curious to see, too, how the perceived value of the data stores, the data sets, changes, right? 'Cause I would imagine they would have thought that the historical transactional data that's been in the ERP system forever is the high-value data. And, Robert, are you telling me that I should be paying attention to Twitter streams and Facebook comments? But with the temporal element added, I would imagine that there's a significant value shift towards streaming, if not real-time, near-real-time actionable data that you can do something about, versus really more kind of a historical picture of what's going on.

I think that's absolutely the case, but you do need the context of the history. You need to understand where that particular customer is coming from in order to be able to serve them well in the moment that you're in right now, with the decision that they need to make.

So we've heard this theme before. You need history and real time. And the idea is, well, one of the technical implementation issues, which is important, is can you deal with both with the sort of same programming model or the same platform, which, from what we understand, Spark helps facilitate.

Absolutely, yes.
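Here is a minimal sketch of the "it automatically knows what to do" point: Spark can infer a queryable schema straight from raw flat files, so nothing has to be modeled into a formal database format up front. The file path and field names are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema-inference-sketch").getOrCreate()

    # Raw JSON events dumped to flat files; Spark samples them and derives
    # column names and types on its own, nested fields included.
    raw = spark.read.json("hdfs:///dumps/events/*.json")
    raw.printSchema()

    # The inferred DataFrame is immediately queryable, no upfront modeling.
    raw.groupBy("channel").count().show()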
But when you're going into an organization, let's say you've got a CIO or a VP of merchandising listening, and he says, okay, this sounds great, but I do have my old systems of record. How do I start? What are the first few steps?

It really depends upon the industry that you're in. There are a lot of other things you have to consider, right? There's privacy and security, which I think you hinted at a little bit earlier. That's a major concern that we have. This isn't gonna be something that happens overnight, but we think within the next five to 10 years this is really gonna take off, and that's why IBM is getting behind it in a very, very big way.

Right, right. And how do you define, this always comes up, real time? What do you define real time as?

Again, it depends upon the context. So if you're talking about things like, we talked a little bit about DSPs serving ads, real time there is 10 to 20 milliseconds. It's very, very fast. Real time in kind of a brick-and-mortar retailer could be daily. It really depends upon the context of the decisions that people are trying to make.

We always like to say real time means they have time to do something about it. It's not necessarily that instant. All right, Robert, so we'll give you the last word as we get towards the end of the segment. What are you most excited about? What has you waking up in the morning, knocking down that coffee, charging out of bed, that's really getting you going?

Well, we're really excited about some new products that we're developing using the tool. One of the ones that I'll talk about at the conference today is a product called Journey Analytics. It actually helps retailers and brands connect in a more effective way across the different channels, so across mobile, across web, using the real-time data that I talked about. I'm also personally branching out into new areas. I've never done anything in finance before. We're starting to do things in payment analytics that can now be real time, because bank payment systems are now moving more and more towards real time, away from batch, as well. So that's a whole other huge industry that's opening up, and we think that Spark could be a big part of that technology stack.

And then, clearly, the interaction between any customer and their provider is just more and more electronic, and it continues to be more so. It's funny, you brought up call centers a couple of times. I don't know that I'd want to be in the call center business these days, because everything is moving to self-service. We had a quote at another show where somebody's kids set up their insurance and their car, and they said, Dad, I don't want people, they just get in my way. If I can't do it self-service on the web, I'm not doing it. Well, that's exciting stuff, Robert. So thanks for stopping by and sharing that good info.

Thank you. Absolutely.

So, I'm Jeff Frick. You're watching theCUBE. We're at the Apache Spark community event at Galvanize, right across the street from Spark Summit. We'll be at Spark Summit tomorrow. I'm sitting here with George Gilbert, my co-host from Wikibon. Thanks for watching. We'll be back with the next segment after this short break.