at the Galvanize campus in San Francisco, it's theCUBE, covering the Apache Spark Maker Community Event, brought to you by IBM. Now, here are your hosts, John Walls and George Gilbert.

And welcome back to San Francisco here on theCUBE. We continue our coverage, our week-long coverage, basically, of Spark week here in San Francisco. We're at the Apache Spark Maker Community Event all day today, then over to the Spark Summit we go for Tuesday and Wednesday coverage, live streaming, so be sure to stay with us all week here on theCUBE. I'm John Walls, along with George Gilbert, and we're joined now by Rob Thomas, who's the VP of Product Development for IBM Analytics. And Rob, thank you for being with us. Good to see you.

Great to see you guys. Great to be here for the week.

Set the stage for us, just in general, about IBM and the Spark community. It seems like a very short year ago when you really caught fire with this and gave that wholehearted endorsement and complete engagement. Tell us about the year, and where you think this is headed now with all the energy you have behind it.

It's been an unbelievable year. Last year I talked about Spark as being the analytics operating system, and I said at the time we thought this was the most significant open source project for IBM and enterprise IT since Linux. Frankly, my conviction in that has grown in the last year as we've seen what's happened in the community: contributors and committers coming to the table, enterprises adopting Spark. The number one question I get when I'm talking to clients is not "should I do Spark or not?" It's "how, and how fast, and which direction do we go?" So it's been a year of huge progress, but there's still a long way to go.

And when you talk about the progress that you have made, what do you think inspired that the most? Is it a combination of early success and proof of concept, and then competitive energy as well?
Or what's really been, you think, the big instigator of what you've done in the past year?

One thing we did was launch a Spark customer council about a year ago, which is basically about 30 clients from a variety of different industries. We get together with them quarterly, sometimes more often than that, and we started asking them: what do you need to see in the Spark project? What's slowing adoption? What are you trying to speed up? You do that over the course of a year and you start to see a lot of patterns in what people want. People definitely want Spark SQL, in the sense that they want a simpler way to apply their SQL skills to the Spark ecosystem. They want machine learning. Machine learning has been talked about now for 50 years, I think the phrase was coined 50 years ago, and it's just now coming into, I'd say, relatively mainstream use. So people want machine learning, and then people want what I'd call the basics: the stability of the system, that type of thing. That's where all of our contributions have been. We're now the number two contributor to the Spark project, and most of it's in those areas, around machine learning and Spark SQL, because that's what clients are asking for. They want to bring this into their enterprise, but every enterprise has a bar that things need to get over in order to come to the table.

So this is a case, then, it sounds like, of the market driving the technology and not the other way around. You're hearing what users want, what they need, where the gaps are, and then you're going about trying to fulfill those needs.

Which is very different than a year ago, because I think a year ago it was really about technology driving where this was going. That's something we've embraced, but one thing I talked about last year here at Spark Summit was that we need to start to bring this to the business community, which means getting to that next level of requirement.
It was interesting to hear you say that people want Spark SQL and ML, or machine learning, as top priorities. Let me ask you to slice that a different way, in terms of use cases. When we talk to people with Internet of Things use cases, they need higher-performance, lower-latency streaming, to be technical, like evaluating each data item as it comes in on a small edge device, whereas it sounds like what you're describing is up in the cloud: let me make sense of the huge torrents of data coming in that haven't been filtered. Where are your customers, not just among those use cases, but are they saying we now want to replace MapReduce in Hadoop? What's the sequence of dominoes that you have to knock over, in a little more granularity?

Let's talk about the dominant use cases for Spark. One is around real time, and this is companies that want to provide self-service access to data, and that has to be done quickly. So real time is a big use case. Another is around what I'd call the 360-degree view of the customer and customer relationships: moving from marketing to markets, to marketing to individuals, or personalization. That's a dominant use case. And then, to your point, George, IoT is a very dominant use case. We open-sourced Quarks earlier this year. Quarks is basically an embedded runtime of our streams engine, IBM Streams, which we open-sourced as an embedded component for IoT use cases, so that data can be streamed back into a Spark environment, as an example. So those are the three dominant use cases that we've seen, and it's all about how you enable those to actually get to business outcomes, because I think we're past the days of people looking to open source just to reduce costs.
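Quarks itself (since renamed Apache Edgent) is a Java runtime, but the pattern Thomas describes, evaluating each event on the device and streaming only the interesting ones back to a Spark environment, can be sketched language-neutrally. A minimal Python illustration, with the sensor payload, threshold, and values all invented for the example:

```python
# Hedged sketch of the edge-filtering pattern behind Quarks/Edgent:
# evaluate each reading as it arrives; forward only anomalies upstream,
# so most data never leaves the device.

def edge_filter(readings, threshold=80.0):
    """Yield only readings whose temperature exceeds the threshold.
    Everything else is dropped at the edge, which is the point of an
    embedded runtime on a small device."""
    for reading in readings:
        if reading["temp_c"] > threshold:
            yield reading

# Simulated engine-temperature stream from a connected car (invented data).
stream = [
    {"vin": "V1", "temp_c": 72.0},
    {"vin": "V1", "temp_c": 91.5},   # anomaly, gets forwarded
    {"vin": "V1", "temp_c": 75.3},
    {"vin": "V1", "temp_c": 88.0},   # anomaly, gets forwarded
]

forwarded = list(edge_filter(stream))
print(len(forwarded))  # only 2 of the 4 readings travel back to Spark
```

In a real deployment the forwarded events would go over a connection to the cloud-side streaming ingest rather than into a list, but the filter-at-the-edge shape is the same.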
People are looking for where's the value, where's the higher-order use case, how is this going to grow revenue, how is this going to expand my customer relationships, how is this going to reduce churn, and that's what we're doing with Spark.

So let me ask: you have, let's say, on the back end, the IBM Bluemix cloud and all the data services that now make a platform instead of a set of individual products. The customer comes to you and says, help us implement this Internet of Things use case, but obviously I need more than just what's on the edge; I need something up in the cloud that makes sense of all this. You have templates, or experience from other similar customers. How do you bring that to bear? What does that look like?

Well, one of our core tenets in IBM Analytics is: make data simple. This is a problem that, if you look over the last 20 years, has really never gotten easier. Unfortunately, it's actually gotten more complex every year as we've moved from databases to warehouses to Hadoop. Our mission is to make it simple. I'll give you an example. In an IoT use case, we do a lot of instrumentation of cars: you hook a car up with something like Streams, where you're constantly looking at data coming off the car. To bring that back into a homegrown environment by stitching together a variety of proprietary products is incredibly difficult. So we take that complexity away. What we're delivering on our cloud platform is a set of composable services that says: all you really have to do is get the connected pipe, the real-time streaming, to the vehicle, and we can handle the rest. We will put that data right into what we call a fluid data layer. Data is ingested at very high speed and landed in object storage. Then you can choose whether you want to use Hadoop or a warehouse or some other type of capability.
Spark is the processing engine that runs across the top of that. We are making data really easy to interact with so that companies can focus on: who are the users they want to serve? What are the tools they want to provide those users? They don't have to worry about the plumbing.

Okay, let me take you up perhaps one level, between the plumbing and the app. I asked this of Bob Picciano, I think the day at Insight last year when you announced buying The Weather Company, which you also tipped to us was not just about weather but about an IoT-ready platform, whether it's connecting cars or other forms of transport, mobility, whatever. Are there models that you've used with other customers that you might leverage or enhance in doing a solution for the next customer, or is that considered the customer's IP?

Well, there's a difference. Certainly, in terms of what I talked about, this fluid data layer and integrated services, that's for everybody. When you get to the next layer up the stack, where you're looking at actual data models, actual data assets, that's obviously unique to a customer, and we protect that diligently. But there is still something to be learned just through going through that process, in terms of how people build models and where they get value out of this. And I think we can bring that expertise to the next client without sharing any intellectual property.

Accenture talked to us about this. I asked a similar question of, I forget what his role was, but he said: we create an analytic data record, and that was their unit of repeatability. Listening to you, I'm parsing this, but it sounds like you don't have that exact same standardized template. You'll say: we have knowledge of how to build the model that applies to transport, but we're not going to build the exact same thing for the next guy.

Yeah, I mean, every use case is unique, right?
What we bring in terms of consistency is that we'll do data governance in a consistent way: how we create your metadata, how we catalog your data. And we can learn a lot through doing that, so it's very repeatable. But we're not trying to share what one client does with another client, as an example.

Interesting. So, in other words, a customer who believes they want to work with you to capture or implement something that's a strategic advantage would likely choose you because you're trying to have domain expertise but not actually share the IP.

Yeah, I would guess that anybody would struggle to actually share IP from one client to another, so that's something you have to be very careful about. But I think the reason a client chooses us is that they're looking for expertise in implementing the use cases. You mentioned The Weather Company. Part of our attraction to them, in addition to the tremendous people they have, is that they had built one of the only commercial internet-scale platforms we've seen outside the likes of a Netflix or a Facebook. In terms of the events they process, the locations, this is at that level of internet scale, and that takes years to do. The team had done a tremendous job of it. Being able to take advantage of those types of capabilities as we build on our cloud platform, this would be like Facebook saying every other social media company can use their platform. They would never do that. But we're able to do that because we're serving enterprises and saying: look, anybody can have access to this internet-scale commercial platform. That's pretty powerful.

You were talking earlier about complexity and making data simple; that's the mantra. How do you do that today with all these inputs? You've got mobile and social and embedded sensors and the Internet of Things, and you have all these massive amounts of data coming in, and then a variety of outputs.
You're customizing it for different clients, so why does Spark allow someone to do that better than any other computing framework out there?

What Spark does really well is give you universal access to data. The traditional IT model has been a very rigid stack: we store data in this place, we hook up one application or one BI tool, and if you want to do something else, okay, now ETL that data over and reconstruct the same thing. Spark gives you universal access, whether that data is sitting in Hadoop or in a warehouse or in a database or on a mainframe. It could be anywhere. So Spark gives you that very powerful tool. Now, to your point on how you actually make data simple, this is about understanding personas, the users. We've done a lot of research in the market on what the dominant personas are and how they want to access and interact with data, ranging from the data engineer, which is really more about data movement in an organization, to the citizen analyst, to the data scientist, to people who just want to develop their own data applications. Those are really the four personas that we've found dominate the market right now. So our focus is: how do we make it incredibly simple for those four types to access this broad data platform? Because we think if we can nail that, then we can add other users and continue from there. But you've got to pick a place to start.

And you mentioned data science; we even touched on that. You're talking about a discipline, an area, that maybe five or ten years ago was elementary at the very least compared to where we are now. How has that been impacted? And where do you see that in terms of your focus in developing that community? The commitment, right? One million data scientists, and looking to encourage that kind of expertise. That's pretty core, it seems, to what you're up to.

It is. Let me give you an analogy.
I've studied a little bit about Formula One, and what you find in Formula One racing is that it looks very much like an individual sport from the outside: just a driver driving the car. The reality is there's an entrant that actually enters the car and pays the money. There's somebody that constructs the chassis for the vehicle. There's a pit crew that could be double digits in terms of people. That team is what goes and wins a Formula One race. It's not just the driver. Exactly. It's a huge team that comes together around this chassis. Now, in the analogy, we kind of use Spark as the chassis. That is the core thing that holds everything together, but then you have to bring different applications to get the right precision, the right performance. And that's what we're doing with data science: we're trying to make data science a team sport. People think of data science as just the individual, the one person, and they know one language. Our focus is: how do we truly make it a team sport, to move beyond just one person working on their own? Because that will unleash the power of data science. It's why it's remained a bit of a nascent market, if you will, because there's never really been that collaboration aspect. So much like the teams in Formula One get beyond the driver and really act as a team, we're trying to do the same thing in data science.

Let me ask you about a couple of the assets you've brought together and built up. If you take The Weather Company and its infrastructure, sort of the distributed analytics and data collection infrastructure, and then you take what Derek Schoettle talked to us about, the analytic data services, and then the data science experience, really for the four roles, are we seeing all those come together as the new IBM platform?
And, one last part to a long question: does it have the, I don't know if the right word is unit volume, to compete with an Amazon, where the fixed costs are the design and then the running costs of the services?

So I wrote a blog post a couple of years ago, or maybe a year ago, where I laid out what I thought the next-generation data stack looked like. At the repository layer, you had things like Hadoop and mainframe and warehouse. The next layer up was Spark, really that analytics OS or execution layer. The next level up was around machine learning and how you build algorithms. The next level up was a programming framework, and the next level up was applications. I think that stack is still valid, and all of the personas I talked about, they may not integrate with or access every layer of the stack, but the stack is relevant to each of them. So that's what the next-generation data stack looks like. In terms of scale, the scale we acquired with The Weather Company is tremendous. Like I said, there are very few platforms that touch it in terms of volume. So we have a huge competitive asset there, but being able to deliver that integrated set, this next-generation data stack, is very unique. I don't think you can find anybody else that has all of the aspects I described.

Well, Rob, I know you spent time with us last year, and I know you made some predictions that actually were surpassed, I think, in a positive vein. I'm not going to ask you to make predictions for the coming year, but I'm sure you share with us the excitement for the Spark community and where this is going.

Absolutely. It's been tremendous uptake. We're really happy now to be number two in terms of contributing to the project. There's still so much more to do, though. We're going to have our customer council again later this week and get some more input on what we have to go do next.
But, you know, if Spark is a nine-inning ballgame, I think we're probably in the second inning at this point. So it's still very early. There's a lot more to do.

You've still got your starter working out there.

Exactly.

Good deal. Rob, thanks for the time.

All right. Great, guys. Good seeing you.

As always, thanks for being with us here on theCUBE. Thank you. Back with more from San Francisco right after this.