From San Jose in the heart of Silicon Valley, it's theCUBE, covering Big Data SV 2016. Now your hosts, John Furrier and Jeff Frick.

Okay, welcome back everyone. We are here live in Silicon Valley for Big Data Week, Big Data SV for Silicon Valley, and also Strata Hadoop happening right across the street. This is theCUBE, SiliconANGLE's flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier, with my co-host Jeff Frick for this segment. Our next guest is Steven Sit, who's the director of product management for Hadoop and Spark at IBM. Welcome to theCUBE.

Thank you, John, nice to be here.

So obviously the center of all the action is the Hadoop world, renamed Strata Hadoop, but really it should be called Big Data World, because essentially it's about the platforms and the apps. You're seeing the shift in the power of the cloud really driving this, and we kind of predicted this last year. So we're seeing it play out. Hadoop has always been central for storing data, but now Spark has been very, very hot. We covered Spark Summit East; we've got Spark Summit West coming up. There's a lot of action in Spark, and it's almost as if the little brother comes out and grows up faster than the older brother, if you will, with Hadoop. There's a lot of interesting posturing around Hadoop versus Spark, and one of the things that Joel talked about earlier was the misconception that you have to have Hadoop in order to run Spark, that there's some sort of sequence. What's your thoughts on that? Is he right? Is that what people think? Is Hadoop actually required? What's your take?

Yeah, definitely. So first of all, good to be here. Thanks for inviting me. So Hadoop is at its 10-year anniversary already, as we all know. When we first look at the adoption of Hadoop, it mostly started for cost reasons, right? There's a massive amount of semi-structured data, and particularly some unstructured data.
It's not cost-effective to store in the traditional data warehouse environment, so people put it in Hadoop, right? And then on top of that, they run analytics. So that's how Hadoop actually started and got adopted in the early years. And over time, just a few years ago, we started to see the emerging pattern around data lakes, where people actually use Hadoop to augment the data warehouse environment, right? You bring all kinds of data into a common area. You do ELT as opposed to ETL, right? You clean the data, combining different data sources, including unstructured data, and then feed that into your front-end warehouse environment to deliver business reports and applications. So that is a very common pattern we're seeing.

Of course, there are still some difficulties with that kind of architecture. Primarily, interactivity is one issue: the speed, for example. Because Hadoop was originally built on a more batch-oriented approach, right? The outcome is a great environment that many corporations have naturally built. But getting wider adoption from the lines of business, at least based on our experience, has been somewhat challenging. So yes, augmenting the warehouse is fine, but we want to open up the data for access, for data scientists and business users to leverage that environment. That's been challenging, because there's not enough interactivity, or tools that can help them with that.

So is it a tooling platform issue, or is it more of speed to value?

I think it's a wide range. I think it's both, right? And as a result of that, at least based on what we've seen, people are still doing a lot of siloed applications. People like spreadsheets. Many people live in a spreadsheet that might be sitting on their laptop, and they're still working on a database sitting under their desk, right? So there is a lot of that issue.
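The ELT-versus-ETL distinction Steven draws, landing raw data in the lake first and transforming it there on demand, can be sketched in a few lines. This is a hypothetical, minimal pure-Python illustration; the record layout and all names are invented:

```python
# Minimal ELT sketch: land raw records untouched, transform inside the lake.
# The record format and names here are hypothetical.
raw_records = [
    "2016-03-30,web,OK",
    "2016-03-30,mobile,",        # dirty row: missing status field
]

# ELT step 1 -- Load: the "data lake" keeps everything as-is,
# including rows a classic ETL job would have rejected up front.
data_lake = list(raw_records)

# ELT step 2 -- Transform on demand, per consumer, inside the lake.
def clean(record):
    date, channel, status = record.split(",")
    return {"date": date, "channel": channel,
            "status": status or "UNKNOWN"}

warehouse_feed = [clean(r) for r in data_lake]   # feeds the front-end warehouse
```

In classic ETL the `clean` step would run before loading, so the raw uncleaned history would never reach the lake; ELT keeps it around for later reprocessing against new requirements.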
But that, of course, is a cultural issue as well. So we certainly believe that to get big data to the next stage, with wider adoption from conventional business organizations, it needs to be more interactive, more consumable. And Spark is certainly a key part of that, as we've started seeing it pick up the pace in the last two years. That's the main reason for us to really invest in Spark and build on top of it.

What's interesting, Steven, and I want to get your thoughts on this, is that the political landscape within the community is certainly changing. We're hearing it in our community, and we certainly heard it in the hallways last night. And you mentioned the line of business, which validates the point. It's interesting, because the lines of business have really fun, exciting projects that they're working on. It's forward-thinking. It's forward-facing. It's revenue-generating. It's applications. It's sexy. It's fun. And then what we're hearing in the hallways is that there's too much back-office wrangling of clusters. There's a lot of heavy lifting going on behind the scenes on just basic management of Hadoop, and in some cases Spark, but mostly Hadoop. It's just too hard. So the inhibitor is speed: it's just too hard, and I can't find talent. Is Spark having the same growing pains that Hadoop's having? And what's making Spark so successful in your mind? Is it the fact that it's easier to integrate? I'll get to the data science question in a moment, but answer that first: the role of speed, Spark, and the difficulty of Hadoop.

Yeah, I certainly do want to give credit to Hadoop. It's a great system. It's very important for us, strategic for us, and our product is built on it. I think what Spark brings to the table is really not only the performance. Performance is one thing; the in-memory computing and all of that is great.
But from my point of view, the other key value it brings is that it unified the set of APIs, if you will, really allowing the data engineers in the corporation, the data scientists, and the application developers to work together more seamlessly on a common goal. And we are seeing that on the app side. Many of the lines of business, the customers we work with, are saying data warehouse augmentation is great, but how do I leverage the data so I can deliver new applications that can fundamentally transform my business? And to do that, you really need these three types of players to work together on a new system. Now, Hadoop is very scalable, which is great, but the drawback is that it has all of these different pieces in the ecosystem, with different sets of APIs and different sets of tools. As a result, it kind of forces these players to work somewhat in silos, right? Spark gives you the opportunity to tie them together. Unify them.

OK, so I know Jeff wants a question, but I want to continue the thread, because now you throw the role of the data scientist into the mix. What I'm observing, in talking to folks and seeing it over the past year, is that the data scientist is stuck in the middle between the Hadoop community, which has always been a big part of this show, Hadoop World, and what Spark is. Because the data scientist just wants the data available. They want to do R and/or Python, whatever tools they're using to manage the data. But they're leaning more towards Spark and the line of business, because that's where their action is. They don't want to get stuck in the wrangling and the setup requirements of the disparate back-end stuff.

Yes.

Are you seeing that? And what's your thoughts on the data scientist? Because they're an important part of this.

Absolutely. And I think that's one of the key benefits of Spark from our point of view, right?
It really doesn't limit the innovation in the way that data scientists actually work. If they prefer to work on a smaller set of data to build their models initially on their laptop, that's perfectly fine, right? Or if they say, I have a critical set of data I could really use, what we call, for example, the systems of record sitting in, say, IBM mainframe systems, then they should be able to submit a job, actually run it there, and just see how the results come back, right? But then Spark also gives them the opportunity to say, OK, we do have the data lake environment where all the data is brought together, and you can test your model against a massive amount of data in a real cluster environment, in jobs that potentially run for a longer duration. The result is that you get much more accurate models, because there are more features and more data you can leverage, right? And that can enhance the application. So there's a lot more freedom for the data scientists, is how we see it.

OK, so I've got a tricky question for you. You're at a Fortune 10 company, you're talking to the CIO, a big IBM client, and the CEO walks in. He says, Steven, explain this to me. For the CEO: I've got this Hadoop thing I keep hearing about, this Spark thing I keep hearing about, and I've been spending a ton of money on BI for years and years. How should I think about it? Again, for the CEO, quick and dirty: how do those buckets line up, horses for courses, and what does each one do best?

I think you've got to think about the end result of what you want to accomplish. That's really, really important. Many CEOs' agendas are really about the new imperatives, the objectives and strategies with which they want to transform the business. I think they shouldn't focus too much on the infrastructure side, the technology side, unless they have a very clear view of where they want to go.
And that many times translates into new applications that can serve their customers and partners better, in some innovative way. We see it all the time, like Uber and Airbnb. Those are good examples of all of a sudden having a new business model, fueled by technology, that disrupts the traditional business in a big way. So those are the things on the mind of the CEO. We believe that in that conversation, we need to help them really see what they want to deliver from the application standpoint, and how the back end can help them get there, versus starting with the technology and saying, OK, I've got to build this Hadoop environment, I've got to use Spark, I've got to use a whole bunch of other technologies, without a clear goal in mind.

But is Spark then the enabler of the transformation via apps that neither Hadoop nor traditional BI ever was?

We certainly believe so, especially in enabling the data scientist part. Because we believe the future of applications is all about intelligence built into the application. They've got to be smarter than what most apps are today. So in order to have that intelligence built into the app, you need to have the data scientists, the rest of the infrastructure, and the application development team working hand in hand in a very agile approach. Because we know how data scientists generally work, right? You have a problem. You collect some data. You build some models. You test the model. You deploy it and see how the reaction is. And if it's not good enough, you probably want to bring it offline, bring in some new data, find some new features, and enhance your model. Think about that whole process without the agility and the speed to do it: it's going to be very hard to iterate through to deliver the right application for your customers.

And Spark? It begs machine learning too. As you're saying, smart applications, a new class of applications.
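The collect / build / test / deploy / refine loop Steven walks through can be sketched abstractly. This is pure Python; the toy "model" is just a learned score threshold, and all names and numbers are invented:

```python
# Toy sketch of the model-iteration loop: train, evaluate, bring the model
# offline, add new data, retrain. Data and names are hypothetical.

def train(samples):
    """Learn a score threshold that covers every churned user seen so far."""
    churned = [score for score, churned_flag in samples if churned_flag]
    return max(churned) if churned else 0.0

def accuracy(threshold, samples):
    """Fraction of samples where (score <= threshold) matches the label."""
    hits = sum((score <= threshold) == label for score, label in samples)
    return hits / len(samples)

# Iteration 1: collect some data, build a model, deploy it.
batch_1 = [(0.2, True), (0.9, False), (0.8, False)]
model_v1 = train(batch_1)

# Reaction not good enough? Bring it offline, add data, retrain, redeploy.
batch_2 = batch_1 + [(0.4, True), (0.95, False)]
model_v2 = train(batch_2)

assert accuracy(model_v2, batch_2) >= accuracy(model_v1, batch_1)
```

Spark's contribution to this loop, in the terms used above, is making the retraining step fast enough, and against enough data, that the iteration stays agile.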
You think, well, machine learning, Watson: letting the machine help modify the application based on behavior as well.

Absolutely. Yes.

In the last couple of minutes we have here, I want to get your thoughts on your priorities from a product management standpoint. Obviously Hadoop and Spark; what are the priorities for the products you're managing? What are you working on? What can you share with the folks out there from an IBM standpoint? Quickly share what you're working on.

Sure, absolutely. So in terms of our strategy, we really have two parts. The first part is this ecosystem of open source capabilities, including Hadoop, including Spark, and any upcoming capabilities as well. Open source is really great: you have the communities, you have very smart people working on it, not just from IBM but from many other companies as well. So certainly we want to leverage that, but we are also contributing to it. Even from the early days, we have had IBM Research technology: System S, System ML, System T. Of all those great things from IBM Research, we're contributing a significant part, like System ML, into open source. And we also have our STC, the Spark Technology Center, which we just opened up. By the way, we've only done that twice before, for Linux and for Java. Now we have an STC for Spark. The folks there, their job is to enhance open source and make Spark more robust. So open source as the foundation is a key part of our strategy for the platform. And the second part of our strategy is to make sure that our capabilities that sit on top of the data platform leverage the open source, and especially Spark.

Yeah, that's the decoupling you're talking about, making it flexible.

Making it flexible, and making it able to leverage all the great innovations, including our contributions, that are happening in open source. So we have, you know, as you know...
It's a great hybrid strategy. You get the open source effort, you're fueling that, and it lifts the market up. It's collaborative, and there's no real agenda other than supporting more great open source and then having your products add value on top of that. That's pretty much the IBM strategy, right?

Yes, yes, yes.

Okay, so speaking of Hadoop and Spark, we will be at Hadoop Summit in Dublin next month. TheCUBE is going to Ireland, so we'll be there having a few pints of Guinness, talking about batch and real time, Spark and Hadoop. Steven, thanks for sharing your insights here on theCUBE. Appreciate it. Pun intended; that's your product, Insights at IBM.

Thanks very much.

This is theCUBE, bringing you insights and extracting the signal from the noise. We're here live in Silicon Valley for Big Data SV and Strata Hadoop. We'll be right back with more after this short break.