You are watching theCUBE, live in New York City for a special presentation, SiliconANGLE's flagship program. We go out to the events and extract the signal from the noise. Again, live in New York City, 100 yards away from the Javits Center, where Strata Hadoop is going on in conjunction with Big Data NYC, our event here. We're covering all the action at the event. Our next guests are from IBM: Rob Thomas, Vice President of Product Development, Analytics, and Joel Horowitz, VP of Marketing, Analytics. Guys, welcome back to theCUBE. Great to see you guys. John, great to see you. Thanks. IBM's all over the news. Obviously we saw the big announcement last week. AI, Watson's the flagship. You know, everyone relates to the Jeopardy thing. I always roll my eyes, yeah, there's the Jeopardy, you know, but that's common. People can relate to that. Analytics is huge, right? Obviously you guys know that. We've talked about that. Hadoop is under the hood. There's some action going on that is powering the future of analytics, which, as we talked about in our intro, is the key to value, which is the decisions. Asking the right questions. Automation. These are things that we've talked about in the past. So what is the big thing going on this year under the hood in the ecosystem from a technology perspective? Rob, I'll shoot first to you. What is the key enabler now? Is it more confusing? Is it consolidating? Are people going to be spun out of the ecosystem? Will people die from acquisitions? Give us the data. So my view, when I go talk to clients, I talk about a big data maturity curve that kind of says, look at what we've been doing for the last five years. Many clients started with, we're looking at data, and it's just about reducing the cost of storage. And then they started saying, can we extend our warehouse? And now we're kind of at the inflection point where it's about line of business analytics and applications.
And then taking that a step further, how do you start to build business models around data? So it's been a major change and we've slowly moved along that curve. So what does that mean in terms of technology? Hadoop is commoditizing. It's becoming the storage layer. That's pretty evident. The reason we've made such a significant bet on Spark is that, while Hadoop is about storage, and that's an important part of the architecture, Spark is about analytics. So Spark moves you faster along that maturity curve. So when we get to talking about clients, like we talked about Independence Blue Cross, who's working with us on Spark, they're totally changing the way they do analytics of patients, of who is at risk to return to the hospital, as an example. Spark is what enables that analytics. It's not about just storing the data. So I've got to ask you, should they just call this Spark World? I mean, they changed the name of the show from Hadoop World to Strata Hadoop, Strata being O'Reilly Media's show. But in reality, the conversations are not just Hadoop, it's beyond Hadoop at this point. I mean, do you agree with that? So I think they should change the name of it, but when the main sponsors are Hadoop-only companies, then sometimes you end up with that name. But yeah, certainly if it was an industry conference, it would probably be focused on Spark or analytics at this point, because Hadoop is a key part, but less than 30% of companies have adopted Hadoop. So it's a great way to store data, but there's a lot of data in other places. So it's just a fraction of the picture. Merv Adrian just put out some data on Twitter, saying the adoption trends are increasing pilots, but slow production growth. Are you guys seeing that, Joel? What are you seeing out in the field?
I mean, that's essentially a gesture from Merv saying, look, the data just doesn't support rapid production at the scale people thought in terms of the TAM, to Rob's point about the Hadoop vendors dominating the event. The reality is different. Yeah, I mean Rob's point is accurate. Hadoop is kind of becoming more of a storage environment. Although people sometimes misname what they call Hadoop, it's actually a much broader ecosystem, as we all know. It's not simply a file system, it's actually a full community of a lot of different capabilities. And so I think what I'm seeing is that Spark is actually accelerating Hadoop and bringing a lot of the other data capabilities along with it, like other NoSQL databases such as Cloudant, DashDB that we have on Bluemix, and a number of other key data services that are coming onto the scene because of Spark's pull. So I think you brought up the fact that we opened Watson in the Valley, and I think that's telling. I think you're seeing a lot of innovation happening there. It's the epicenter of Spark, in my opinion, which is why we opened up the Spark Technology Center there. So it's exciting, and I'm really excited because of what Spark is enabling not just the developers to do, but a much broader community. Rob, I want to ask you a question, because I've been putting this conversation thesis kind of out in the open, in a public, open source kind of way, about the, you know, mutually exclusive notion of horizontally scalable versus vertically integrated, scale out versus scale up. There is a huge trend for diversity, where the demand from the customer base is, hey, I want scale out because it gives me good price performance and I have large scale systems to run stuff, but I also want vertically integrated or scale up point solutions, whether it's engineered or not. They don't want to get into the nuances of the religion of scale up, scale out.
So I've got to ask you, when you talk to customers, how do you lead that conversation with the big data holistic picture, and how does that relate to this thesis that we're kind of kicking around: does it really matter, scale up or scale out? Is that just going to be an integrated model? I mean, what's your view on that and what's the customer reality? So I don't know that that really matters from a customer perspective, because of where we take the dialogue. Think back to the maturity curve I described. The value in line of business analytics and in creating new business models comes from machine learning, which is why we contributed SystemML to open source. The reason we're working with Databricks on that is because analytics is ultimately about machine learning. What are you actually doing with the data? Scale up, scale out, it's probably really interesting for the guy running the cloud environment, but it doesn't necessarily change the outcome or the insight that a client's getting. Our focus is much more on how we start to get to distinct outcomes. Joel talked about the announcements last week with Watson West. I mean, Watson is the preeminent cognitive system where you can start to glean insights. You bring human features, add that to machine learning. You can deliver totally different outcomes. And so I think that's where the discussion should shift to, more than what's the architecture for the environment. That's important for us as we support clients, but that's not the discussion I see clients wanting to have. Yeah, and the other observation we're seeing here is that the Hadoop ecosystem, over the past six years that we've been covering it, just seems to be puttering along. I haven't seen a real game changing move from the Hadoop vendors now. You said it's more of an integrated solution with Spark. You're seeing that accelerated, but now it's other things that are going on around Hadoop. So I have to ask you, one of them is machine learning, right?
So everyone's bolting on machine learning into their announcements these days. So as a customer, how do I figure out who's got the real deal for machine learning? Rob, you're a technologist, so if I'm machine learning washing, just throwing machine learning onto my platform, I say, hey, we've got machine learning, we're good. We're going to have neural networks, we're going to have all this cognitive science. So you guys have been there for a while. I mean, obviously IBM has that. How does a customer decide who's got what? What's real, what's not? So you have to make it very easy for clients to get on board that train. So just this month, we announced that we've released a set of SPSS analytic algorithms that now run directly on Spark. And so take the thousands of data scientists that know SPSS really well: they can now run those algorithms on Spark, which opens up a whole different corpus of data for them to go after. That is machine learning, that is advanced analytics, but it's helping them take steps in that direction. Now Joel's leading some work for us, and I'll let him comment, around really a three day camp that we call Datapalooza, which will be about training the next set of data scientists. Joel, do you want to talk about that for a minute? Is there music involved in this? Oh yeah, there is. Okay, good. I think that's in my email, let me look at my spam folder, hold on, let me check. No, you're going to love it. We actually hired the band called Big Data, I don't know if you've heard of these guys, but they're awesome. And they'll be joining us on November 10th through the 12th at Galvanize, where we made our Spark announcement earlier this year. We have a full three day event. Where's that going to be? Galvanize in San Francisco. We have like 20 sessions signed up. It's a community event, it's not just IBM. You know, we have AMPLab involved. We have, you know, Typesafe involved. We have a lot of really great people who are joining us.
And is that a certification or just self-training? Is it going to be teacher led, instructor led? Yeah, it's a little bit of both. So we are taking up the whole space of Galvanize, and as you know, it's a big space. So we have the first two floors that are just going to be led by the professionals and the experts in the field talking about various topics in this analytics area. Let's talk about the strategy. We covered the Spark announcement, it was huge. Yeah. Great event. How is that going, and can you compare and contrast or add to that conversation with the streaming conversation? Because now software is all integrated. Rob is teasing about machine learning. You've got streaming, you've got big data with Spark, you've got Hadoop for the storage layer. How is it all coming together? Give us the update. Rob, do you want to give an update on that one? So I'd mention one thing. So we're making a couple of big announcements this week with some new products. One is called BigIntegrate, and the other is BigQuality. And the idea is these are data ingestion engines that really change how you get the right data into Hadoop. And to your point that there hasn't been a lot of evolution in the ecosystem, it's true. We've seen clients struggle with this forever in terms of how they're building out data lakes. And so what we found is most data lakes are turning into data swamps, where there's now a huge corpus of unusable data. And so BigIntegrate and BigQuality are about how you ingest data faster and how you ingest data in a known structure. So you actually know what's there. And it's in a form that's usable. And so I see different pieces starting to come together. You're going to call it big cleanup. It effectively is. If you think about what Splunk has had success with, with the log files, that's catapulted their business. I mean, call it data exhaust, data swamp, whatever. Essentially machine log data is hard to cut through. I like that.
BigIntegrate and BigQuality equals big cleanup. It is though. Because where people are running into trouble along the maturity curve is they build a data lake because they feel like it's the right thing. But then suddenly it becomes unusable, and you hand that to a data scientist that we've trained through something like Datapalooza and they don't even know what to do with it. And so this is about really streamlining how you ingest data, how you work with data. You know, one of the roles we talked about when we did the Spark launch was that not only are there app developers and data scientists, but there are data engineers. These are the people that make sure the data is in the right place at the right time. It's an important role in companies that we see. We always want to get the competitive angle. How do you guys view yourselves vis-a-vis the competition? Again, you briefly mentioned Splunk, and there's a variety of others out there putting out big solutions. What do you guys bring to the table and how do you guys differentiate versus the competition? Well, we're number one in Spark. Nobody's made an investment that even touches what we're doing in Spark. And in terms of the traction that we have with clients, it's enormous. Hadoop is still a critical play for us, but as I said, that's been commoditizing. It's nice to see some other people that have joined the Spark bandwagon recently. SAP made some announcements, which was good. Cloudera made some announcements. But I think we're still the only company that has a holistic view around Spark, which is not just about running on Hadoop. It's about a unifying force to access data. So we think we have a pretty unique point of view. The contributions we've made in open source are unique, and the investment is untouched by anybody. I'd like to get both your takes on a comment I heard from our opening.
If you're not inside the tornado, you're going to be spun out, talking about some of the vendors that aren't getting profitable and/or might not have the right product market fit as the big data analytics and cloud markets are exploding. So the question is, you guys have been kind of interesting. You're big IBM, we've been covering you guys, we know you've got the large scale, you've got a huge customer base. At the same time you've been doing a lot of work in the trenches with developers, the AMPLab you mentioned, with Spark. So you're kind of seeing the ecosystem in the trenches, the developer community. What is that new formula in the ecosystem? What do people need to think about from a vendor standpoint, whether you're a startup, private company, or big guy, to solve the customer's needs? Because the customers are saying, I want solutions now, you guys go too slow. I mean, that's my words, but I'm just paraphrasing what I'm seeing, which is, guys, go faster, pedal faster. Integrated, I don't want to have a religious architectural argument. I need solutions, I need a platform that's going to scale up and scale out. Yeah, it's hard. I mean, I think for people who are selling point solutions or who are focused on a very narrow slice of the overall big data pie, I think it's very challenging. And I think you see a number of vendors partnering up to get traction, right? Because doing it on your own is really challenging. I was at a client just the other day where they were actually working with Spark to make data access a lot better and then complementing that with some of our Watson APIs on Bluemix. And so you see this kind of gap that exists between the data engineering piece and the application building piece. And I think in general, it's really hard to bridge that gap. And I think where our focus is, is making really strong contributions and investments into the development community and the open source community.
But at the same time, continuing to work hand in hand with our installed base, right, with our clients. Rob, I want to get your take specifically on the question of what IBM offers a startup or growing company in terms of a Lego block architecture. Some IP. I mean, we saw at one of the HP events, with the columnar store stuff from Vertica, you have venture backed companies, companies getting venture backed by OEMing that kind of engine. You never would have seen that 10 years ago. Whoa, you can't OEM someone's technology. But in a Lego block architecture, their value shifts on top of that. So that brings up the notion that there is now a collaborative model from a technology perspective. What are you guys offering out to the community that people can build on? So we've been investing for about a year in a new division called Cloud Data Services. And the whole idea is, how do we take all of the data capabilities we have and start to stitch those together as a set of composable services? So any application can be built on top of that and can use a NoSQL data store, or can use Hadoop, or can use Spark, or can use a traditional MPP data warehouse. There aren't many companies that can do that at scale. And we started opening that up to the open source community. We've had over a thousand people come use our Spark service since we launched it. This is called Spark service? Yeah, that's the Spark service on Bluemix. And it's still in beta. It's incredible uptake. But I think the value people see is not only is it a stable service, but you can integrate it with things like an MPP warehouse. You can integrate it with Hadoop. You can integrate it with something like Cloudant. And it really helps you create a fluid data layer underneath your applications, or if you're a company, it creates that fluid data layer for your enterprise.
So that's what we're doing for those types of companies: bringing them a level of scale and capability that they could really never get on their own. Of course you're doing the stuff with open source in Spark. But this teases out the new developer model. I mean, this is not like just write pure code. This is a mix and match. This is an integration game. Yes. Which is an engineering game. Not so much just systems integration, but like real engineering. And we want to take over that integration heavy lifting. I mean, the other piece of the puzzle here is what we're doing with Watson Analytics, as an example. People that use Watson Analytics spend a lot of their time right now just trying to get the data together. And if we can provide this fluid data layer, which we're doing, it makes getting started with something like Watson Analytics that much quicker. So you're looking at it as an enabling platform. Yes. You look at your role and your group as enabling solutions on top of it without lock in. Yes. I mean, some lock in I guess, but it's a hard and top. Well, the idea is these are composable services. So if there's another service that you want to use, which may be from a third party, this is open to anybody, then that's fine with us. But our point is, if you bring it to a place, it's the company that really knows data. Explain that use of composable services. Explain that notion to the folks out there. So think about it this way. Today, or in the traditional world, if you want an MPP database, you've got to go buy hardware, install an MPP database, ETL data in. So what if instead that capability was just available as a service on the cloud? So you don't need hardware. You don't need to provision anything. You don't need to move data. It's just there. It's a callable API and you can use it. It totally changes the nature of how you do analytics.
And so what we're doing with Cloud Data Services is we're saying we're going to take all of that off your plate as the small company or as the client and let you focus on how you're going to get value out of your data. And then you start to bring in things like machine learning and algorithms that they can pick out of a library. It becomes very powerful. Joel, I want to get your perspective, and Rob can chime in too. Andy Jassy from Amazon, I had dinner with him last month at the Linux conference, and he said Redshift is a fast growing service on his cloud. And that's essentially just commoditizing the data warehouse business. And his whole point is that they're lowering it down to a price point that's a huge gap from where the existing incumbents were. So how do you guys see that market for IBM? Because everyone we talk to is like, that is just the low hanging fruit from a disruption standpoint. Are customers looking for a new cloud way to do data warehousing? Is that the easiest path? Are you seeing something similar to that with your business? Yeah, I think what Rob just pointed out is exactly right. I think it's a symptom, it's not necessarily the cause, when I hear comments like that. I think what we're after is really about reaching new communities. And if you look at where the emerging communities and the emerging interest are coming from, areas like IoT are really big. So you're watching, where is the new data coming from? And I think that's a really big place, which is why we've made a huge investment there as well. So I think part of the thing that Amazon did early was getting in with the developer crowd way early on, and that's leading to other adjacent interests. Well, that flexibility, they have a flexibility, I've always loved the flexibility story. That resonates. I love this idea of composable services. That gives you flexibility. But the price points are so ridiculously low. I mean, you talk about MPP database.
It's true, but you also get what you pay for, right? I mean, our data warehouse business is growing very well. And that's because there's a class of workloads that, you know, ParAccel was not a successful company, right? So we all know why that is. So they put up a nice service, but if you think about first class workloads in a retailer, or a bank, or an insurance company, that's not going to Redshift, because you need a level of performance that you can't get there. You need a level of reliability in the system. So there's clearly a market there, and we're participating there. And we're seeing System z with Linux now. I mean, you're seeing a shift. Now, nothing's going away really. At the end of the day, it's all going to be use case driven, right? To me, it's all about the maturity curve, right? If it's as simple as I want to reduce costs for this type of workload, something like, you know, a cloud warehouse that you describe is really good for that. But we're bringing, you know, what we do on the enterprise level, we're bringing that to the cloud. And that's why DashDB is the top performing data. You guys have done a good job with your acquisitions and your product stuff. Congratulations on all that work. I do want to ask a pointed question for the folks out there watching. What is the common thing that you see out there that's a misguided principle, or just a rumor, or just a kind of off the rails concept that people think is actually happening? So in other words, what's the reality of the market? What are a couple of things happening in this world that people are reading about in the press that aren't real? And this market is very hot, we know that. Big data analytics. What are some of the myths that are out there that need to be debunked? I have just one short comment. I think there's a myth that all machine learning is created equal.
I think that, you know, there are a ton of new machine learning entrants coming onto the market that are basically either, A, developing like really low level custom solutions, or they're, you know, selling their version of a machine learning library. I think what we're going to see is a convergence, like we saw in the early days of SQL. And I think to me, that harkens back to what we did with SystemML. That to me is a big myth right now. And people are propagating this myth that our machine learning is better than that person's machine learning. When in fact, we need a standard kind of foundation. So they're bolting on machine learning into their messaging. Kind of saying, hey, we're actually learning. Yeah, exactly. So I see that as kind of a big thing going on. Rob, what's your take on myths out there that the customers are buying into that may or may not be true, or are just kind of directionally not on the right track? I think adoption of new technologies is a bit of a myth right now. So certainly there's a set of companies where technology is being adopted like crazy. Ours is one of those. Then you've got the long tail of companies who claim, we're growing 30, 40%. I mean, they're growing from $1,000 to $1,400. We doubled our market share from 1% to 2%. So adoption is not happening as fast as people think. The reason for that is there's a skills gap in organizations. There's a big skills gap. Most enterprises have a lot of IT skills. The problem is that solving the data problem is not necessarily about IT skills. It's about the data science skills that we've talked about a bit. So until that skills gap closes, which is a big part of why we decided we were going to go educate a million data scientists, adoption is going to be slow. We want adoption to be fast. It's not as fast as the media perceives right now. Well, good luck, guys. Great job. Any quick update on what's happening with Datapalooza coming up?
What are the highlights for you guys this next quarter, next six months? Sure. So Datapalooza is in November, but it's actually a world tour. So then we hit 12 or 13 different cities over the next six months. So that will be big. We've got Spark Summit coming up in Amsterdam. We'll have some interesting stuff to talk about there, as well as at our Insight conference in October. So we've got some big news coming soon. We also have a bunch of stuff going on right here at Strata. So if you happen to be in the New York area for Strata Hadoop World, come on over. We're doing a couple of keynotes. We have a ton of sessions. So come stop by the booth. Final, final question for you, Rob. What's your biggest learning over the past year from your efforts in open source, IBM, analytics? It's certainly been a moving train from a product standpoint. What are the biggest learnings you can glean out of what's happened for you guys? Starting a company is hard, is my learning. So we're in a big company, but what we've done with Spark is we've really built a new company inside of IBM. We've done everything like a standalone company would do in terms of website, hiring, media, you name it. Building a company is hard. I've always heard that, but I've never done it. So it's hard. That's been a big one. I feel like a startup inside a big company is also hard as well. It is, yeah. That's an added dimension. Rob Thomas, Vice President of Product Analytics, and Joel Horowitz, VP of Marketing, here from IBM. We'll be right back with more, live in New York for our special Big Data NYC as part of Strata Hadoop. We'll be right back after this short break.