Live from New York, it's theCUBE, covering Big Data NYC 2015. Brought to you by Hortonworks, IBM, EMC, and Pivotal. Now your hosts, John Furrier and Dave Vellante.

Welcome back everyone. We are here live in New York City for theCUBE, SiliconANGLE's flagship program. We go out to the events and extract the signal from the noise. Behind us, you'll see we are one block, 100 yards, from the Javits Center, where Big Data NYC is happening, and Strata Hadoop is one block away. I'm John Furrier. My co-host, Dave Vellante. Our next guest, Randy Swanberg, distinguished engineer at IBM. Welcome to theCUBE.

Thank you, it's good to be here.

Distinguished engineer, so we're going to get down and dirty. Two things I want to ask you, and one is Spark. Why is this show, which we're calling Spark World at this point, so Spark driven? Is it the focus on analytics? Is it just the sexiness of Spark? Is it because IBM's investing a ton of resources into it? I'm only kidding, I might be the reason, but why is Spark so hot right now?

Well, I think Spark has a lot of attributes that everybody's excited about. One of the biggest right now is this unification of analytics. If you look at, you know, even Jeff Jonas this morning during the keynote, he was talking about analytics in context and how the game is changing. The more that you can bring all of your data sources together, the better your insights are going to be. And Spark is a platform that can do that. It can bring all of the data sources together, and you can do different types of analytics on one platform.

We had Rob Thomas on and Joe Horwitz on yesterday, kicking off the whole day, and obviously we saw when you guys committed to Spark and all the investments, a lot of training going on, you know, whatnot.
But more importantly, IBM's weight is behind Spark, and then recently there's this AI thing going on with Watson, so you kind of see the dots connecting in the IBM ecosystem. On one hand, it's a company that sells software and services to companies. On the other, it's been a player in open source going back to the Linux days. You guys are now balancing both ends in this innovative way. Is this the new IBM? Is this the old IBM? Is this the old IBM that's been modernized? Describe what's going on at IBM, because you're seeing Watson clearly driving down the endless path with machine learning, stream processing, all the stuff that's next in the systems of engagement and systems of intelligence, as George Gilbert at Wikibon points out. But it's pretty clear Watson's going to be at the center of that. How does open source play in, and how do you balance it?

Right. I mean, clearly IBM is embracing open, not just on the software front, but obviously on the hardware front now with OpenPower. I'm sure we'll get into that a little bit. But the key to open source is really about identifying what the disruptive technologies are. What are the technologies that we see changing the industry? Because if we change the industry, what we care about is our clients: how can we help our clients realize the value of disruptive technologies? So you mentioned Linux as an example. Forming the Linux Technology Center a decade and a half ago, or however long that was, is an example of IBM recognizing Linux as a disruptive technology and the openness that it provides. And at the time, Linux was not a tier one platform.

Open source in general was kind of guerrilla, if you will. It wasn't a tier one software platform.

Exactly, it wasn't commonplace in the enterprise. And that's one of the things that IBM brought to Linux: investment in it. We've even upped that investment in the past couple of years, as you've seen around Linux on Power.
So it's really about identifying these open technologies and seeing how we can merge our innovation on top of them. I would say it this way: with the new IBM, it doesn't always have to be invented here. That was the old IBM; it was all invented here. Now it's about collaborating with open communities and creating value on top of open platforms.

So Randy, I wonder if we could talk more about Spark. We just did a survey at Wikibon, and obviously Spark is very popular. I think it was the number two software component running on Hadoop clusters, interestingly. People also suggested that many workloads they would have run on Hadoop, they're going to move toward Spark. So what's Spark all about? What problems is it solving that Hadoop couldn't solve, from your point of view?

Right. I think the biggest thing is that Spark is an in-memory design. The typical problems that you would solve on Hadoop used to be handled with just a MapReduce paradigm. And so with each iteration of your analytic workflow, as you moved from stage to stage, there was a lot of disk I/O involved. The results of the first iteration are written to disk, you read those back in, you move to the next. Performance-wise, any time you have a lot of disk I/O, you're looking for those bottlenecks: how can we alleviate them? So the Spark guys recognized that. And especially for things like machine learning, algorithms are very iterative in nature. If, in between each iteration of my algorithm, I've got to do a bunch of disk I/O, I'm just killing myself. So really the fundamental value is around the in-memory design. Spark can create these pipelines of work that you want to do on one copy of data while it's in memory. And that's why they're seeing these speed-ups of 10x, 100x compared to doing the same work on Hadoop.

The best I/O is no I/O, as Gene Amdahl would say. The system guys are having a party now. It's all coming back to them. Okay, so where does the constraint go now?
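The difference Randy describes, writing intermediate results to disk between every stage versus keeping the working set in memory, can be sketched in plain Python. This is a toy model, not Spark code; the stage function, file format, and sizes are invented for illustration.

```python
import json
import tempfile
from pathlib import Path

data = list(range(100_000))

def step(values):
    # One iteration of a toy "analytic" stage.
    return [v * 2 + 1 for v in values]

def iterate_via_disk(values, iterations, workdir):
    # MapReduce-style: persist results to disk after every stage,
    # then read them back in before the next one.
    for i in range(iterations):
        values = step(values)
        path = Path(workdir) / f"stage_{i}.json"
        path.write_text(json.dumps(values))    # write results out ...
        values = json.loads(path.read_text())  # ... and read them back in
    return values

def iterate_in_memory(values, iterations):
    # Spark-style: the working set stays in memory across iterations.
    for _ in range(iterations):
        values = step(values)
    return values

with tempfile.TemporaryDirectory() as d:
    disk_result = iterate_via_disk(data, 5, d)

mem_result = iterate_in_memory(data, 5)
assert disk_result == mem_result  # same answer, very different I/O cost
```

The two paths produce identical results; the in-memory version just skips the serialize/write/read cycle on every iteration, which is where the quoted 10x-100x speed-ups come from for iterative workloads.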
You have the horrible storage stack that you're essentially eliminating with Spark. So now all the attention shifts; we're talking about microseconds, nanoseconds, a whole new ball game. Talk about system design from that perspective. What's changing?

So the interesting thing about the system design, and you said it: most people are experimenting with Spark now on their Hadoop clusters. If we were to plot the evolution, or the maturity, of what I would call big data, folks started out by just saying, how can I store all of this data more economically? Low-cost storage. If I use commodity hardware and I distribute my data using Hadoop's HDFS, it's low-cost storage. That was kind of the first entry. From there it evolved into: now I can actually offload the data warehouse workloads that I used to pay exorbitant prices for in enterprise data warehouse solutions. Now I've got all this data stored cheaply, and if I just add some data warehousing techniques, the data lakes were introduced, things like that. And then there's the next phase, and I believe we're somewhere between that and starting this next phase, which is trying to get more value out of that data. With old analytics, I couldn't afford to process the full data set; I had to work with a sample of whatever my data was. But now, with improvements in system design and in technologies like Spark that improve the performance, we can use the full data set. We can aggregate more and more data so that the insights get better. And the problem is, the existing designs that supported that original vision weren't really designed for this new world, where the problem shifts from disk I/O to compute and memory. Because now disk is not my problem. Now the work has enabled me to do more complex analytics. As you said, I can do real-time Twitter streaming. I can grab that, put it into a table format so I can query it. I can do machine learning on it.
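The sample-versus-full-data point above can be made concrete with a small stdlib-only sketch: a mean estimated from a 1,000-row sample carries sampling error, while a scan of the full data set does not. The distribution parameters and sizes here are arbitrary, chosen purely for illustration.

```python
import random
import statistics

random.seed(42)

# A synthetic "full data set": 200,000 observations.
population = [random.gauss(100.0, 15.0) for _ in range(200_000)]
true_mean = statistics.fmean(population)

# Old approach: analytics on a small sample, because full scans were too costly.
sample = random.sample(population, 1_000)
sample_mean = statistics.fmean(sample)

# New approach: cheap storage plus fast compute lets us scan every record.
full_mean = statistics.fmean(population)

sample_error = abs(sample_mean - true_mean)
full_error = abs(full_mean - true_mean)
assert full_error == 0.0  # using all the data removes sampling error entirely
```

The trade-off was never statistical preference; it was cost. Once the full scan is affordable, the sample, and its error, become unnecessary.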
So the more complex that is, the problem shifts from how much disk I have and what it costs me, to how much compute and memory I have in my cluster.

So we're seeing everyone tapping into the Spark goodness, if you will, and awesomeness; many vendors see the value it's creating for them. IBM has a lot of things: Power8, OpenPower, ODPi, as it's now called, and the cloud. So cloud is a key part of that. So Power8, OpenPower, accelerating that momentum. That seems to be a value equation for you guys. So I want you to explain to the folks out there: what is the Power8 value proposition? Let's start there. Get that out there, and then we can drill down into where that value is being created with the role of Spark.

Okay. So many have probably heard the Power8 marketing tagline, designed for big data. And really, that is at the core of the Power8 value proposition, because, being an engineer, it's not all about marketing taglines. There's actually meat underneath Power8 being designed for big data. Not to get too detailed here, but just to give some meat behind that: every core of a Power8 has eight hardware threads. That's four times the number of hardware threads you get on other platforms, for these workloads that need compute density. Now I'll give you an example: Spark SQL. We've done a lot of analysis of Spark on the Power platform. We've been tuning, and we've been learning how Spark behaves under all of these different workloads. Just to give you an example with Spark SQL: the more threads you can give to Spark so it can parallelize queries across rows, the faster it goes. I mean, we're seeing 3x performance on Spark SQL once we feed it all of these Power8 cores. The other thing, beyond thread density, is that you have to have memory bandwidth. Power8 has huge memory bandwidth; depending on the model and chip you get, we're talking 200 gigabytes per second.
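The "parallelize queries across rows" idea, partitioning a table and running the same query fragment on each partition concurrently, can be sketched with Python's standard thread pool. This is only a structural illustration: the table, query, and partitioning scheme are made up, and CPython's GIL means these threads demonstrate the partitioning pattern rather than a real speedup; engines like Spark run the fragments genuinely in parallel across hardware threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy "table": rows of (region, amount).
rows = [("east" if i % 2 else "west", i % 100) for i in range(200_000)]

def partial_sum(partition):
    # One query fragment: filter + aggregate over a single partition.
    return sum(amount for region, amount in partition if region == "east")

def query(table, num_threads):
    # Split the rows into roughly one partition per thread,
    # run the fragment on each, then combine the partial results.
    chunk = len(table) // num_threads or 1
    partitions = [table[i:i + chunk] for i in range(0, len(table), chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return sum(pool.map(partial_sum, partitions))

assert query(rows, 1) == query(rows, 8)  # more threads, same answer
```

The point of the pattern is that the answer is independent of the thread count; only the elapsed time changes, which is why hardware thread density (eight per Power8 core in Randy's example) translates directly into query throughput.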
That's necessary to feed this big data to all these threads.

So George Gilbert and the Wikibon team, led by Dave and George, are talking about the systems of intelligence. This next era, you know, Bob Picciano always talks about systems of record or systems of engagement. Now we're in systems of intelligence, which is cognitive computing. This is the moonshot for IBM. This is kind of where you guys are going. So I've got to ask you: we had a comment on theCUBE yesterday where someone said some data may never hit a database. You're essentially getting into this thing: multiple threads, multiple cores. With Spark, there's this notion that this machine learning, aka algorithmic, software environment might have nothing to do with a database. It might be all about flow theory and/or streaming and time-series processing, all that stuff. Am I connecting the dots right? Can you elaborate on that concept?

You are. In fact, it gets back to that unique ability with Spark: I don't have to write data back out to anything if I don't want to in order to take the next step with it. I can create pipelines, and I may end up at the end of a complex pipeline flow with an answer, and along the way I accumulated a bunch of things I never wrote anywhere. So it is that. Again, Power8, with the memory bandwidth and also the large caches, we're talking 5x the cache per core compared to other platforms, you take these big data problems, especially these iterations on the same data, keep that all in cache, and you can make these insights scream.

So Randy, you're not from the analytics group, but I'm going to ask anyway, because people ask me all the time about systems of intelligence, which is kind of our language, which we stole from Geoffrey Moore but evolved, I think, well beyond his sort of concept of systems of intelligence.
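Randy's "answer at the end of a pipeline, with nothing written anywhere along the way" maps naturally onto chained generators in plain Python: each stage consumes the previous stage's stream, and only the final reduction materializes a value. The stages and record format below are hypothetical, a sketch of the shape rather than Spark's API.

```python
def parse(lines):
    # Stage 1: parse raw records; nothing is persisted anywhere.
    for line in lines:
        user, value = line.split(",")
        yield user, int(value)

def enrich(records):
    # Stage 2: derive a feature on the fly.
    for user, value in records:
        yield user, value, value > 50

def score(records):
    # Stage 3: reduce the whole stream to one final answer.
    flagged = total = 0
    for _user, _value, is_high in records:
        total += 1
        flagged += is_high
    return flagged / total

stream = ["alice,72", "bob,13", "carol,91", "dave,44"]
answer = score(enrich(parse(stream)))  # end-to-end, no intermediate storage
```

Here `answer` is 0.5: two of the four records were flagged, and the intermediate parsed and enriched tuples existed only in memory, one record at a time, exactly the "some data may never hit a database" scenario.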
People say to me, well, it's different than what you're saying and what IBM calls systems of insight. And the response I get from my team is, well, we're talking about bringing analytics and transaction systems together, taking systems of record and evolving them into systems of intelligence. Is that consistent with your view of what you call systems of insight, or is it different?

I think it's the same, to the degree that systems of insight want access to all the data. They want access to the historical data, what 10 years ago would have been just transactional database data that made its way into the warehouse. So it's the blending of all of these things. You want your analytic systems to have access to the data from systems of record, and that's why you see connectors being developed for things like Spark to access any data source. So I think it's fundamental to the future of systems of insight.

And just a follow-up question. You were talking about the performance on Power8: relative to what? Is it 3x the performance relative to previous generations of Power, or relative to x86?

Relative to x86.

Okay, specifically. That's what you're benching on.

Those are internal benchmarks that IBM has evolved. We've got some public blog entries out there that talk about our performance.

So what about ODP? IBM's part of ODP. ODP comprises, you know, certainly some Hadoop vendors, not the least of which is Hortonworks. What is the relationship between Spark and ODP? If you can speak for ODP; I don't know if you can, but you're a portion of the IBM representation.

So ODP, again, kind of embodies this open platform, right? The open platform that, similar to IBM's desire to embrace Linux, our desire to embrace OpenStack, our desire to embrace a number of open source initiatives, again, is where we see they create a foundation. They create a foundation for subsequent innovation.
And so ODP is really about standardizing, if you will, a collection of components that make up the Hadoop ecosystem. Because before, and I don't know if anybody's tried to do this on their own, but if you're trying to roll your own Hadoop, there are 50 to 100 different separate projects out there. You might pull ZooKeeper from here, HBase from there, HDFS from somewhere else, and you might be the first one putting those specific versions together for the first time. So the value of ODP is, essentially, we're going to create a collection of these open source components that make up the Hadoop ecosystem, Spark being one of those components, and test them all together. Really, the focus of ODP is around testing and certification, so that the members of ODP don't have to recreate that effort on their own. They know that they have a base. It's really about increasing the adoption of Hadoop, when you think about it, because the more you have standards... You know, I'm an old UNIX guy, right? If you go back to the UNIX days of all the standards, POSIX and UNIX 95 and 98, the point of a Sun or an IBM or an HP embracing those standards was really about vendors creating applications that could move between them. So to the degree that we can standardize the big data platform, it's going to create freedom of movement, and it's going to create adoption.

And of course, that didn't happen to the degree we had hoped, and then along comes Linux.

Yeah, right.

But somebody said the other day on theCUBE that we're seeing sort of a replay of that in the Hadoop ecosystem. Is that a fair comment, do you think?

I think it is. I mean, I think it kind of follows that paradigm.
And, you know, when you said that, I thought of something else that kind of ties in. Beth Smith, who's the general manager of analytics, has coined this phrase that Spark is the analytic operating system. Years ago, so much focus was at the operating system level: is this operating system better than that one? And then virtualization, of course, Z did it in the '60s, but virtualization hit the UNIX industry, and all of a sudden it was about hypervisors and virtualization. Well, now all of those things are almost in the noise when you're talking about a big data platform, because Spark is the new operating system. So it's very analogous to some of these things.

And combining that with Hadoop, you're talking about standardization. We're talking about portability and ease of use. It's certainly an exciting time for software developers, DevOps, cloud, and obviously people writing apps at the edge of the network. So this is a huge opportunity. Practitioners want more signal, less noise. That seems to be the trend. So I've got to ask you one final question to end the segment: what is your take on the vibe of this week at Big Data NYC? The show is in its sixth or seventh year, six years with theCUBE. We've seen it all. We kind of have our own opinion, but there's a maturation going on. What's your take? What is the main message you see coming out of this week?

For me, I'll be honest, I'm just kind of blown away by the new things and forms of analytics that are possible now. Basically, no industry is going to be left untouched when it comes to analytics. Everything from critical things like healthcare, with respect to the improvements we can make in genomics or in personalized patient care.
And then on the other end of the spectrum, you've got, you know, how does analytics make the sporting industry easier to watch? To me, it's the tip of the iceberg; we're just unlocking all sorts of things.

It's the old Irish expression: belly up to the bar with solutions or you're not going to be there. It's no more hype; it's time to get real. This is go time, right?

And now, you know, for IBM, we need our customers to realize the value from this innovation explosion that's happening.

Put your money in the barrel, put your solution out on the table. That's what it is.

To me, that's... I 100% agree. This is game time for people with solutions, because customers want to go faster. They want analytics.

It's an analytics game. We are here on theCUBE getting all the data and sharing it with you here at Big Data NYC. We'll be right back. Live from New York City, where all the action's happening, live in the moment, we're going to bring it to you. Keep on going. Go to siliconangle.com for all the action, Wikibon.com for the research, and siliconangle.tv for the video coverage. We'll be right back, and stay tuned all night. We have extended coverage this evening: a special presentation, new research we're unveiling, and then, you know, party at seven. So we...

Big Data after dark. And we've got the unicorns coming. Everyone, we'll be right back after this short break.