Live from Galvanize, San Francisco, extracting the signal from the noise, it's the Cube, covering the Apache Spark community event. Brought to you by IBM. Now your hosts, John Furrier and George Gilbert. Hello, everyone. Welcome back to another installment of the Cube. We're live in San Francisco for a special Cube presentation for the IBM Spark community event, here at the Galvanize workspace, with start-ups all around. Basically it's a developer environment. It's really massive. I'm John Furrier, founder of SiliconANGLE. I'm joined by my co-host, George Gilbert, big data analyst at wikibon.com, and James Kobielus with IBM. James is the senior program director of product marketing in the big data space, and he's been on the Cube many times. Welcome, guys, to the kickoff of this wall-to-wall, full-day coverage of the Spark Summit and the IBM community event that's happening here tonight in San Francisco. Let's jump right into it. James, I've got to say, I was tweeting last night that IBM is upping their game, and I kind of did that tongue-in-cheek. It's a power move with a capital P, because power is kind of a... There's a power story in here, but this is... You have a product called Power Systems, but keep going. Yeah, and all the analytics and the in-memory stuff. So there are really some major trends happening here. IBM is announcing global impact news today, a huge announcement behind Spark, a big trend obviously with the analytics that you guys have been executing on. What does this IBM move mean here in San Francisco? Spark Summit is a growing geek community of developers in big data, so it's certainly got implications. IBM is coming in with a global endorsement. What does this mean? Well, what it means, first of all, is that IBM recognizes that Apache Spark, which is an open source project, has in many ways achieved great traction and uptake for some very important and increasingly important big data analytics applications and use cases.
Really it's all the use cases focused on in-memory analytics, streaming analytics, graph analytics, and doing machine learning with massive parallelism inside far more memory-centric clusters of computing, to enable fast analytics and machine learning for things like the Internet of Things and stream computing applications of a real-time, low-latency nature. So IBM recognizes this. We've been watching the Spark community grow and grow in terms of uptake, in terms of the number of committers, in terms of the number of applications that have been built, in terms of the range of startups and the VC money flowing into the Spark industry as it's developing. It's a very immature industry, but so was Hadoop five years ago. Look where Hadoop is today. So we recognize that Spark actually addresses a lot of limitations we've all recognized with, I'm going to say, traditional Hadoop, which sounds like a funny phrase to use for something that's as young as Hadoop, but Hadoop MapReduce in particular is far more geared to batch processing and batch analytics than to true real-time. Spark, by the way, is a part of the Hadoop distro. That's what confuses some people. It's not like it's a standalone capability. It's a sub-project that's been part of the Apache Hadoop distro for, I think, about a year or two now. I forget exactly when it was open-sourced. So the Spark community has developed within the Hadoop community. And this week here in San Francisco, I've just come from the Spark Summit down at the Hilton Union Square. In many ways, it's a coming-out party for the Spark industry, which is growing very rapidly. If you look at the range of vendors, including, of course, IBM, we made a number of announcements today. I'll get to those in a moment. So in many ways, IBM is showing some... Well, first of all, showing the industry the level of commitment we've already built up internally. We have a lot of developers who... In fact, we just completed a...
what we call the Hack Spark Challenge internally, where dozens upon dozens of our own data scientists and developers worked for almost a month, and today we'll be announcing the winners of the Hack Spark Challenge. We also sponsored a hackathon here this past weekend at this facility, Galvanize. There were a lot of great submissions, and they presented, and we're also going to announce the winners of that. What I'm getting at is that there's a lot of excitement in the developer community, the data scientist component especially, around Spark. What we want to do is show the industry that IBM understands that Spark is important to the future of big data development, and to show that we are committed to a number of things. We have open sourced SystemML, our machine learning library, to the Spark community under an open source license. Today we've announced that. We've also announced that we've established a development center, a center of excellence for our customers, our partners, and IBM to develop Spark applications. The Spark Technology Center is at the IBM office here in downtown San Francisco. That was announced today. We are also very soon going to be launching a beta of a Spark-as-a-service offering on Bluemix. That's a very important announcement for today. We're also announcing that we're very much deeply committed to the open source Spark community, and we have a range of partnerships that we've announced today. Databricks is one of them. In fact, Databricks is the primary mover. At least they have the core team that developed and invented Spark and are commercializing it. And we've, of course, announced our deepening commitment to the open source process. Spark 1.4 was released this last weekend, remember it's open source, and it incorporates code that was developed, submitted, and contributed by IBM, especially SparkR.
So you can now develop Spark applications using R as well as Scala and so forth. And also we've announced that IBM is committed to integrating Spark into the wide range of our analytics and commerce offerings and also into Watson. We have the Watson Health Cloud with Spark inside. We've announced that as well. What I'm getting at is that IBM has committed resources and developed and declared a roadmap for bringing Spark more completely into our portfolio. Big resources, it's not like little resources. It's a big move for IBM. From the top on down, from the executive level on down, it's a strong ongoing commitment. That's not brand new. We've clearly been working on these things for a couple of years. This is essentially the coming out, to show the industry what we've been working on and how important it is to IBM and, we know, to our customers going forward. This is a huge move. I was tweeting again last night, this is a big, big move. It's worth unpacking, looking under the hood, behind the curtain, whatever you want to use as a metaphor. George, I want to get your opinion. Spark is agile and easy to use, and it's grown over the past year in terms of commitment from developers. So let's take a step back. What is Spark? Spark was born in 2009 out of UC Berkeley, where the Unix revolution, Linux, I mean systems, I mean Berkeley has changed the game in many generations of tech. So again, another historic Berkeley connection. UC Berkeley. IBM has been a partner with AMPLab. One of the founding members of AMPLab. So it's not like you're a Johnny-come-lately. You guys have been there from the beginning. And the Databricks guys, by the way, came out of AMPLab, Matei Zaharia and whatnot, to found Databricks. And one of their board members, Pete Sonsini at NEA, invested in them. He's a Berkeley, he's a Cal alum. So a Cal connection here comes back, reminds me of the old days of systems, operating systems with Linux. So George, what is Spark? I mean, this is action. Everyone's talking about Spark.
What is Spark? Why is it important? What is going on? Because Hadoop was the big thing. We're in the trough of disillusionment now with Hadoop, as Gartner is talking about. But now Spark is ramping up with attention. Why all the focus on Spark? Well, I want to key off something Jim said earlier, where it's not a replacement for Hadoop, but an augmentation. And since it does ship with all the major Hadoop distros, you might say it's a potential substitute for some components of Hadoop, just the way Hadoop has so many components itself that can be used in other cases. It's a complement, not a substitute for Hadoop. It's a complement. I mean, the Spark runtime engines like Spark SQL are a complement to Hive on the queries, where I was going to say, and they're also a complement to MapReduce and so forth in terms of a distributed execution engine and programming framework. So it's not an alternative. It's a complement, an augmentation. For many users it'll rest on the foundation of what's come to be regarded as Hadoop 2.0, which is YARN and HDFS. And in that respect, it's certainly very much part of the Hadoop ecosystem, but the one thing that I want to key off that you had said earlier: Spark is one unified engine that has personalities to do the old batch processing that MapReduce might have done and that now is taken care of perhaps by other workloads. It's actually several engines, not just one. But the personalities then also include SQL, machine learning, streaming, graph processing. But maybe you can give us a use case where you might want to use that in an application that needs real-time intelligence at the time of a customer interaction, where that would be a little more difficult to do with some of the Hadoop components above the level of HDFS and YARN. Yeah, let's say the Internet of Things.
Let's say your customers are using wearables and those wearables are sending real-time, say, geo-coordinates: lat-long, elevation, street address, whatever it might be. That's machine data that's flooding in in real time as behavioral data. That and other behavioral data that may be produced and sourced from, say, those wearables is used by you as a merchant, let's say, in an in-store environment to tune, in real time, the experience of the customer on, say, their Apple Watch or whatever it might be, to wherever the customer happens to be in the store, what aisle they're in, what they're looking at, and so forth, depending on the kinds of behavioral data sourced from that wearable and other apps that, let's say, the customer has on their iPhone. And so what that involves in a Spark context is, A, you have streaming data, real-time. Spark Streaming is for close to real-time data. Basically, it's mini-batching. It's not exactly continuous streaming, a la IBM InfoSphere Streams, but it's close. But it's also behavioral data that you need to do graph analysis on to find the patterns relevant to, for example, individuals' relationships to each other in social networks, like families and friends in and around the store and so forth. You, meaning the merchant, need to do queries in real time using Spark SQL to be able to look at the data being fed in, to process it through machine learning, and to query the patterns in order to do predictive analysis of what the customer might respond to in terms of real-time offers presented in the context of, let's say, some wearable application, so that you can then tune an offer in real time to where they're at now or where they may be in the next five minutes and so forth.
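The mini-batching idea described above can be illustrated with a minimal sketch. This is plain Python rather than the Spark Streaming API; the function name `micro_batches`, the one-second interval, and the sample aisle readings are all hypothetical and chosen only for illustration. The point it shows is that events are not handled one at a time as they arrive, but are grouped into small fixed-width time windows that are then processed as tiny batches.

```python
from collections import defaultdict


def micro_batches(events, interval_seconds):
    """Group (timestamp, payload) events into fixed-width time windows.

    Illustrative only: a real streaming engine would do this grouping
    continuously as data arrives, not over a finished list.
    """
    batches = defaultdict(list)
    for timestamp, payload in events:
        # Each event falls into the window that starts at the nearest
        # lower multiple of the batch interval.
        window_start = int(timestamp // interval_seconds) * interval_seconds
        batches[window_start].append(payload)
    # Return the windows in time order so they can be processed one by one.
    return [(start, batches[start]) for start in sorted(batches)]


# Hypothetical wearable readings: (seconds since start, aisle the customer is in)
readings = [
    (0.2, "aisle-1"),
    (0.9, "aisle-1"),
    (1.4, "aisle-2"),
    (2.1, "aisle-2"),
    (2.8, "aisle-3"),
]

for window_start, payloads in micro_batches(readings, interval_seconds=1):
    print(window_start, payloads)
```

With a one-second interval, the five readings fall into three small batches. A shorter interval gets closer to continuous streaming at the cost of more, smaller batches, which is the trade-off behind "close to real-time" in the discussion above.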
What I'm getting at is that these are the kinds of applications for, say, e-commerce, where there's a new generation of endpoints, meaning mobile endpoints, wearable endpoints and so forth, that are generating all manner of machine data that you need machine learning to be able to find the patterns in. That's where the various capabilities of the various Spark engines come into play, meaning Spark is a unified architecture for bringing all that data in together, finding the patterns in real time, and doing predictive analysis so that you can then serve the customer better. That's, you know, in many ways one of the showcase kinds of applications for Spark. Okay, so that's the level of integration in real time, where each piece can call on the other piece in sort of this pipeline, this real-time pipeline. How would one go about building something similar with the traditional Hadoop components? Well, as an alternative, you might use, say, InfoSphere Streams for the streaming component, and you might use MapReduce and Hadoop for processing the historical data related to customer profiling and so forth. And then in terms of machine data, you might use a machine data accelerator, such as the one we provide on our Hadoop distribution, BigInsights. There's a lot of componentry in the Hadoop ecosystem pre-Spark that one might use for that kind of application, as well as other capabilities, like other streaming environments. It doesn't have to be IBM. You can use, you know, Storm or Samza, some other real-time streaming environment, in a composable framework to do those kinds of applications. James, this brings up a point that I want to drill down on with you, which is the announcement. The hard news is, obviously, the commitment to Spark and the industry. So it's not like an IBM land grab. You guys are coming into the community. Far from it. No, you guys are really doing a good job, like you did with Linux.
So I've got to give you props for that. But you've got IBM Cloud, which is emerging. Some say you need a lot more. We're not announcing any proprietary closed source or anything. Hey, we're open source folks. We're committed. Go ahead, James. Yeah, and again, the rising tide. SparkR, you know, we're behind that. Totally open source, but it helps float up the IBM Cloud. That's the other thing that's going to help the cloud group. So the cloud group gets a benefit from that with the Bluemix service. But you guys are integrating this across the company. So I want you to talk about that impact, because you brought up that you can use any vendor. You guys are going to support Spark, but then also bring in your applications. This kind of talks about the need in the market around apps. So obviously there's the trend of the Internet of Things, all the plumbing we talked about. But at the end of the day, it's developers. What does it mean for them? First of all, it means training. Spark is unfamiliar to many developers, many data scientists. The majority of, say, the Hadoop community is still getting their heads wrapped around it. That's why, for example, at Hadoop Summit last week, and I was there, many of us were there, a lot of the program was about Spark. You know, it's not like there are clean separations between Hadoop and Spark. The reason there's all that content at Hadoop Summit is that the Hadoop community needs to know more, because more and more Spark tools are coming into the kinds of apps that they're building. We are doing that kind of education internally and with our customers and partners going forward. Part of our announcements today, excuse me, is that we, IBM, are investing in education and training on Spark for our worldwide ecosystem of partners, as well as our customers, as well as our own developers, to get them up to speed in all things Spark.
BigDataUniversity.com has a really strong online MOOC for Spark that we are pointing our developers towards to get them up to speed on that. But also, we are very much activating our network of partners, like Galvanize, to do the training. Galvanize is one of many training partners that we are bringing into this environment. Yeah, this is an environment. Are there a zillion places like this that are developer focused? Because that's fundamental. You've got to do the training first and foremost to get the app developers to understand what it is, what they can do with it, how it relates to the Hadoop tools, MapReduce tools and so forth they've been using, how it relates to our distro and Cloudera's and MapR's. A lot of moving parts. We all have Spark, by the way, in our distros because it's part of Apache Hadoop. And there were some really good announcements from some of the competitors I mentioned this morning at the Spark Summit. Because we're all educating our respective customers at the same time, we're all emphasizing training. Well, let's drill down on what you said earlier. The developers were salivating over Spark at Hadoop Summit. Those are my words. It was exciting. I'm saying they were salivating. Why are they excited? Because, one, it's a brand new processing engine. It's like, why not have a turbo engine or some sort of new innovation that's going to make life great? Two, it's teasing out what's really going on in the data science world. So machine learning is an underpinning of this announcement, and you guys are donating a lot of that machine learning piece. So you've got the machine learning component. You have this new engine, where developers are like, finally, I need more horsepower. It's like Star Trek. Scotty, give me more power in the engine room. So Spark brings that. But it's still early. So do you agree with that statement, and would you add anything to that?
Because that's why the Hadoop guys are like, yeah, I just need to go faster. So it's not so much replacing Hadoop. Talk about the dynamic of the engine and then the impact on data science. The dynamic of the engine. Well, there are engines. Well, processing engines you can spread around. Yeah. So, first of all, Spark is built on massively parallel, I mean, Spark is geared towards massively parallel, iterative modeling of data for real-time exploration and tuning of big data models for graph analytics and so forth. So what gets everybody jazzed is the fact that this is fast by its very nature. It operates in memory. With Spark, you don't have to write the intermediate results back to disk. It stays in memory. So it's very much feeding on the growth of in-memory platforms like BLU Acceleration and so forth. So the excitement around all in-memory platforms is very much fueling the excitement for Spark. The need for speed in all applications, real-time low-latency speed, is fueling the excitement around Spark. The excitement around machine learning, all things deep learning and so forth, to be able to find patterns in video streams and audio and so forth, behavioral streams, in real time is fueling the excitement around Spark. These are the new frontiers in data science generally. So we have the machine learning library. It's the IP that we've built up over several years, and it's a key productivity accelerator for data scientists. We recognize that in order to really build the Spark market, and really we are building a market... You're accelerating it with the tech you're donating. 3,500 researchers are working on this. So in other words, we're open sourcing SystemML. It's available to anybody, including our competitors, customers and partners, because we all recognize that we all need to accelerate the development of this marketplace towards maturity. It's not a mature market.
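The point above about iterative modeling and keeping intermediate results in memory can be made concrete with a small sketch. It is plain Python, not Spark code, and everything in it is a stated assumption for illustration: `load_from_disk` is a hypothetical stand-in for an expensive read that simply counts how often it is called, and `fit_mean` is a toy iterative model (gradient descent toward the mean of four numbers). The sketch compares re-reading the data on every iteration with reading it once and reusing it from memory.

```python
def load_from_disk(counter):
    """Hypothetical stand-in for an expensive read; counts each call."""
    counter["reads"] += 1
    return [2.0, 4.0, 6.0, 8.0]


def fit_mean(iterations, keep_in_memory):
    """Toy iterative model: gradient descent toward the mean of the data.

    Returns the final estimate and how many times the data was read.
    """
    counter = {"reads": 0}
    # In the in-memory case the data is read once, before the loop starts.
    cached = load_from_disk(counter) if keep_in_memory else None
    estimate = 0.0
    for _ in range(iterations):
        # Without caching, every pass goes back to the slow source.
        data = cached if keep_in_memory else load_from_disk(counter)
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= 0.5 * gradient
    return estimate, counter["reads"]


in_memory_estimate, in_memory_reads = fit_mean(10, keep_in_memory=True)
reload_estimate, reload_reads = fit_mean(10, keep_in_memory=False)
print(in_memory_reads, reload_reads)
```

Both runs reach the same estimate, but the in-memory run touches the slow source once while the other run touches it on every iteration. That difference in the number of reads, multiplied across many iterations over large data, is the kind of saving the speaker is pointing to.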
We all recognize that a lot has to happen for Spark to become a mature market on the solutions and the tools side. A lot still needs to be put together, above and beyond what's already in Hadoop. Hadoop actually is pretty much mature now if you look at it. Spark will be, I predict, where Hadoop is now within the next two to three years, because I think Spark will ramp up the maturity curve faster, running on the elephant's shoulders. And you guys at the end of the day are going to educate over a million data scientists through the partnerships with AMPLab and DataCamp. I'm going to call that the million developer march. You're going to call it that. We're going to call it that here in the Cube. George, big time stuff here. We're going to wrap up the segment here. We're going to do a drill-down. We're in San Francisco for the community event, the IBM Spark community event, Spark Insight. Live in San Francisco at the Galvanize workspace, where all the developers are, startups, it's an incubator. At Spark Summit, we'll be dissecting all the news, looking at all the angles. We're going to talk to folks from Berkeley, some experts from IBM. We're also going to get the lay of the land and what this IBM move really means. Stay tuned here at siliconangle.tv. All day, we're going until nine o'clock at night. We will go, we'll do whatever it takes to get the story about what's going on with Spark and IBM's role in this evolving ecosystem. So right after I leave the Cube, Beth Smith will be on. Beth was one of the keynoters this morning at Spark Summit. So Beth can give you a fuller picture of the extent of IBM's commitment from the top down. We'll have Joel Horwitz, who is our chief of marketing for emerging markets, and Joel, in many ways, is the prime mover behind this. Who was in The New York Times, by the way. Did you see his photo in The New York Times? I haven't looked at it this morning. Was he the instigator, like the sponsor? I think Rob and Beth, right?
And Bob kind of driving this thing? Yes, and Joel and his team. They'll be on. Harriet Fryman will be on. I think a lot of people are familiar with Harriet from long back. We're going to have Mike Tamir from Galvanize. And then later we'll have Shankar Venkataramanan. Sorry, Shankar, if I'm getting your name wrong. He's my colleague who's in the area of Hadoop. Shankar can really then tie this in directly to the Hadoop market, as well as to our BigInsights product, which by the way includes Spark support already. Then AMPLab's Fernando Perez will be on later this afternoon. Robert Parkin, one of IBM's crack data scientists, will be on to discuss what he's doing with Spark. Then later on we'll have a town hall panel downstairs here at Galvanize. We're going to have dignitaries: Edd Dumbill of Silicon Valley Data Science will be chairing it. And Shankar will be on. So will Mike, and Paco Nathan of Databricks will be on, as well as Fernando Perez and someone from Tachyon Nexus, Haoyuan Li. David Townsend will be on a little bit later. He'll talk about the Spark Technology Center we've established downtown here. Then we'll have Rob Thomas. Rob was going to go on earlier, but Rob will be on at 6:30 now tonight. And Rob will, as always, give his very insightful point of view on the guts of Spark and where IBM is going on the technical side in terms of building out Spark into our solution portfolio. And then towards the end of the evening we'll have lightning talks from a wide range of partners and customers, all around what they're doing with Spark. It'll be exciting. That's what I'm really looking forward to. We're going to hear real, tangible application discussions. And then I think Paco Nathan from Databricks comes on to close tonight, and Paco is always interesting to listen to. So we've got a packed schedule of really good programming here. Interviews with the Cube as well as the live streaming panel. I think it'll be a sensational day.
And if you follow the @IBMbigdata Twitter handle all day long you'll get a straight feed of what's being discussed. I know that we'll have a continuous feed of good stuff on that handle because I'll be the guy tweeting behind the scenes. See, I'm like the Wizard of Oz. I'm just pulling the strings here. It's a great partnership and collaboration with IBM, this special Cube presentation. Again, Spark Summit is happening at the Hilton in San Francisco. We are here at Galvanize, the workspace. We're going to bring you all the Spark insight here on the Cube and of course coverage of Spark Summit. And join the conversation. Go to crowdchat.net slash Spark Summit. Join the conversation. Use the hashtag Spark Insight in there. We're happy to answer your questions. The Cube will be back with more. Beth Smith is coming up. All the top executives and experts are here to unpack the big IBM news here in San Francisco. We'll be right back after this short break.