from Galvanize, San Francisco. Extracting signal from the noise. It's theCUBE covering the Apache Spark community event brought to you by IBM. Now your hosts, John Furrier and George Gilbert. Okay, welcome back everyone. This is theCUBE's special presentation along with IBM and the Spark Summit community going on here. Live in San Francisco, this is theCUBE's special IBM Spark community event, here at the Galvanize incubator workspace, a lot of developer space here, all the people programming. I'm John Furrier, with my co-host George Gilbert, big data analyst at Wikibon.org. Big special thing happening in San Francisco this week around Spark Summit. IBM's big announcement around their commitment. Billions of dollars, that's my words, there's no official announcement, billions is just kind of connecting the dots, but a lot of investment in a technology called Spark. Here to explain it is Beth Smith, general manager of the analytics platform. Welcome to theCUBE special presentation of the Spark community event here in San Francisco. So I gotta ask you, explain to us, I mean we know why, but I want the audience to see that. Spark is this little thing that's kind of developing. It's out of Berkeley, and that's where the operating system world started in the 80s, started a whole systems movement around Unix and Linux. Again, Berkeley is the center point of all the action. In comes IBM and just creates this huge floodlight of attention onto Spark. Big commitment from the top down, senior management, on this Spark technology. It's around big data. Explain what this is and why it's so huge. The world's responding, so everyone wants to know: why now, what's this all about? So I think it actually starts with the fact that the decade, and maybe even we would say the century, is defined by data and analytics and the insight from it. That's why we call it the insight economy. Spark is the analytics operating system for that, to just unlock that value.
Why Spark now? I mean, some people say it's not ready, but everyone that I talk to certainly is salivating over getting their hands on Spark, making it go faster. Certainly at Hadoop Summit last week, we were live all week and those sessions were packed. Is it like a new engine, where it just speeds things up? Is that kind of what it's all about? I know you guys just kind of, hey, you saw that and you're gonna throttle the acceleration faster. Is that the purpose? Give us some more insight around the accelerant and all these resources. Why so much? So of course it does deal with some challenges that the Hadoop stack has had, and it is good for the Hadoop and HDFS world, but it's much broader than that. It's really about having universal access to data to, again, get intelligence into applications and systems, all the way from popular applications to the Internet of Things. You know, we love open source. I remember Open Compute two years ago, a data center conference. Facebook donated reference designs, even as you came in and donated all this, and they were just like, oh my God, they're giving away their core jewels. And it turns out that was a major advantage because now the community has some stuff to work with. Talk about what you guys are doing, what you're donating to the Spark community. You guys are bringing some intellectual property from behind the closet at IBM's core labs into the community. What is that, why are you doing that, and what's the impact? So there are really three elements to what you end up doing. It's the technology, it's skills, and it's the overall community. And so from a technology standpoint, we're donating game-changing technology, SystemML, that will really open up the opportunity for more data scientists and data developers to build applications. So we're seeing here at Galvanize in San Francisco a lot of training sessions where people are learning.
You guys are putting what, 3,000-plus researchers and data scientists on the case, an office in San Francisco, and then you're gonna be doing a series of trainings. Can you elaborate more on that? Yeah, so we've committed to train, to help educate, a million data scientists and data engineers around the world on Spark and the surrounding concepts. You know, one of the things... Go ahead, George. I just wanted to key back into something you said a moment ago about Spark as the analytics operating system and that data is the competitive advantage to transform industries. So what specifically about Spark now makes that possible relative to a relatively rich ecosystem of analytic capabilities that's out there now? Well, we all know that data's the next natural resource. And so to be able to fuel the intelligence in applications, it's about unlocking insights from that data. Spark brings the ability to access universal data, not just what's in Hadoop, but data from all different sources, and we can see that expanding out even more. That, along with its unified programming model, just gives the power to go into a number of things that aren't just relegated to batch, which we see with Hadoop. Can you talk about some of the applications, like commerce, perhaps WebSphere, that you would see Spark plugging into? In other words, not just the ecosystem of other application vendors and middleware vendors, but IBM's products and how it would plug in. So as John referenced a few minutes ago, we've actually committed 3,500 researchers and developers to bring the value of Spark into our analytics platform, into our commerce platform, and to be leveraged as a part of our business solutions that we take to market. Okay. So I wanna talk about this analytics operating system concept you were alluding to. That's new. So what does that mean specifically? Is that just an application-centric framework? So how do you guys talk about that?
Because I mean, I love that, I love anything operating system, but analytics is hot, and it seems like at Hadoop Summit again last week, we were speculating and talking about what the trend was, and it was as if the cloud DevOps world was waiting for the killer app. And that's analytics, right? So you get the DevOps world looking at the analytics trend and saying, hey, I can power faster analytics. So what is that analytics operating system? Is it just app-centric? Is it gonna talk to DevOps? Because you got Bluemix in there. So it seems to be stacked up horizontally across a lot of different IBM groups. Is it designed that way, and what does that mean? Am I reading into it too much? So we've seen for years that an operating system kind of brings the nervous system to workloads and systems. Take the same analogy and apply it to analytics. So yes, is it connected to cloud? Of course it is. We have Spark as a service available in beta on Bluemix so that developers can now pull that in to the applications that they'll be using. But we'll have a number of different places where Spark shows up. And it just comes back down to this point that an operating system, I would say, is like that nervous system. It gives you the capability that you need. It's an underpinning for solutions above it. Share some color on the commitment, because again, I wanna try to understand the level of crater or impact this is having on the community, because it's growing. You guys now just opened up even more range for Spark. What is the main announcement? Can you just give some color on Apache Spark, IBM's commitment, and what it means? I mean, obviously the research and stuff, but for someone out there watching, a CIO or a developer, what's really going on? Well, we have a long-term history of supporting open source. You all know that.
And if we think about the decade of e-commerce and e-business and some of the things that we did around WebSphere, and again with Apache, that was a key support and contribution we had then. And now if you think about this being, like I said, the decade of data and analytics and how you're gonna get those insights, then it's about, okay, a community around that. And we are very much committed to community, and open source becomes a part of it. And so as a part of these announcements, it's about IP contribution, it's about skills and training people, and it's about our own people being a part of the community as well. I think that's wonderful. I think the IP is key. And the other thing that's kind of dancing around this whole announcement is your role as a founding member of the AMPLab. You guys were one of the four founding members, I believe, at UC Berkeley's AMPLab, which was only a few years ago, 2009, right? So okay, one of the things that's interesting about open source is it used to be that open source was like this, and big vendors, we don't want their fingerprints in here. But you guys have been there in the beginning, and the world's changed now with open source. So what's different from open source years ago versus today? It's much more active and collaborative, it seems. But you guys are really successful with this as a founding member, and now this announcement. So take us through what it's like now to execute as a company serving a huge customer base. I mean, you have a huge customer base who wants open source. So we have been an advocate of open source from the beginning, from the mid to late 90s. You know, we were one of the first companies to really embrace what open source could be, because at the end of the day it's about innovation, and innovation comes from a community, and it's about then how do you unlock that business value and how do we get more people, more businesses around the world, being able to benefit from that?
I think Apache too, we picked this up talking to Hortonworks last week, is that Apache is set up well for anyone, because you're not there just because you work for a company. So companies are actually operating well with Apache. Does that make a big difference? Has the Apache governance model helped there? Absolutely, and you know, we've been a sponsor of Apache since the very beginning, in 1999 I think it was. Okay, keying off that, and IBM alluded to this, I think, in the headline of the press release. When IBM really got behind Linux in 1999, it's kind of like the world shifted on its axis a little bit. Do you see that happening again this time, to the IT executive who's divorced from the data analyst, or I should say data scientist, community that's a little bit on the fringe from their attention? Why is this platform now relevant way beyond data scientists, to the broader IT community? So the IT community is missioned, challenged, right, to help provide the systems and technologies to let business unlock value from their data and to get their competitive advantage in how they transform themselves, how they transform their industry, based on data. So this ties right hand in hand with that. The IT community needs to be able to support developers and the applications that developers need to create. I think the appeal to developers is a big one, and again, you brought up the point of customers. One of the things that ODP teased out at Hadoop Summit last week was that there's a need for supported, stable code to operationalize analytics. So I want to ask you that operational question. How do you guys see the acceleration of what you're doing here? Take us forward down the road on operationalizing analytics, because this is moving really fast and it's baking out as fast as it can be baked out, but people are still integrating it into their products. So it's kind of the Wild West with Spark, if you will.
So how do you guys get your arms around that and contribute in the open source as well as provide value to your customers? So we are in California. So like you mentioned, though, it really is about innovation, and it's about consistency and what clients can depend on and how they have different options. It's to enable clients to not be locked in to a particular vendor, and so it's different areas depending on whether you're talking about ODP and where its level of maturity is, or what we're talking about now in the very early stages of the partnerships and innovations around Spark. I love the, just to kind of come back to the Wild West, and you mentioned California: the investment in a San Francisco office is one of the big parts of the announcement. The Wall Street Journal's headline is, IBM embraces Spark at big data's real-time frontier. So let's talk about real-time, because this is something that you guys are really plugged into. I know from doing theCUBE interviews, multiple events with IBM, real-time versus near real-time. Talk about that dynamic, because it is a frontier. It is a Wild West, but there's some real value being extracted there, and it's changing the game with data science and programming. What's that frontier about real-time? Well, that's one of the things that Spark unlocks and that Hadoop is more limited in: Spark really allows something closer to near real-time, which in many cases may be all that the workload needs. That, along with some of the things that we can do around our Streams technology, which really then does allow you to do analytics in real-time and apply the context of what you're learning, what those analytics are learning, along the way. So it's a combination of data at rest and data in motion, and that comes back to how do you best leverage all the data you have. So I got to ask you kind of an internal IBM question. How does this change your world? Because now that's the big researcher on it. The topic's X.
Ginni's gonna be like, okay, where are we? You guys got an execution plan behind this. It's not just donating some IP. You guys are gonna be on the road. You're gonna be educating a thousand data scientists. You have these huge developer-focused things. It's going on across the company. But has this changed your world? Because this is now, again, the frontier. You're selling picks and shovels for everyone. I mean, what's going on? How does it affect your plans? So clearly there's an IBM team behind it. It's not about me personally, but I do have the lead for IBM to execute this. And anybody that knows me well knows that I absolutely focus and that I work with a team on, okay, how do we accomplish what we set out to do? And that's what we're definitely gonna do here. And the million educated number, was that pulled out of the air? Is that a target? Is that gonna be like a million developer march? Is there a cadence to that? Is that a specific milestone you're shooting for? That number or more, what's kind of? Well, we think that it's important to have an impact, right? And so that number represents an impact. And we already have work underway. We're one of the biggest contributors to the Big Data University MOOC. And as a part of that, we already have 262,000 folks signed up on that. We now have Spark Fundamentals courses there. So that's a way to do it. And we're expanding that into Brazil and China. It's already in about five or seven different languages, but we're gonna have local lines for Brazil and China. Well, I ask the question because IBM doesn't just put numbers out there. So I'm just kind of teasing out the next question, which is, does that change the total addressable market for data scientists? Because you guys don't just pick a million as a number. It sounds good, you could say a billion data scientists, but there aren't a billion data scientists yet. So that number must come from some sort of systematic targeting. I mean, does it span the scope?
Because if Spark continues to succeed, the definition of data science also changes. So give some context to that number. Is that like 20% of the market? Is it gonna grow it? I mean, if this happens, it's a rising tide. So what's behind the total number around the data science? Do you have any insight there you can share? Well, when you think about data and it being that basis of competitive advantage, then it really is about having as many people in different companies around the world as possible that can unlock that. And that, I think, is what's at the heart of data scientists. So it's about growing the number of data scientists and data engineers as well, and the folks that really will be dealing with that level of analytics. You know, Beth, keying on that million number, there was a well-known McKinsey study a couple of years ago saying that we're gonna have a shortage of data scientists of x many hundreds of thousands, and you're gonna have to work with these organizations if you wanna get around that. There's a movement to make tools more accessible, to make data science more accessible, just the way business intelligence tools did 10, 20 years ago. Is that something IBM's gonna participate in, and is that part of the million number effort? Absolutely. In fact, we sponsored a hackathon over the weekend, and one of the tools that we brought to that was sort of like a workbench that the team had created to be just another tool to help enable those data scientists. So it's about education, it's about tools, it's about systems, it's about the entire end-to-end story. Talk about the end-to-end thing, because you're talking about machine learning as a starting point. It's almost like, here, this gets first grade, get going on the machine learning, and you build on the machine learning and that could go end-to-end.
Can you take us through the machine learning to the, well, I would say machine learning is for the developer, but then what's the other end of the spectrum when you say end-to-end? Is that embedded analytics? Is that integrated? What does that end-to-end mean? Is this a solution or is it a technology stack or both? I would say it's both a solution and a technology stack. But let's start with solutions. Solutions are about how do you get business value, and that's where it's important to have intelligence built in to solutions focused on the domain and the industry that the solution is aimed at. And then in order to do that, you think about the underlying technology that helps make that happen. So in my notes here, I have this, you know, software engineering, analytics for the future, design science. These are the kinds of things that are coming into the app developer world. It used to be like, hey, I bang out a software app, you know, load it up, web-based, and mobile comes down and more and more design goes into the native mobile app. Now when you add in analytics, you're having a whole nother engine of near real-time or real-time capability. The design component goes in. Can you share some insight into how you guys see that vision? Because it's not just UI and UX design. Data now will impact UX. Can you share any color on how that fits into the whole analytics piece? Absolutely, and we all know that design is not just about the UX. I mean, that's an element of it, no doubt about it, but design is a critical aspect of offerings and solutions, and it's about all aspects of what the design may be and the experience as a part of it. In fact, you know, I might argue that the developer of the future is probably gonna be a blend of that application developer that we've traditionally known, plus a little bit of a data scientist, plus a little bit of a designer, because I think those things will blend together over time. And agile will change.
That definition of agile with fast data will be interesting, and contextually relevant too. Absolutely. One follow-up question on John's question about end-to-end solutions and tech stacks. You responded that there's the solution that has business value and the underlying tech stack. So what are the tech stack elements that you would call out for, you know, an IT person who's trying to wrap their head around this and say, okay, my platform, this analytics OS, has these components? So we're already leveraging Spark as a part of BigInsights for Apache Hadoop. We have work underway within the IBM analytics platform team, all the way from the containers, information integration, governance, et cetera, to say how do we bring those concepts and technologies together. We have work underway with our systems team around Power Systems as well as System z to say how do we optimize for this, but also how do we unlock the value of things that clients may be using in those environments as well. Beth, I wanna just close out by kind of tying it all together. So you guys certainly at IBM had a huge experience with the Watson brand, and that's just gotten great press, but there's a developer cloud, and just more recently you guys have been looking at the integration. So internally at IBM, you've been kind of looking at this integration of data. So I gotta ask you, let's talk about the technology center here in San Francisco at 425 Market, with the AMPLab partnership, Databricks. Why San Francisco? Why not in Chicago or somewhere else, or New York? Why San Francisco? Why did you pick here? Is it because it's close to Berkeley? Is it more concentrated here? What's the rationale behind it? Well, you answered the question in the question, because it is about here's where AMPLab is, here's where Databricks is, and this is our Spark Technology Center. This is not the only place in IBM that'll be working on Spark.
In fact, we'll have people working on it across probably a dozen or more of our laboratories around the world, but this is the center of where we become a member of that community that's really shaping what this is about. Any other expansion beyond Databricks and Berkeley? You guys obviously are open, so it's open source, it's not like it's just those guys. That's where today is. If you're gonna do the million developer march and go out and train all these data scientists with all these resources, what's next? Are there other universities you're talking to? Other places? Let's get through today, okay? Yeah. But you guys probably had that on the radar, right? Right, oh absolutely. It's not just Berkeley or Databricks. No, it's not, it's not. And in fact, we have partnerships with Galvanize. Of course they're not just in San Francisco, even though they are here, and we have partnerships with MetiStream and DataCamp as a part of what we're doing with the education aspects, and so you'll see it expand out. But it just made sense for the Spark Technology Center designers and engineers and data scientists, who are really gonna be spending all of what they do in the community, to be located here. I gotta ask one final question because we gotta wrap on the announcement. Is there anything that surprised you? Because, I mean, I knew it was gonna be big, but I didn't think it was gonna be as big, with the clippings and all the press coverage. What's the one thing that surprised you, that's taken you aback? Wow, this is big. So I was absolutely confident that this was a big move, and I'm glad to see that the world saw it that way. Right on the plan. Beth's always focused here inside theCUBE. Great to have you on. Thank you. We'll get to the explanation of the big idea. We'll be unpacking this all day today till nine o'clock at night. Stay here on SiliconANGLE.tv and collaborate with IBM. This is their community event here, the IBM Apache Spark.
Spark Insights, go to crowdchat.net slash Spark Summit and use the hashtag Spark Insights. We'll follow the conversation, join the conversation. We'll be right back with more after this short break live in San Francisco for the IBM community event.