Live from Union Square in the heart of San Francisco, it's theCUBE, covering Spark Summit 2016. Brought to you by Databricks and IBM. Now, here are your hosts, John Walls and George Gilbert.

Good morning and welcome to San Francisco. We're here for Spark Summit 2016, as Spark kind of comes home. I'm John Walls, and we're back in the Bay Area, where Spark was launched some seven years ago by Matei Zaharia, one of our keynoters today here at the Spark Summit. So we're looking forward to talking about those remarks and bringing you an exciting couple of days of what's happening in the world of Spark and the fantastic creations that are launching this technology into a whole new sphere. I'm joined today, and for the next two days as a matter of fact, by George Gilbert. George is a senior analyst with Wikibon and theCUBE. George, good to be with you again. Let's talk first about what Matei was talking about today. He said we're really evolving Spark into this world of continuous applications, kind of new nomenclature. What do you take from that?

At a high level, this is really big picture. For the last 60 years, applications have mostly had two modes of processing: you either run a batch of things, like your payroll, or you put up a screen for a user to interact with. So, batch and interactive. Now a third mode is joining them. We've had pockets of this, but we're going to see major applications shift to a model where they're always processing. The name the Spark folks have given it is continuous applications: the idea that data arrives as a stream that never ends, and the application is always working. That has big implications, because we've always designed our software around interactive or batch.
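To make the distinction concrete, here's a hypothetical sketch in plain Python (these function names are illustrative, not Spark APIs): a batch job processes a finite dataset once and exits, while a continuous application emits an updated answer every time new data arrives.

```python
# Illustrative sketch (not Spark API): batch vs. continuous processing.

def run_batch(records):
    """Batch mode: process a finite dataset once, then exit."""
    return sum(r["amount"] for r in records)

def run_continuous(stream, sink):
    """Continuous mode: process records as they arrive, indefinitely
    (here the 'stream' is any iterable, so the loop ends when it does)."""
    running_total = 0
    for record in stream:
        running_total += record["amount"]
        sink(running_total)  # emit an updated result after every record

# Usage: the same logic, applied in both modes.
data = [{"amount": 10}, {"amount": 5}, {"amount": 7}]
batch_result = run_batch(data)  # one answer at the end: 22

updates = []
run_continuous(iter(data), updates.append)
# continuous mode emits every intermediate answer: [10, 15, 22]
```

The point of the sketch is that the application logic is the same in both modes; only the delivery model changes, which is the shift the "continuous applications" nomenclature describes.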
But this is the only way to keep up with the torrents of data pouring in now from machines, from billions of phones, and from tens of billions of sensors and other devices in the future.

Well, they talked about the launch of the new release, Apache Spark 2.0, later this month, and about some major improvements; there are some new wrinkles, and streaming is certainly part of that. Structured APIs are one, but they spent a lot of time on structured streaming and why that's a big deal.

Okay, so when Spark was really young, several years back, one of the first things they did that was really appealing to developers was to make it very easy to work with streams and traditional tables, or batches of data I should say, at the same time. In other words, you didn't have to learn two different constructs. What's happening now is that they're making them essentially exactly the same. The only difference is, you think of a stream as a never-ending table, you start processing it, and Spark takes care of the rest. Every time a new batch of data comes in, it runs the query, figures out what it needs to, and puts the result wherever you're going to store it, or feeds it into the next stream. And the critical thing here is the people skills. We don't have to retrain people to figure out a whole new way of dealing with streams, which is such a new concept for most people.

I mean, we talk about these massive amounts of data coming in, not only from conventional sources but from embedded sensors, the internet of things, mobile devices; it's creating this tidal wave of data. Even with something as advanced as Spark, capable of processing data and streaming data and all, how do you keep up with all that?
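A conceptual sketch of that "stream as a never-ending table" idea, in plain Python rather than the actual Spark API: the same query logic runs once over a whole table in batch mode, or incrementally over micro-batches in streaming mode, and arrives at the same answer.

```python
# Conceptual sketch (plain Python, not the Spark API) of structured
# streaming's model: one query, run either over a full table or
# incrementally as each micro-batch of new data arrives.

def query(rows):
    """The 'query': count events per user. Works on any batch of rows."""
    counts = {}
    for user in rows:
        counts[user] = counts.get(user, 0) + 1
    return counts

def merge(state, batch_counts):
    """Fold a micro-batch's partial result into the running result table."""
    for user, n in batch_counts.items():
        state[user] = state.get(user, 0) + n
    return state

# Batch mode: run the query once over the whole table.
table = ["alice", "bob", "alice", "carol", "alice"]
batch_answer = query(table)

# Streaming mode: the same query, applied micro-batch by micro-batch.
stream_answer = {}
for micro_batch in (["alice", "bob"], ["alice", "carol"], ["alice"]):
    stream_answer = merge(stream_answer, query(micro_batch))

# Both views of the "never-ending table" yield the same counts.
assert batch_answer == stream_answer == {"alice": 3, "bob": 1, "carol": 1}
```

That equivalence is why no retraining is needed: a developer who can write the batch query already knows how to write the streaming one.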
I mean, how do you avoid what almost seem like inherent latencies, because you're pumping so much data into the pipeline?

Well, that's one of the things that's appealing about Spark: from its very design, it's meant to scale out and scale in depending on the volume of data coming in. Think of it like running too much water into your sink; it starts to fill up. If Spark senses that condition, it's almost like it opens the drain a little wider so the water level comes down and it can keep up, and then it goes back to its normal size, where opening the drain is like expanding the cluster of Spark machines. So you could think of marrying SQL and structured streaming as making sure everyone knows how to deal with streaming the way they know how to deal with SQL databases, which have been around for 40 years plus. And as far as pushing Spark further into the mainstream, having those two as the tent-pole capabilities of this Spark 2.0 release is a big deal.

Yeah, you talk about the marriage; this should be an ideal couple, but for some reason it's like they're sleeping in different rooms, something's not quite right. But it looks like those forces are coming together, and it's creating almost this perfect storm of capability.

So the way to think about it is, enterprise software is never delivered like the stork with the baby. It doesn't all just show up, you know, drop down your chimney, or maybe I'm mixing metaphors. That's okay, chimney instead of stork; it's the same thing.
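The sink-and-drain analogy can be sketched as a toy simulation (illustrative only, not Spark's actual scheduler): when the backlog of incoming data rises past a threshold, the cluster "opens the drain wider" by adding workers; when the backlog falls, it shrinks back.

```python
# Toy simulation (not Spark internals) of elastic scale-out/scale-in:
# a backlog builds up like water in a sink, and the worker count is the
# width of the drain.

def simulate(arrivals, per_worker_rate=2, high=10, low=3):
    workers, backlog, history = 1, 0, []
    for arriving in arrivals:
        backlog += arriving  # water pouring into the sink
        backlog = max(0, backlog - workers * per_worker_rate)  # draining
        if backlog > high:
            workers += 1     # open the drain wider (scale out)
        elif backlog < low and workers > 1:
            workers -= 1     # back to normal size (scale in)
        history.append((backlog, workers))
    return history

# A burst of data arrives; the cluster expands to keep up, then contracts.
trace = simulate([2, 2, 12, 12, 12, 2, 2, 2, 2, 0, 0, 0])
peak_workers = max(w for _, w in trace)   # grows to 5 during the burst
final_workers = trace[-1][1]              # returns to 1 afterward
```

The design choice the analogy captures is that capacity tracks load rather than being fixed, which is what keeps the latencies from compounding when data surges.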
But it comes out slowly, and so the first use cases for this marriage of streaming and SQL would be something like ETL, where you're bringing data in from multiple sources and you want to display it in a dashboard, or make it possible for people to query it and interact with it using business intelligence tools, whether that's Tableau or Excel. In other words, streams generally don't work with those sorts of tools, but when you put SQL around a stream, all of a sudden you can work with live data. You don't have to dump it into a database, structure it, and then query it afterward; you work with the data fresh, in near real time. And there's still work to be done. For instance, in the future, streaming SQL data should be able to train machine learning models online. Right now, for the most part, you have to take that data offline into another database or repository, build the machine learning model there, and then put it back into production. At some point, you want the model to be learning as the fresh data comes in, so that if a new fraud pattern shows up in a credit card network or at a bank, the model learns right away and can flag a fraudulent transaction.

Well, we saw that with one of the demonstrations in the keynote today, looking at presidential candidates' Twitter feeds, and not only evaluating them for past activity, mentions, references, whatever, but in real time, giving you charts and graphs on certain keywords and doing it almost, as you said, to the second. So I thought it was a pretty impressive display of the kind of streaming capability Spark is bringing to the marketplace.
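A hedged sketch of what online learning on a stream looks like, in plain Python (this is a stand-in illustration, not a Spark library call): the model updates with every transaction it sees instead of waiting for an offline retraining pass, so a sudden shift like a fraud pattern is caught immediately.

```python
# Sketch of online learning: a running model of "normal" transaction
# amounts (Welford's incremental mean/variance) that flags outliers
# and learns from every observation as it streams in.

class OnlineFraudFlagger:
    """Keeps a running mean and variance of amounts and flags any
    amount far outside what the stream has shown so far."""

    def __init__(self, threshold=3.0):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.threshold = threshold

    def observe(self, amount):
        """Returns True if the amount looks anomalous, then learns from it."""
        flagged = False
        if self.n >= 2:
            std = (self.m2 / (self.n - 1)) ** 0.5
            if std > 0 and abs(amount - self.mean) / std > self.threshold:
                flagged = True
        # Welford's incremental update: no offline retraining pass needed.
        self.n += 1
        delta = amount - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (amount - self.mean)
        return flagged

model = OnlineFraudFlagger()
stream = [20, 22, 19, 21, 20, 23, 18, 21, 500]  # the last one is suspicious
flags = [model.observe(x) for x in stream]
# Only the 500 transaction is flagged; everyday amounts pass through.
```

The contrast with today's practice is the point: here nothing is exported to a separate repository for model building; the model and the stream are the same loop.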
And now, to wrap it with all the enterprise-friendly features, like SQL, like clustering, like high availability so that if things fail it keeps running; all of that makes it ready for the enterprise. I do want to say something about the future, because we're painting a rosy picture, but with enterprise software, again, the stork doesn't drop the whole thing at once.

I was going to ask you, when does it hit the wall, right? At some point it has a shelf life, and boom, that's it.

That's just it. MapReduce was the core execution engine of Hadoop, and it had about a 10-year-plus life before it hit the wall, and we did all sorts of unnatural things to stretch it: patches and workarounds, essentially like a bunch of barnacles. We know from the things being done to Spark that it has a lot of runway. One, for the internet of things, they're working on ways for it to do really fast processing of really small data items, which is what you'll get from these devices out all over the periphery of the network. Up in the cloud it's fine, but it needs to do really short, fast processing, which it doesn't yet do, and it appears there's a lot of work going on in that area. We talked about online machine learning; that's even nearer-term work that's going on. And you mentioned earlier the keynote we saw from Google's brain. It's actually the Google Brain project, but the guy who demoed it is Google's brain. Deep learning. Deep learning is going to carry machine learning forward for another 10 years. And the fact that they've ported it, I don't know if it was Google or someone else, on top of Spark means that Spark is seen as a foundation sturdy enough to carry that capability many years into the future.
So if you net it all out, it means Spark has a lot of runway; we're not going to hit the wall anytime soon.

We have a great lineup of guests on tap for today and for tomorrow, but before we head off to that, I want to ask you about the education piece, because we've heard a lot about the IBM commitment to training one million data scientists, and what they're doing at Galvanize, not only here in San Francisco but at six other campuses around the country. But I heard today that the skills gap still exists, and closing it, in essence broadening the expertise, is still a bit of a challenge.

There are two ways to tackle that. One is what IBM and Galvanize and others are doing with education. But there's also that old truism about AT&T and operators: if we needed a switchboard operator to connect every phone call, the entire adult population of the US would be switchboard operators. So we are going to train people to be better at data science, but we're also going to build better tools that make them more productive. In other words, we're going to apply all this machine learning technology to the process of machine learning itself, to make it easier for less skilled people to use, or so that we need fewer of the highly skilled people.

Just building more intelligence into the system, then; basically more efficiency as well.

Yes, and that'll break the bottleneck: the training, and then turning the technology on itself.

That's the beauty of Spark, not only for the user but also for the employer: the ease of use, the simplicity, and, as you talked about, the elasticity. All those are really big factors driving this market forward. All right, we're looking forward to a great couple of days here. We have a tremendous lineup of guests in store, and we will continue from Spark Summit 2016 here in San Francisco on theCUBE right after this.