Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks.

Welcome to theCUBE, I'm your host, David Goad, and we're here at Spark Summit 2017 in San Francisco, where it's all about data science and engineering at scale. Now, I know there have been a lot of great technology shows here at Moscone Center; this is going to be one of the greatest, I think. We are joined here by George Gilbert, lead analyst for big data and analytics at Wikibon. George, welcome to theCUBE.

Good to be here, David.

All right, so I know this is kind of like reporting in real time, because you're listening to the keynote right now, right?

Yeah.

All right, well, I know we wanted to start with some of the key themes that you've heard. You've done a lot of work recently on how applications are changing with machine learning, as well as the new distributed computing. So as you listen to what Matei is talking about, and some of the other keynote presenters, what are some of the key themes you're hearing so far?

So there are two big things they're emphasizing at this Spark Summit. One is structured streaming, which they've been talking about more and more over the last 18 months, but it officially goes production-ready in the 2.2 release of Spark, which is imminent. And they also showed something really, really interesting with structured streaming. There have always been other streaming products, and the relevance of streaming is that we're more and more building applications that process data continuously, not in big batches or just request-response with a user interface. So your streaming capabilities dictate the class of apps you can build. And Spark structured streaming had a lot of overhead in it, because it had to manage a cluster, and it was working with a query optimizer.
And so it would basically batch up events into groups that would go through once every 200 milliseconds to a full second, which is near real time, but not considered real time. And I know I'm driving into the details a bit, but it's very significant: they demoed on stage today...

I saw the demo.

Okay, they showed structured streaming running at one-millisecond latency. And that's a big breakthrough, because it means you can essentially do per-event processing, which is true streaming.

And so this complements deep learning, right? A low-latency stream?

Well, it can complement it, because when you do machine learning or deep learning, you basically have a model and you want to predict something. The stream is flowing along, and for every data element in the stream you might want a prediction or a classification or something like that. Spark had okay support for deep learning before, but that's the other big emphasis now. Before, they could serve models in production, but training models was somewhat more difficult for deep learning; that took parallelization they didn't have.

Okay, so there were three demos that kind of tied together in a little bit of a James Bond story. Maybe the first one was talking about image classification and transfer learning. Tell me a little bit more about what you heard there. I know you need to mute your keynote.

Okay. The first demo was from Tim Hunter. The demo with James Bond was, among my favorite movies: they showed cars, they're learning to label cars, and then they're showing cars that appeared in James Bond movies. So they're training the model to predict: was this car seen in a James Bond movie? And then they were also joining it with data that showed where the car was last seen, so it's sort of like a James Bond sighting. Then they trained that model and ran it in production, essentially at real-time speed.
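To make the latency arithmetic above concrete, here is a toy sketch in plain Python (not Spark itself; the function names and numbers are illustrative assumptions): under a micro-batch trigger, an event waits until the next trigger boundary fires, so with a 200 ms trigger it can sit for up to a full interval, whereas per-event processing pays only a small fixed cost per element.

```python
import math

def micro_batch_latency(arrival_ms, trigger_ms):
    """Latency when events are held until the next trigger boundary,
    as in micro-batch streaming (e.g. a 200 ms trigger)."""
    next_fire = math.ceil(arrival_ms / trigger_ms) * trigger_ms
    return next_fire - arrival_ms

def per_event_latency(arrival_ms, overhead_ms=1):
    """Latency when every event is processed as it arrives (true
    streaming), paying only a small fixed processing overhead."""
    return overhead_ms

arrivals = [10, 150, 199, 200, 333]          # event arrival times in ms
batch = [micro_batch_latency(t, 200) for t in arrivals]
stream = [per_event_latency(t) for t in arrivals]
print(batch)   # [190, 50, 1, 0, 67]
print(stream)  # [1, 1, 1, 1, 1]
```

An event arriving just after a trigger fires waits almost the whole interval; that worst case is what drops away when the engine processes each record individually.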
And the continuous-processing demo showed how fast that could be.

Right, right. That was a cool demo, a nice visual. Well, then we had the gentleman from Stanford, Christopher Ré, come up to talk more about the applications for machine learning. Is it really going to get radically easier to use?

We didn't make it all the way through that keynote, but yes, there are things that can make machine learning easier to use. For one thing, if you take the old statistical machine learning stuff, it's still very hard to identify the features, the variables you're going to use in the model, and many people expect deep learning to help with that over the next few years. The features are something a data scientist would identify in collaboration with a domain expert. Deep learning, just by the way it learns, picks out the features of, say, a cat: here's the nose, here's the ears, here's the whiskers. So there's the expectation that deep learning will help identify the features for models. You turn machine learning on itself, and that's among the things that should get easier.

Well, we're going to get to talk to several of those keynoters a little bit later in the show, so we'll do a deeper dive on that. Maybe talk to us generally, too, about who's here at this show and what you think they're looking for in the Spark community.

Well, Spark adoption was always bottom-up first, because it fixed some really difficult problems with the predecessor technology, MapReduce, which was the compute engine in Hadoop. And MapReduce was not familiar to most programmers, whereas in Spark there's an API for machine learning, an API for batch processing, for stream processing, for graph processing, and you can use SQL over all of those, which made it much more accessible.
And now machine learning's built in, streaming's built in, all those things. Basically, MapReduce, the old engine, was the equivalent of assembly language, and Spark is more like working at the SQL level.

And so you were here at Spark Summit in 2016, right?

Yeah.

And so we've seen some advances. Would you say they're incremental advances, or are we really making big leaps?

Well, Spark 2.0 was a big leap, and we're just approaching 2.2. I would say that getting structured streaming down to such low latency is a big deal, and so is adding good support for deep learning, which is now all the rage. Most people are using it for essentially vision, listening, speaking, and natural language processing, but it'll spread to other use cases.

Yeah, we're going to hear about some of those use cases throughout the show. We've got customers coming in; I won't name them all right now, but they'll be rolling in. What do you want to know most from those customers?

The real thing is that Spark started out as offline analytic preparation of data in data lakes, and it's moving more into the mainstream of production apps. The big question is: what's the sweet spot? What type of apps? Where are the edge conditions? That's what I think we'll be looking for.

And when Matei came out on stage, what did you hear from him? What was the first thing he was prioritizing? Feel free to check your notes, I think you were taking them feverishly.

Well, he did the State of the Union as he normally does, with the astonishing figure that there are something like 375,000 Spark meetup members.

Wow.

Yeah, and that's grown over the last four years basically from almost zero. And his focus really was on deep learning and on streaming. Those are the things we want to drill down on a little bit, in the context of: what can you build with those?

All right, well, we're coming up on our first break here, George.
I'm really looking forward to interviewing some more of the guests today. So thanks very much. And I invite you to stay with us here on theCUBE. We'll see you soon.