Live from Boston, Massachusetts, it's theCUBE, covering Spark Summit East 2017, brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert.

Welcome back to day two in Boston, where it is snowing sideways, but we're all here at Spark Summit, hashtag SparkSummit, Spark Summit East. This is theCUBE, SiliconANGLE's flagship product. We go out to the event, we program for our audience, we extract the signal from the noise. I'm here with George Gilbert, day two at Spark Summit, George. We're seeing the evolution of so-called big data. Spark was a key part of that, designed to both simplify and speed up big-data-oriented transactions and really help fulfill the dream of big data, which is to be able to affect outcomes in near real time. A lot of those outcomes, of course, are related to ad tech and selling and retail-oriented use cases, but we're hearing more and more around education and deep learning and affecting consumers and human life in different ways. So we're now 10 years into the whole big data trend. What's your take, George, and what's going on here?

So even if we started off with ad tech, which is what most of the big internet companies did, we always start off in any new paradigm with one application that kind of defines that era, and then we copy and extend that pattern. For me, on the theme of rethinking your business, the McGraw-Hill interview we did yesterday was the most amazing thing. What they had was a textbook business for their education unit, and they're rethinking the business as in, what does it mean to be an education company? They take cognitive science about how people learn, and then they take essentially digital assets and help people along a curriculum, not the centuries-old teacher, lecture, homework kind of thing, but individualized education, where the patterns of reinforcement are consistent with how each student learns.
And it's not just breaking up the lecture into little bits; it's more of a, how do you learn most effectively? How do you internalize information?

Well, I think it is a great example, George, and there are many, many examples of companies that are transforming digitally. Years and years ago, people started to think about, okay, how can I instrument or digitize certain physical assets that I have? And I remember a story from when we did the MIT event in London with Andy McAfee and Erik Brynjolfsson. They gave the example of McCormick, you know, the spice company, who digitized by turning what they were doing into recipes, driving demand for their product and actually building new communities. That was kind of an interesting example, but sort of mundane. The McGraw-Hill education story is massive. Their VP of analytics and data science spoke today and got a big round of applause when he led off about the importance of education at the keynote. And he's right on, and I think that's a classic example of a company that was built around printing presses, distributing dead trees, that has completely transformed and is quite successful. Only in the last two years they brought in a new CEO, so that's good.

But let's bring it back to Spark specifically. When Spark first came out, George, you were very enthusiastic. You're technical, you love the deep tech, and you saw the potential for Spark to really address some of the problems that we faced with Hadoop, particularly the complexity, the batch orientation, even some of the hidden costs associated with it. So you were very enthusiastic. In your mind, has Spark lived up to your initial expectations?

It's a really good question, and I guess techies like me are often a little more enthusiastic than the current maturity of the technology.
I mean, Spark doesn't replace Hadoop, but it carves out a big chunk of what Hadoop would do. Spark doesn't address storage, and it doesn't really have any sort of management bits, so you could sort of hollow out Hadoop and put Spark in. But it's still got a little ways to go in terms of becoming really, really fast to respond in near real-time, not just human real-time but machine real-time, and it doesn't work deeply with databases yet. So it's still teething, and with every release, which is approximately every 12 to 18 months, it gets broader in its applicability. There's no question everyone is piling on, which means that'll help it mature faster.

Well, when Hadoop was first introduced to the early masses, not the mainstream masses, the profundity of Hadoop was that you could leave data in place and bring compute to the data, and people got very excited about that because they knew there was so much data, and you just couldn't keep moving it around. But the early insiders of Hadoop, I remember, would come on theCUBE, and everybody was, of course, enthusiastic, and there was a lot of cheerleading going on. In the hallway conversations with the real insiders, though, you would hear, people are going to realize how much this sucks someday, and how hard this is, and it's going to hit a wall, and some of the cheerleaders would say, no way. Well, now you've started to see that in practice, and the number of real hardcore transformations as a result of Hadoop in and of itself has been quite limited. And the same is true for virtually any technology; I'd say the smartphone was pretty transformative in and of itself, but nonetheless, we are seeing that sort of progression, and we're starting to see a lot of the same use cases that you hear about, like fraud detection and retargeting, coming up again, and I think what we're
seeing is those are improving. Like fraud detection, I talked yesterday about how it used to be six months before you'd even detect fraud, if you ever did; now it's minutes or seconds. But you still get a lot of false positives, and so we're just going to keep turning that crank. Mike Gualtieri today talked about the efficacy of today's AI, and he gave some examples from Google. He showed a plane crash, and the API said plane and accurately identified it, but the API also said it could be wind sports, something like that, so you can see it's still not there yet. At the same time, you see things like Siri and Amazon Alexa getting better and better. So my question to you, and I'm going to long-wind it here: is that what Spark is all about? Just making the initial initiatives around big data better, or is it more transformative than that?

Okay, so interesting question, and I would come at it with a couple of different answers. Spark was a reaction to the fact that you can't have multiple different engines to attack all the different data problems, because you would do a part of the analysis here, push it onto disk, pull it off of disk into another engine, and all that would take too long, or it would be too complex a pipeline to go from one end to the other. Spark was like, we'll do it all in our unified engine, and you can come at it from SQL, you can come at it from streaming, so it's all in one place. That changes the sophistication of what you can do, the simplicity and therefore how many people can access it and apply it to these problems, and the fact that it's so much faster means you can attack a qualitatively different set of problems.

Well, I think as well it really underscores the importance of open source and the ability of the open source community to launch projects that both stick and can attract serious investment. IBM is a good example, but there are entire ecosystems that collectively can really move the needle.
So, big day today, George, we've got a number of guests. We'll give you the last word at the open.

Okay, this is going to sound a little bit abstract, but I had a couple of takeaways from some of our most technical speakers yesterday. One was from Ion Stoica, who sort of co-headed the lab at Berkeley that was the genesis of Spark. The AMPLab at Berkeley. Now RISELab. Yep. And then also from the IBM Chief Data Officer for the analytics unit. Seth Dobrin. Dobrin, yes. When we look at what's the core value add, ultimately it's not these infrastructure analytic frameworks and that sort of thing. It's the machine learning model in its flywheel feedback state, where it's getting trained and retrained on the data that comes in from the app, and then you continually improve it. That was the whole rationale for data lakes, but now with models. There it was, put all the data there because you're going to ask questions you couldn't anticipate. Here it's, collect all the data from the app because you're going to improve the model in ways you didn't expect. And that beating heart, that living model that's always getting better, that's the core value add. And that's going to belong to end customers and to application companies.

Yes, and as one of the speakers said today, AI was kind of invented in the '50s, there was a lot of excitement in the '70s, it kind of died in the '80s, and now it's coming back, almost like it's being reborn. It's still in its infant stages, but the potential is enormous.

All right, George, that's a wrap for the open. Big day today. Keep it right there, everybody. We've got a number of guests today, and don't forget, at the end of the day today George and I will be introducing part two of our Wikibon big data forecast, and that's where we'll release a lot of our numbers, and George will give a first look at that. So keep it right there, everybody. This is theCUBE, we're live from Spark Summit East. Hashtag SparkSummit, we're right back.