Live from Union Square in the heart of San Francisco, it's theCUBE, covering Spark Summit 2016, brought to you by Databricks and IBM. Now here are your hosts, John Walls and George Gilbert.

Welcome back to San Francisco here on theCUBE, along with George Gilbert. I am John Walls, as we continue our coverage here at Spark Summit 2016, from a jam-packed ballroom at the Hilton Hotel in downtown San Francisco. There's a lot of buzz about Spark right now, and for very good reason. Joining us to talk about it on theCUBE is Michael Armbrust, the lead engineer and creator of Spark SQL at Databricks. Michael, thanks for being with us.

Yeah, thanks for having me.

First off, what's your take on what's happening here? Obviously a lot of buzz, a lot of excitement around Spark. And I think the very fact that we have almost 4,000 folks here says that adoption has hit a very critical point.

Yeah, it's only been increasing since I got involved with the project about two years ago. And there's a lot of excitement about 2.0 in particular.

Yeah, so we'll get into 2.0. I want to say first off, congratulations on your general session yesterday. Live demos are always risky business.

Yes.

But you pulled it off, and you had everybody laughing hysterically looking at some Twitter activity between the presidential candidates: Sanders, Clinton, Trump, small hands. Loved that. But for folks at home who weren't privy to what that was all about, take us through what you were demonstrating in terms of this real-time streaming, this aggregation of data, and really how that would translate in an enterprise sense.

Yeah, great. So the key point of this demo was to show off how easy it is to go from zero to a complete continuous application. I opened the demo with what it used to take in Spark 1.0, just a whole wall of code that would take days to write. We delete that and start from scratch with raw unstructured data. In this case it was tweets from the election, but it could be IoT data, it could be anything that's just getting dumped on you that you need to answer questions about. So we were able to take that, structure it, do transformations on it, play around with it in batch mode, and then take that exact same code and turn it into a continuous streaming application that was updating in real time.
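To make that concrete, here is a minimal sketch of that batch-to-streaming move in Spark 2.0's structured streaming API. The input path and the flattened hashtag column are illustrative stand-ins, not the demo's actual pipeline:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("tweet-demo").getOrCreate()
import spark.implicits._

// Batch mode, the sandbox: explore a static dump of the raw JSON first.
val tweets = spark.read.json("/data/tweets")
tweets.groupBy($"hashtag").count().show()

// Streaming mode: the same query, now run continuously as data arrives.
val stream = spark.readStream
  .schema(tweets.schema)        // reuse the schema inferred during batch exploration
  .json("/data/tweets")

val query = stream.groupBy($"hashtag").count()
  .writeStream
  .outputMode("complete")       // re-emit the full running aggregate each trigger
  .format("console")
  .start()
```

The aggregation logic is identical in both modes; only the read and write entry points change.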
So you mentioned structure a couple of times there. We were seeing in a previous interview with Doug Cutting that intelligence was one of the buzzwords going around this week. Structure is obviously important to the 2.0 release that's coming. So let's talk about that structured API, structured streaming. What do you see as the big booster behind structure, and what is that all about?

Yeah, absolutely. So if you look at the evolution of Spark, we started with RDDs. They're an incredibly powerful primitive, but they're very low level. It's not the level that your average developer wants to work with. So our first foray into structure was adding SQL, right, the structured query language. It's tables; it's what business analysts already know. And when we first moved this into the Spark project, we got a lot of excitement. That led us to ask: what about people who aren't business analysts? What about the data scientists who know Python, who know pandas, those kinds of tools? Can we bring the same power and flexibility that the SQL engine is providing for analysts to these people as well? So that led to data frames, data sets, and now this structured streaming project.

So would it be fair to say, popping up a big step: for 60 years we've had batch and interactive, plus these pockets of continuous kinds of applications. Are we entering an era where we have streaming as this new modality, and then, if you put them all together, continuous apps? Are we on that sort of shift?

Absolutely. I've seen this over and over again as we work with various customers. As soon as they get something working in batch mode, they immediately have the question: wait, but new data arrived, what's the answer now? And typically that meant starting from scratch, or you had to think about exactly what it means to take this question and reframe it in an incremental sense. Our whole vision with Spark 2.0 and structured streaming is that you shouldn't have to think about the fact that it's streaming. You shouldn't have to reason about data arriving. You should say what the question is, and the Spark optimizer, this thing we call Catalyst, should be able to figure out how to do that incrementalization.

So going back to your demo, it's almost like batch is your sandbox, to say: am I getting the logic right? Give me a finite amount of data to work with, and then, when I'm satisfied with the structure, the flow, and the results, let's turn it continuous.

Yeah, that's exactly the vision. You should be able to take exactly the same code, with all the tools you know and are familiar with, where it's easy to play around. I like the word sandbox. I think that's a nice way to think about it.

Okay, thinking back to my undergraduate remedial programming days, it's almost like when you had an interpreted language, you had almost like a shell.

Yes.

And then you would compile it when you knew it worked.

Absolutely. This is exactly the vision of the interactive notebooks inside Databricks Community Edition. It's your place where you can type code, run it interactively, see immediately what's going on with it, and then, once it works, click a button and turn it into a production job.

And the notebook becomes a production job.

Exactly, exactly.

So it's like a computational document. Help us, and I know we're stepping past what's in 2.0.

Yeah, yeah.

What's the vision for these computational documents? What sort of analytical richness might they provide, what collaboration, the narrative around the data that you put in there?

Yeah, so the nice thing is it's not just code. There are a couple of nice things about notebooks. One is that they're interactive. You yourself can go in and change the code. If I send it to you, you can change the query parameters and see how it fits. This was one of the key parts of the demo: which topic are we talking about in the graph? It updates in real time. But it's more than just code. There's also exposition that goes along with it. You can add markdown, you can add images, you can explain the analysis, so that when you publish it and send it to your boss, or publish it on the internet, anybody can read and understand the steps you went through while doing your analysis.

So how do you communicate that? That's pretty big value there, right? That was untapped before. How are you communicating that to your clients, trying to open their eyes, if you will, to those really fantastic possibilities?
Well, I think one of the big things is actually this Community Edition announcement. Anybody can use it for free. You sign up, you get a six-gigabyte cluster. All you need is an email address. And once we have the power of the internet, the power of the crowd, creating these notebooks that explain things, there's going to be a plethora of exciting analyses.

So let's head back to Community Edition. I don't think we really talked about that much yesterday, but you're basically opening the company store in terms of knowledge and providing this free base of knowledge for what Databricks is all about. Anybody anywhere in the world now has free access, the keys to the car, basically, right?

Exactly, and not only the keys to the car, but also the instruction manual. It comes with built-in lectures from award-winning professors at Berkeley and Stanford that walk you through not only the basics of big data, but also the internals of Spark and how to use it to produce rich visualizations and analyses.

Because you're talking about five MOOCs, right, with Stanford and Berkeley, working through edX. So you're part of this one-million-data-scientists education effort, basically, but you're trying to educate the community at large, and doing it at no cost. Fantastic.

Would it be fair to draw an analogy between what you're doing and, sort of, this is the Microsoft Office of not just business intelligence, but data science, data engineering, and telling a story with data? Business intelligence kind of sells that short. It's a new canvas.

Yeah, exactly. Because I think the key part is it really has all of the tools there in one single platform. You've got cluster management, you've got interactive notebooks, you've got jobs once you've got something working and want to run it regularly. And all of these pieces work together to make it easy to work with data.

The community doesn't sit still, right? It's constant, constant pushing. If you had to look at a particular area and say, okay, this is where I think we're going to fine-tune maybe more than any other over the next, I don't know, 12, 18 months, what do you think that would be?

My biggest focus for the next 12 to 18 months is going to be structured streaming. I think there are tons of things we can do in the continuous application space that are really going to change the way people work with data.

Such as?

Yeah, so really, with structured streaming at the moment, we've spent a lot of time thinking about what the API is going to look like, and I think we've come up with a very simple model. I talked a little bit about how you don't have to reason about streaming. What that really means is that when you write a program, you're just thinking about a table that's getting ever larger. It's constantly appending new data to this table, and your analysis just runs at points in time. Getting that to really work at scale in production, where you can even do things like swap out the analysis in real time without missing a beat, is going to take a lot of engineering effort. We're not there yet, but based on my experience with Spark SQL, which was introduced in Spark 1.0 and by Spark 1.6 was one of the fastest SQL engines in the Hadoop space, I'm pretty excited about what the next couple of releases mean for structured streaming.
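That ever-growing-table model is easy to see in code. A minimal sketch, continuing from the SparkSession in the earlier example; the event schema here is illustrative. The query itself never mentions streaming, and Catalyst works out the incremental execution:

```scala
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.types._

// Illustrative schema: an event time plus the topic being discussed.
val eventSchema = new StructType()
  .add("eventTime", TimestampType)
  .add("topic", StringType)

val events = spark.readStream.schema(eventSchema).json("/data/events")

// Written as if `events` were an ordinary, ever-growing table; nothing
// here is stream-specific. Catalyst incrementalizes the plan at runtime.
val counts = events
  .groupBy(window($"eventTime", "1 hour"), $"topic")
  .count()

val query = counts.writeStream
  .outputMode("complete")        // the answer "as of now", recomputed per trigger
  .format("memory")              // an in-memory table, queryable with SQL
  .queryName("topic_counts")
  .start()

spark.sql("SELECT * FROM topic_counts ORDER BY window").show()
```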
Let me ask about, let's take this from the point of view of a conventional application like fraud prevention. Now that we can use continuous processing with it, how might that application change?

Yeah, that's a good question. I think today you're often cobbling together a bunch of different pieces, right? You need to plug your ML system into some streaming system. And Spark has always had this goal of being a unified platform. So what I want to do is work with the machine learning team to build online algorithms. Already most of MLlib, the ML project inside of Spark, is based on data frames, so we've already got the right abstractions there. We just need them to express their online algorithms in ways that plug automatically into streaming. And then the other key piece is data sources, right, making sure it's end to end. If we're updating some model stored in some operational store, that should happen automatically. You just point it at that store. If there's a fault, we automatically continue to update that store transactionally. And that end-to-end integration is the difference between cobbling together different systems to do streaming and building a true end-to-end continuous application.

You said something in there and I want to make sure I understood it right.

Sure.

One of the workloads that Matei touched on at the last Spark Summit, in the diagram of workloads, didn't include operational analytics, or I guess that marriage of transactions and analytics. Are you referring to that as part of this end-to-end set of workloads that has to happen?

Exactly. So look at what happens at the end of a stream. I've got data coming in, I've done a bunch of analysis in Spark, and one thing you often want to do is update MySQL, update Redis, update Cassandra, and then you have operational queries going against that. I think one of the things we did right in the batch API of Spark was opening up this thing we call the Data Sources API, and many of the booths here have built connectors into that API. I want to do the same thing for streaming. I want them to be able to plug in using whatever transactional guarantees their stores can provide, and be able to be the sink for the intelligence that's coming out of the streaming engine.

So the span of a transaction, and I'm not sure I'm using quite the right terminology, but the isolation level has to span from Spark all the way into the DBMS.

Exactly. I think if you want to be able to take a batch query, turn it into a continuous application automatically, and not have to reason about faults and exactly-once processing, then Spark really needs to own this end-to-end process of updating the final store.
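Spark 2.0's ForeachWriter hook is one concrete form of the streaming sink being described. A minimal sketch, reusing the `counts` stream from the earlier example; the operational store and its upsert logic are hypothetical placeholders:

```scala
import org.apache.spark.sql.{ForeachWriter, Row}

// A hypothetical operational-store sink. Spark hands `open` a
// (partitionId, version) pair; a transactional store can use it to skip
// data it has already committed, which makes replay after a fault safe.
class OperationalStoreWriter extends ForeachWriter[Row] {
  override def open(partitionId: Long, version: Long): Boolean = {
    true  // return false to skip a (partition, version) already committed
  }
  override def process(row: Row): Unit = {
    // e.g. upsert into MySQL / Redis / Cassandra; connection code omitted
  }
  override def close(errorOrNull: Throwable): Unit = ()
}

val sinkQuery = counts.writeStream
  .outputMode("complete")
  .foreach(new OperationalStoreWriter)
  .start()
```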
Okay, this is pretty ambitious. So let's come back to the fraud prevention example. Tell us how it works today, and maybe give us a little more flavor. Even beyond that end-to-end updating of the operational database, is it that you can do the online learning as new fraud patterns become apparent? And then how often do you have to take that new model and insert it into the operational database? Or is that what you were describing?

Yeah, that's a good question, and this is something we're just starting to scratch the surface on. I think today it really would be a totally separate process: you're training the model, and then you move it out to actually use it. Longer term, one of the key decisions we made in the streaming API was to not expose the internals. There's no notion that it's doing micro-batches under the covers. So Matei is actually doing research at MIT on what it would take to take this exact same API and maybe do single-node streaming, where you want to use the streaming engine itself as the store. When the request comes in, is this fraud or is it not? It can even be Spark responding to that.

So this is the idea that today you're going through Catalyst, which is kind of heavyweight because you're doing a batch or micro-batch, but you would want something else underneath to do record-at-a-time processing or whatever.

Exactly, and I think the nice thing about spending a lot of time on the semantics and understanding the API is that we can swap these details out under the covers.

And the developer doesn't see it.

Exactly.

Kind of a bigger question here in terms of open source in general. You've shared with us your perspective, the Databricks perspective, obviously, and you've got a lot of good things going, right? But is there ever a concern that the community might have other ideas? You know what I mean?

No, I mean, the community has many ideas.

Or you go from notion, to trend, to adopted framework, whatever, and maybe people zag when you expected them to zig. So, just in open source in general, how does that factor into your thinking when you look at your 12-to-18-month timeframe compared to the community's 12-to-18-month timeframe?

Yeah, that's a great question. This is something we've always thought about a lot at Databricks. I mean, open source is one of our core values, starting with donating Spark to the Apache Software Foundation. But really, the way we approach this is we try to keep our finger on the pulse of the community. I spend hours every day on the user list just answering questions, just to see what problems people are having. And based on that knowledge, I'm fairly confident that it is actually continuous applications that they're looking for. But we're going to keep iterating with them. We'll keep making releases, going back and forth, and iterating quickly to make sure we're building what they want.

Right. We're down to a very short amount of time, so I'm going to ask two questions as a two-part question. So Oracle says: we've been working on our query optimizer for 40 years, what do you think we've been doing all that time, and nobody's really going to catch us. But there are some discontinuities coming along, like storage-class memory, GPUs, your extensible query optimizer. If you put all those ingredients in a stew, is there some discontinuity now that will allow you to leapfrog vendors who have a more tightly integrated legacy architecture that can't put all those together?

Yeah, so you brought up a lot of things there.

That was only part one of the question.

Yeah, so just to answer part one: in terms of the extensible optimizer, one of my favorite things I've seen is community members going and looking at all of the database literature, all of the existing systems, and saying, hey, you're missing this optimization, and adding it. And that is the power of open source. So I think that alone is going to give us a velocity that's hard to match in closed-source software.
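That extensibility is a real, if experimental, hook in Spark 2.0: applications can append their own Catalyst rules without forking Spark. A sketch with a deliberately toy rewrite rule:

```scala
import org.apache.spark.sql.catalyst.expressions.{Literal, Multiply}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A toy rule: rewrite `expr * 1` to `expr`. Contributed rules take
// exactly this shape, just with more serious rewrites behind them.
object SimplifyMultiplyByOne extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Multiply(child, Literal(1, _)) => child
  }
}

// Append the rule to Catalyst's optimizer without touching Spark itself.
spark.experimental.extraOptimizations ++= Seq(SimplifyMultiplyByOne)
```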
And then another piece you mentioned is this mixing of data from disparate sources, bringing it all together. I think this is the key difference between Spark and a traditional DBMS. We are not a black box that owns the data. We are a general-purpose processing engine that plugs into a wide variety of sources and can unify them in one place.

And can you use GPUs in a way where you profile where the data lives, the live data that's in memory, even if it's storage-class memory, whereas old databases had a sort of fixed snapshot of what was there?

Yeah, exactly. I mean, there's a lot of work going on. There was a meetup two nights ago where we talked about our work with TensorFlow. We're very excited about what we can do running Spark on top of GPUs.

Okay, okay, all right. We appreciate the time. And George, sorry, the second part of the question, we're not going to get to it. But it really is a pleasure to have you here. And again, congratulations on the great keynote yesterday. Great demo. I'm always wary of those, but we appreciate all the insight here and look forward to the next 12 months and visiting with you in 2017.

Yeah, thanks for having me.

Good deal. Thank you, Michael. Back with more here from Spark Summit 2016 in just a bit.