from Union Square in the heart of San Francisco. It's theCUBE, covering Spark Summit 2016, brought to you by Databricks and IBM. Now here are your hosts, John Walls and George Gilbert.

And welcome back to Spark Summit 2016. I'm John Walls, along with George Gilbert, who is the senior analyst at Wikibon, right here on theCUBE. We continue our coverage here in San Francisco. We're on the expo floor, tucked away in the back of a jam-packed expo floor, I might add, too. Several dozen sponsors here, everybody really igniting, if you will, this excitement and enthusiasm throughout the entire Spark community. We're joined right now by Ram Sriharsha, who's a product manager at Databricks. And Ram, thanks for being with us here.

Oh, it's great to be here.

A lot of energy, right? I mean...

Oh, fantastic. I mean, the way the expo has grown over the years is amazing. Spark Summit today has sold out, and we have, you know, 2,000-plus people, right? It's marvelous to see. Compared to last time when I came here, I don't have room here at all to even walk around, and that's awesome to see.

That's how you realize there's a lot of excitement going on here on this floor, which is a great sign, obviously. Jam-packed aisles and some really wonderful exhibits throughout the pavilion. Let's talk, if you will, about the keynotes this morning. There was some discussion about Apache Spark 2.0, the new release coming out here in just a couple of weeks, compatible with 1.x, 2,000 patches from almost 300 contributors. Talk about that in terms of the scope and the scale of the work that it encompasses and what you think the upgrade is from the existing system.

Oh yeah, so like Matei mentioned in the keynote, this work has been ongoing, you know, since about 1.0. So all through the 1.x series, we've been thinking about what we need to do to make Spark way faster than it already is. That was one theme. The other theme was, you know, how do we make streaming simpler to reason about? So if I want to take two big takeaways from 2.0, even though there's a lot of things that we've done here, one is going to be around whole-stage code generation and the performance improvements we've done there, which really bring the capabilities of massively parallel databases to Spark in a way that has not happened before, right? We talk about speedups of orders of magnitude from even 1.6. So that's a big takeaway for me, the fact that we can do such massive speedups and fundamentally change how the engine optimizes between 1.6 and 2.0. It's a great testament to the community that we are able to do this.

And the other takeaway is the fact that streaming is much more simplified in the way we are thinking about it today. With the structured streaming APIs, you basically don't have to think about streaming as a separate set of applications that you have to build, right? If you understand batch processing, if you understand SQL, you understand streaming. And there was a lot of emphasis on our side on this: we don't want people to have two completely different systems, and deal with two completely different sets of things, to be able to incorporate streaming into their applications, right? We also think that stream processing is just one aspect of the entire workflow, and what we want people to think about is continuous applications, right?
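To make that batch-streaming parity concrete, a minimal PySpark sketch of the idea Ram describes follows: the same aggregation written once as a batch query and once as a Spark 2.0 structured streaming query. The path and the action column are hypothetical, chosen only for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parity-sketch").getOrCreate()

    # Batch: count events by action over files already on disk.
    batch = spark.read.json("/data/events")        # hypothetical path
    batch.groupBy("action").count().show()

    # Streaming: the identical query over files as they arrive.
    stream = spark.readStream.schema(batch.schema).json("/data/events")
    query = (stream.groupBy("action").count()
                   .writeStream.outputMode("complete")
                   .format("console")
                   .start())

The streaming version produces the same counts, updated continuously as new files land, without the query itself being rewritten.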
And how do we enable you to build these continuous applications in as simple a way as possible, while still having them be robust, fault tolerant, and scalable, right?

Yeah, well, simplicity, that driving factor there. I mean, I hear that a lot, you know, simpler, faster, easier, but simplicity seems to be just this recurring or constant thread through the Spark ecosystem.

Yeah, I think simplicity is one of the big reasons why Spark has been hugely successful, right? You know, ultimately we want people to develop successful applications on top of Spark. And if simplicity is not our criterion, it's going to be very hard for you to develop applications on top of Spark, right? Also, we want developers to have a very good experience with Spark itself. So simplicity, not just in the Scala APIs, but also in the way it integrates with Python, all of this is very important for us, right? Also, the simplicity of the platform itself allows platformization more easily. That's the way we think about it, right? So I can build, for example, graph applications on top of Spark while using the same core engine, as long as I keep the abstractions simple, right? I can also build, you know, SQL libraries, I can build my own applications on top of this framework in a much simpler fashion, as long as we keep the API simple, right? So yeah, I think simplicity is a very core consideration for us. But I also don't want to undersell performance, because it's not enough to be simple.

Right, right. You have to perform.

Right, right, yeah.

Would it be fair to say that there's a downward-facing body of work, which was the execution engine called Tungsten, to take advantage of all the new hardware coming in the next five years? And then the upward-facing work is sort of a continuation of, for people who are familiar with applications that have a user interface, a uniform interface, but for a programmer. And now you're adding a new dimension of integration, which is that everything can work anywhere from near real time to batch. So tell us, when we put all those ingredients together, the continuous applications, let's talk about an app, a very popular one.

Sure.

Fraud.

Okay.

How does that change if you implemented it on what's primarily a batch platform like Hadoop?

Yeah.

And now that you've got something where you have this continuous spectrum?

I see. So that's actually a very good question. So instead of being very platform-specific, maybe I'll start by talking about how you would typically develop a fraud application, right? So there are a few components to fraud. One thing is just figuring out what it means to be fraud. What is the signal in your data? What is the activity that determines fraud? And that's usually very heavily algorithmic, right? People can use rules, people can use machine learning, and sometimes it's a combination of both. A lot of times it is looking at your data and figuring out which algorithm works best. Now, oftentimes the persona that does this is a data scientist. They get access to your data. But by the way, fraud data is very sensitive, so you cannot just open up that access to anybody. So even having secure access to that data is important when you let an analyst look at this data to figure out what algorithm to use and how to detect fraud as a signal, right? And this is very much like a batch, offline, ad hoc analysis scenario, right? And people used to do this on top of, you know, databases like Teradata earlier.
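As a rough sketch of that offline step, training a fraud model on historical data with Spark's ML pipelines might look like the following. The input path, feature columns, and label column are all hypothetical; the point is only the shape of the ad hoc, batch workflow Ram describes.

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    # Hypothetical historical applications with a 0/1 is_fraud label.
    apps = spark.read.parquet("/data/labeled_applications")

    assembler = VectorAssembler(
        inputCols=["amount", "account_age_days", "num_prior_apps"],
        outputCol="features")
    lr = LogisticRegression(labelCol="is_fraud", featuresCol="features")

    # Fit offline on the full history, then persist for later scoring.
    model = Pipeline(stages=[assembler, lr]).fit(apps)
    model.write().overwrite().save("/models/fraud")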
Once Hadoop came in, people started doing this on top of Hadoop. Today people do this with Spark, right? So the persona that I'm talking about here is a data scientist who has to figure out how to even model fraud. But once you have done this, now I have to deploy it in production, right? I have this model that I've trained, and I have to score it on every application that comes in, right? And that again could be done in batch. Historically it was done in batch, because from the time you submitted a credit card application to the time when the bank would hand you a credit card would be like two working weeks, right? So there was enough time to figure out whether this application was fraud or not fraud. And that meant I could have batch processes that munched on this data, did the best analysis they could, and took all the time they needed to give you a good result, right?

Today the constraints are very different. Banks want to move faster, right? I want to prevent fraud as it's happening. When somebody sends me a fraudulent application, I don't want to wait around for two weeks. I don't want to have auditors look at these applications and curate them for whether they could be fraud or not. I want to detect this in real time. When you do this, that's when real-time, fairly advanced analytics comes into play, right? And if you think about doing this in a platform like Spark, the way you would do this is, again, after your ad hoc analysis is done and you understand that these are the models you want to use, now all you have to do is use something like structured streaming. Hook in your raw data, right? So the application data that you're getting, the labels that you have from the past that said these types of applications are fraud. Now you hook it into a source, right? And that source can seamlessly call out to your models, score against the models, and similarly there can be a pipeline that continuously trains these models as well, even though that's a little bit more advanced. Today people are not really continuously training models as fast as they can.

So what you're saying, if I can recap, is you're training continuously because new patterns are emerging. That's the core of the continuous process.

Yeah, so there are two aspects to it. One is continuous training, enabled by the fact that we can now hook up models to continuously learn from data and detect new patterns. There's also continuous scoring, which is you want to take the latest model and apply it to the latest data as fast as possible, and even that you could think of doing it continuously, right?

Okay.
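On the deployment side, a sketch of the continuous-scoring step Ram just described might look like this, reusing the hypothetical model from the earlier training sketch. It assumes the pipeline's transforms are row-wise so they can apply to a streaming DataFrame, and the input path, schema, and app_id column are again made up.

    from pyspark.ml import PipelineModel

    model = PipelineModel.load("/models/fraud")

    # Score each new application as it lands, using the offline model.
    incoming = (spark.readStream
                     .schema(apps.schema)   # reuse the batch schema
                     .json("/data/incoming_applications"))
    scored = model.transform(incoming).select("app_id", "probability")

    (scored.writeStream.outputMode("append")
           .format("console")
           .start())

The ad hoc analysis, the trained model, and the live scoring all share one engine and one API, which is the continuous-application point.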
And before we let you go, I want to just have you touch on deep learning. We heard a lot about that at the keynote this morning, and really these vast capabilities that are being developed with the deployment of neural networks that run very deep, right? I mean, in terms of the big picture with deep learning and what you think the capabilities are with Spark, what do you see there as being the potential?

Oh yeah, I think deep learning has, rightly so, been gaining a lot of attention in the machine learning community, and we recognized this at Databricks even a couple of years back. So we put together this package for running TensorFlow on Spark, which we are calling TensorFrames. That allows you to basically run Google's TensorFlow on Spark clusters, right? And there was a meetup just last night which talked about how we can do this on top of Databricks Cloud using GPUs today. So this is a big capability. The fact that Google open-sourced TensorFlow is huge, and the fact that we can now run TensorFlow on Spark is massive. So people can now start playing around with the TensorFlow libraries and see whether they fit the machine learning problem that they are interested in, right? Not everybody can leverage TensorFlow, or deep learning for that matter. But in cases where you can, now you can seamlessly add it as a library on top of all the other processing that you do, and you can test it out and see whether this works for you or not.

Just one more example, really, of this fantastic world that's opening up. Ram, thanks for painting the picture. We appreciate that, thank you for being here, and good luck with the rest of your show.

Thanks for having me here.

All right. Ram, you bet. Back with more here on theCUBE, looking at what's happening at Spark Summit 2016 in San Francisco, right after this.
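For flavor, one generic way to run TensorFlow inference inside Spark tasks is sketched below. This illustrates the general pattern of pairing the two engines, not the TensorFrames API itself, which wraps this more directly; the graph, weights, and feature columns are all made up, and it assumes the 1.x-era TensorFlow Python API that was current in 2016.

    import tensorflow as tf  # 1.x-era API, as in 2016

    def score_partition(rows):
        # Build a tiny graph per partition: a made-up linear scorer.
        graph = tf.Graph()
        with graph.as_default():
            x = tf.placeholder(tf.float32, shape=[None, 3])
            w = tf.constant([[0.2], [-0.1], [0.4]], dtype=tf.float32)
            score = tf.sigmoid(tf.matmul(x, w))
        with tf.Session(graph=graph) as sess:
            feats = [[r.f1, r.f2, r.f3] for r in rows]
            if feats:
                for s in sess.run(score, feed_dict={x: feats}):
                    yield float(s[0])

    # df: any DataFrame with numeric columns f1, f2, f3 (hypothetical).
    scores = df.rdd.mapPartitions(score_partition)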