Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks.

Welcome back to Spark Summit 2017. You're watching theCUBE and we have an honored guest here today. His name is Matei Zaharia, and Matei is the creator of Spark, chief technologist and co-founder of Databricks. Did I get all that right?

Yeah, that's right. Thanks a lot for having me again. I'm excited to be here.

Matei, we were watching your keynote this morning and we were really excited to hear about better support for deep learning and about some of the structured streaming apps now being in production. I want to ask you what happened after the keynote. What kind of feedback have you heard from people in the hallways?

Yeah, definitely. The feedback has been super positive. I think people really like the direction we're moving in with Apache Spark and with these new libraries, such as the deep learning pipelines one. We've gotten a lot of questions about the deep learning library: when will it support more types of data, and so on? It's really good at supporting images right now. And also with streaming, I think people are just excited to try out the low-latency streaming.

Any other priorities people asked you about that maybe you haven't focused on yet?

That I haven't focused on in the keynote? That's a good question. Overall, some of the things we keep seeing are that people just want us to make it easier to operate Spark, run it at scale, and simplify things like monitoring and debugging. That's a constant theme we're seeing. Another thing that's generally been going on, which I didn't focus on this time, is the increasing usage by Python and R users. There's a lot of work in the latest release to continue improving that, to make it easier to use in those languages.

We were watching the impressive demos this morning. In fact, George was watching the keynote.
He saw the one-millisecond latency and he said, wow. George, you want to ask a little more about that?

Yeah, let's talk about that, because there's this rise of continuous apps, which I think you guys named and which resonates with everyone, to go along with batch and request-response. In the past, people were saying, well, Spark was doing mini or micro-batches and latency was a couple hundred milliseconds. Now that you're down at one millisecond, what does that change in terms of the class of apps that you're appropriate for? Some people have talked about the criticality of per-event processing; where is Spark on that now?

Yeah, definitely. The goal of this is exactly to support the full range of latencies, all the way down to sub-millisecond latency, and give users the same programming model, so they don't have to use a different system or a lower-level programming model to get that low latency. Basically, since we began structured streaming, we tried to make sure the API is not tied to micro-batching in any way. This is the next step: to actually eliminate that from the engine and be able to execute these computations continuously.

And what are the new applications?

I think this really enables two types of things we've seen. One is automated decision-making systems. This could be on, say, a website, or when someone's applying for a loan and you're making decisions; but it could also be an even lower-latency setting, like a stock-market style of place, internet of things, or industrial monitoring, and making decisions there. That's one thing. The other thing we see people doing is a lot of stream-to-stream ETL, which is a bit more boring in some ways.
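A minimal sketch of why dropping the micro-batch model matters for latency, in plain Python rather than Spark itself: with a batch trigger, an event waits until the end of its batch window before it can be emitted, while per-event (continuous) processing handles it on arrival. Times are in milliseconds and the 100 ms interval is an illustrative assumption, not Spark's actual default.

```python
# Illustrative model: micro-batch trigger interval bounds the latency an
# event can see; continuous processing does not. All values are made up.
BATCH_INTERVAL = 100  # ms, hypothetical micro-batch trigger interval

def micro_batch_emit_time(arrival_ms):
    """An event is only emitted when its batch window closes."""
    return ((arrival_ms // BATCH_INTERVAL) + 1) * BATCH_INTERVAL

def continuous_emit_time(arrival_ms, processing_ms=1):
    """An event is emitted as soon as it has been processed."""
    return arrival_ms + processing_ms

arrivals = [5, 120, 199]
batch_latencies = [micro_batch_emit_time(t) - t for t in arrivals]
continuous_latencies = [continuous_emit_time(t) - t for t in arrivals]
```

An event arriving just after a window opens (at 5 ms) waits nearly a full interval under micro-batching, while under per-event processing every event sees only its own processing time.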
But as you set that up, it's nice to have these very low-latency transformations that produce new streams from an existing one, because then nothing downstream from them is affected in terms of latency.

So in this last example, it's sort of to help build microservice-type applications?

Yeah, exactly. Or, in general, there's this whole architecture of saying, hey, all my data will be streams, and then I'll have some applications that just produce a new stream, and later that stuff can go into a data lake or into a real-time system or whatever. So it's basically keeping the latency low while the data remains in stream form.

So we were talking earlier, and we've been talking to the SnappyData folks and the Splice Machine folks, and they built Spark into a DBMS.

Yes, yes.

So that it's mutable.

It's mutable, yeah, mutable data.

Like a data frame is updatable.

Yeah, and you can.

So what does that make possible? Even if you can do the same things with Spark without it, what does it make easier?

Yeah, so that's also in the same spirit of continuous applications. It's saying you should have a single programming model and interface for doing both your transactional work and your analytics afterward, and then maybe serving the results of the analytics. So that makes a lot of sense. An example of that, I keep going back to the financial or credit-card type of use case, where users are conducting transactions and maybe you learn stuff about them from that. You say, okay, here's where they're located now, here's what they're purchasing, whatever. And then once in a while you also have to make decisions. For example, do I allow them to go past the limit on their credit card? Or is this a normal use of it, or a fraudulent one? So that's where it helps to integrate these. And you can do these things.
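The stream-to-stream ETL pattern described above can be sketched with plain Python generators. In Spark this role would be played by structured streaming (read a stream, transform it, write a new stream, with Kafka topics on either side); the sketch below just illustrates the key property, namely that each event flows through individually, so downstream consumers see no added batching latency. The record format and field names are invented for illustration.

```python
def source(events):
    """Stand-in for an input stream (e.g. a Kafka topic)."""
    for event in events:
        yield event

def etl(stream):
    """Stream-to-stream ETL: parse, filter, and enrich one event at a time."""
    for raw in stream:
        user, amount = raw.split(",")
        amount = float(amount)
        if amount > 0:  # drop malformed or non-positive records
            yield {"user": user, "amount": amount, "flagged": amount > 1000}

def sink(stream):
    """Stand-in for an output stream consumed by downstream apps."""
    return list(stream)

out = sink(etl(source(["alice,250.0", "bob,-5", "carol,1500.0"])))
```

Because `etl` yields each record as it arrives rather than accumulating a batch, anything consuming its output stream inherits the same per-event latency.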
So there are products like SnappyData that integrate a specific database with Spark. And we're also trying to make sure that Spark has APIs so that people can integrate their own system, whatever database or key-value store they want.

So would you have to jump through hoops if you didn't want to integrate any other store, other than talking to a file system?

Yeah, if you want to do these transactions on a file system, there will be some performance constraints to doing that. It depends on the rate. It's definitely the simplest thing, and if you have a low enough rate of updates, it could actually be fine. But if you want more fine-grained updates, then it becomes a problem.

It would seem like if you tack on a product for ingest, not that you really want to get into that, but let's say Kafka, which could also stretch into the transforms and some basic analytics. And you mentioned, I think in the Spark Summit East keynote, Redis for serving.

Yeah, exactly.

You've got now a sort of multi-vendor product stack, and so there's some complexity in that.

Yeah, definitely.

Do you foresee a scenario where you could see that as a high-volume solution? And is that something you would take ownership of?

Oh, I see. Well, do you mean from the Apache Spark side or from the Databricks side?

Either.

Yeah, so from the Spark side, so far the project doesn't provide storage; it just provides computation and plugs into different storage engines. So it would be kind of a big shift, it might be possible, but it would be a big shift, to say, okay, we'll also provide persistent storage. The more likely thing is better and better integrations with the most widely used open source storage systems. So Redis is one; Apache Kafka, there's a lot of work on integrating that better, and so on.
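The "APIs so people can integrate their own system" point maps to an open/process/close writer contract like the one Spark exposes for custom streaming sinks (its `ForeachWriter` interface). Here is a minimal sketch of that contract in plain Python; `KVStoreWriter` is a hypothetical writer, and the in-memory dict stands in for a key-value store such as Redis, where a real implementation would open a client connection instead.

```python
class KVStoreWriter:
    """Sketch of a ForeachWriter-style custom sink: open/process/close."""

    def __init__(self, store):
        self.store = store      # stand-in for a key-value store client
        self.buffer = None

    def open(self, partition_id, epoch_id):
        """Called once per partition/epoch; return True to process it."""
        self.buffer = {}
        return True

    def process(self, row):
        """Called once per row: stage the write."""
        key, value = row
        self.buffer[key] = value

    def close(self, error):
        """Commit staged writes only if no error occurred."""
        if error is None:
            self.store.update(self.buffer)

store = {}
writer = KVStoreWriter(store)
if writer.open(partition_id=0, epoch_id=0):
    for row in [("alice", 250.0), ("carol", 1500.0)]:
        writer.process(row)
writer.close(None)
```

Buffering in `process` and committing in `close` is one simple way a writer can avoid leaving partial output behind when a partition fails mid-stream.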
From the Databricks side, that is different, because that is a fully managed cloud service, and it definitely makes sense there that you'd have a turnkey solution. Right now, for people who want that, we can build it, sometimes with other vendors or with services built into Amazon. But that makes a lot of sense.

Okay. And Matei, something I read in a press release but didn't hear in the keynote this morning. I hate to steal thunder from tomorrow, but can you give us a sneak preview on serverless apps?

Yeah, definitely. We put out a press release today, and we'll have a full keynote tomorrow morning and a lot more details on our website. This is Databricks Serverless. It's basically a serverless platform for running Apache Spark and data science. Not to steal away too much thunder, but serverless computing is this idea that users can just submit a query or a computation; they don't have to configure the hardware at all, and they just get high performance and results. So far it's been very successful with stateless workloads, such as SQL, or Amazon Lambda, which is just functions serving a web page or something like that. This is going to be the first offering that actually extends that model to data science and, in general, to Spark workloads. So you can have machine learning users, you can have these streaming applications, all these things in that kind of environment. We'll have a lot more detail on that tomorrow, and it's something that we're excited about.

I want to circle back to IoT apps. There's sort of an emerging consensus that we're going to do a lot of training in the cloud, because we have access to big compute and lots of data.
But then the issue on the edge, in the near to medium term, is the footprint. A lot of people are telling us that high-volume devices will have 32 megs of memory, and a gateway server would have, like, two gigs and two cores. So can you carve Spark up into fitting on one of those?

Yeah, that's a good question. I think the most likely way that would happen is through data sources. For example, there are projects like Apache NiFi that let you build up a data pipeline from IoT devices all the way to the cloud, and you can imagine pushing some computation through those. So I don't have a very concrete answer; it is something that's coming up a bunch, though, and we do want to support this type of splitting the computation.

But in terms of splitting the computation, you could take a trained model, since model training is the heavy compute, and then push the trained model down.

You can definitely push the model and do inference.

Would that inference have to happen in a Spark runtime, or could it be somewhere else?

I think it could happen somewhere else too. We actually do see a lot of people wanting to export machine learning pipelines or models from Spark into another environment, so it can happen somewhere else.

Yeah, and then the other aspect of it is data collection. If you can push something there that says, here's when the data is interesting, you should remember these records and send them on, that would also help. Otherwise, say it's a video camera: most of the time it's looking at nothing, and you don't want to send all that back.

That's actually a key point. Some folks, especially in the IT ops area, are like training wheels for IoT, because they're doing machine learning on infrastructure.

Yeah, which is there, yeah.
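A minimal sketch of the two ideas just discussed, combined: a model trained in the cloud is exported and run on the edge, outside any Spark runtime, and its score decides which readings are worth sending back. The logistic-regression weights, intercept, and threshold below are made-up illustrative values, not a real trained model.

```python
import math

COEFFICIENTS = [0.8, -1.2]   # hypothetical exported logistic-regression weights
INTERCEPT = 0.1
SEND_THRESHOLD = 0.5         # transmit a reading only if it scores above this

def score(features):
    """Logistic-regression inference: dot product plus intercept, then sigmoid."""
    z = INTERCEPT + sum(w * x for w, x in zip(COEFFICIENTS, features))
    return 1.0 / (1.0 + math.exp(-z))

def select_interesting(readings):
    """Keep only readings the model considers unusual enough to send back."""
    return [r for r in readings if score(r) > SEND_THRESHOLD]

readings = [[1.0, 0.5], [0.0, 1.0], [2.0, 0.2]]
to_send = select_interesting(readings)
```

The point of the sketch is that inference here is a few lines of arithmetic over exported coefficients, so it fits comfortably on a constrained device that could never host a Spark cluster.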
They say, oh, flag anything outside two standard deviations of, you know, a band of expectations. But there's more of an answer to that, I gather from what you're saying.

Yeah. I mean, you can create, for example, a small machine learning model that decides whether what it's seeing is unusual and sends it back, or you can even make a query specific. Like, I want to find this type of object that's going by the camera, and try to find that. So I think there's a lot of room to improve there.

We have just a couple of minutes left here, and I want to drill into the future a little bit. There's been some great progress from the summit last year to this one. What would you say is the next boundary that needs to be pushed to get Spark to the next level, whatever that means?

Yeah, definitely. Okay, so first of all, in terms of the project today, the big workloads that we're seeing come up all the time are deep learning and stream processing. These are the big emerging ones. There's still a lot of data warehousing, ETL, and so on; that's still there, but these are the new ones. So that's what we're focusing on, on our team at least, and we'll continue building out the stuff that you saw announced today.

Beyond that, I do think part of the problem, and this is more on the Databricks side, is just making it much easier for teams or businesses to begin using these technologies at all. That's where we think cloud computing, or software as a service, is the way, because you just turn it on and can immediately start doing things. The way I view it is that right now the barrier to doing any project with data science or machine learning, or even simple analytics on unstructured data, is really high, so companies can only do it on a few projects.
You know, there might be a hundred things they could be trying, but they can only afford to spin up two or three of them. If you lower that barrier, there'll be a lot more of them, and everyone will be able to quickly try one of these applications and see whether it actually works.

And this ties into some of your graduate studies, like model management and things like that.

Yeah, definitely on the research side. So I'm also doing research at Stanford, and there we have this lab called DAWN, which is about usable machine learning. It's exactly these things: how do you enable an order of magnitude more people to do things with machine learning? We're also doing the video pushdown thing I mentioned; that's one thing, and we're looking at a bunch of other stuff as well.

Okay. Matei, we could talk to you all day, but we don't have all day. We're up against a break here, but I want to thank you very much for coming and sharing a few moments, and we look forward to seeing you in the hallways here at Spark Summit, right?

Thanks again for having me.

Thanks for joining us, and thank you all for watching. Here we are on theCUBE at Spark Summit 2017. Thanks for watching.