This is George Gilbert. I'm with Matei Zaharia, creator of Spark and co-founder of Databricks. We're talking all things Spark and the roadmap, looking further out. Matei, now an academic at Stanford University, has a unique perspective: someone working at a tech company might not be able to look out quite as far, since they're pulled more into day-to-day concerns. So as an academic now, tell us some of the paths of inquiry and problems that you get to focus on.

Yeah, great. So I think the really cool thing about the academic research job is that you can look at problems that are farther out and you can take large risks, essentially, try something that is not guaranteed to work, but if it does work it can result in a large shift in something. So I'm looking at a couple of different things, actually. The most relevant ones to Apache Spark are, first and foremost, the evolution of computer hardware as Moore's law slows down and we get this proliferation of new types of hardware like GPUs, multi-core processors and so on. That really affects data analytics systems. And the second project I'm looking at, which is actually a new project we're starting at Stanford, is how to bring advanced analytics to everyone. There have been really great successes in isolated cases with things like machine learning or deep learning, but each of those required companies with literally thousands of people working on that application. So how can you repeat that success but make it so a group of five people can get the same sort of result?

Actually, I'm going to save those questions for Joseph in the machine learning section, but let's drill down on both of those. So the first one, about the new types of hardware, is a project already in progress with Databricks with Tungsten: GPUs, multi-core, storage-class memory. Tell us some of the things that that might enable. Just as an example, in data management, when we go through such changes in the relative price performance of the hardware, maybe we can do approximate query processing, maybe we can rethink the structure of the database. What are some of the things you're looking at?

Yeah, exactly. So basically, in short, this is primarily about performance. I think that a lot of big data applications can be far more interactive, essentially, and that translates directly into productivity of the users of the application. So if you work with data, for example, about every human being on the planet, there are only seven billion of those; with some of the new storage-class memories and some of the new highly parallel processors, you should be able to answer a question about all of them in half a second. And it's hard to do that with many current systems, because getting low latency and high performance requires optimizations throughout the whole system. If you combine three or four different things together and one of them is not highly optimized for this hardware, then your whole thing will be slow. But this is what we're getting to: making data analytics more interactive with these new types of hardware. So some of the specific things we're looking at: Project Tungsten is the effort in Apache Spark to do this, and it has already made some pretty big strides in speeding up a large class of workloads, actually, not just SQL, but many of the other things people do with Spark.

Anything based on data frames?
Yeah, exactly, a lot of ETL, data transformation, some bits of machine learning as well. So there's been a bunch of effort there already. Something else I'm looking at is how do you integrate this with other libraries that people use for data analysis? So if you use the Python data stack, such as scikit-learn, or if you use TensorFlow for deep learning, how do you optimize across these libraries when you use them in a Spark application so that the whole thing, end to end, gets very good performance? And the thing we've seen is, in Apache Spark itself, we actually saw in many cases a 10x improvement in performance already, and we think there's more you can get for these kinds of data frame applications. But also, if you look at machine learning libraries, TensorFlow, things like that, a lot of them could also have a 10x improvement in performance on modern hardware, and we want to make that easier to get to. Right now you need to do a lot of work by hand to get it to happen.

Okay, so that actually sounds like a good segue into making analytics more accessible. So on the one hand, you blow up the price performance by orders of magnitude. Then what are the, for lack of a better word, abstractions or constructs, or maybe even interactive workspaces, that would make it more accessible?

Yeah, so this is the second project I mentioned, focusing on advanced analytics. And here the observation is that there are actually a lot of, say, machine learning algorithms you can use, and if you put your data into the right format and view it in the right way mathematically, you can get interesting predictions or models out of it. But the bottleneck to many organizations really using these in real business practices is often not those algorithms; it's all kinds of stuff around them. In particular, there are two parts of it. The first is preparing data, called featurization, preparing it to be processed by these algorithms. Many organizations say, oh, that's like 80% of what data scientists do. And it's not very carefully studied; it's just done by hand by people all the time. And then the second part, which many don't even get to: once you do get to it, imagine you have built, say, a better model for driving your self-driving car around, or filtering credit card applications, or whatever. How do you actually put that in production as part of a business process? For example, what do you do if you deploy it and then some customers complain that the car is not starting anymore? How do you fix that? How do you monitor it, evaluate it, and actually reason about it, explain what it's doing?

So maybe in conjunction with the huge performance boost, and maybe using that performance budget, you could apply machine learning to the process of featurization and model evaluation.

Yeah, that is an option, yeah.
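As a concrete reference point, here is a minimal, hypothetical PySpark sketch of the hand-off described above: the featurization work runs through Spark's Tungsten-optimized DataFrame engine, and the prepared data is then handed by hand to a single-machine library like scikit-learn. This is exactly the cross-library boundary the research aims to optimize across and automate; the data, column names, and model choice are invented for illustration.

```python
# Sketch only: Tungsten-optimized ETL in Spark, then a manual hand-off to
# the single-machine Python stack (pandas / scikit-learn). Everything here
# (toy data, column names, the model) is hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from sklearn.linear_model import LogisticRegression

spark = SparkSession.builder.appName("etl-to-sklearn-sketch").getOrCreate()

# Featurization as DataFrame operations; these go through Spark's
# code-generated, columnar execution path.
events = spark.createDataFrame(
    [(1, 120.0, 1), (2, 30.0, 0), (3, 75.5, 1), (4, 10.0, 0)],
    ["user_id", "spend", "converted"],
)
features = events.filter(F.col("spend") > 0).select("spend", "converted")

# The cross-library boundary: today this hand-off is explicit and manual,
# which is the part the Stanford work wants to make cheaper and easier.
pdf = features.toPandas()
model = LogisticRegression().fit(pdf[["spend"]], pdf["converted"])
print(model.predict([[50.0]]))
```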
And then there are also other things that are even outside the scope of machine learning; they're more, I would say, engineering or data-management-style problems of just tracking what you're doing, tracking what's out there, being able to explain things. Telling you, for example: if you're surprised that this customer's credit card application was denied, maybe you should look at these other customers that were similar, and that will give you more insight into why it was denied, those kinds of things. So a lot of this is the interaction between people and the system: what do people need to do manually to make that work?

How do you get that back, almost like explaining a SQL query plan?

Yeah, exactly. There's very little of that if you think about productionizing a machine learning thing. When you get to it, it's really hard to just get it to work, but then when you're about to launch it you ask, oh shoot, what happens if someone complains, or how do I know that the one I pushed out today is better than the one from yesterday? There aren't very standard tools to do that.

So let's also talk about, and this may have been covered in what we were just discussing on performance and accessibility, but over time you and others at Databricks have talked about inspirations for extending Spark, like Microsoft's LINQ or Dryad, or pandas for data frames and scikit-learn for machine learning. Are there other constructs or ideas out in the ecosystem that you're looking at?

Yeah, so that's a good question. In general, we try to design the APIs in Spark to make them as familiar as possible to users, but still make them solid APIs that we believe are good for the long term. So data frames and machine learning pipelines, these were what we thought were some of the best APIs for working with data on a single machine; the idea of chaining together steps to build a machine learning pipeline, for example, made a lot of sense. I think the main area we're looking at now, with these continuous applications, is how do you also serve results out of your application or support interactive queries on its state? So often, as I said before on that topic, people build part of the application in Spark and then they have to do a lot of glue work, often very complicated, to actually connect it to a real user-facing thing. So it would be very neat if, for example, you could have the end-to-end thing, like your whole web server that's going to serve predictions or serve queries or whatever, be part of that same Spark application, and then Spark would guarantee that it produces good results all the time and that everything is consistent.

So a better integration between design time and runtime, or model serving?

Yeah, serving in general, not just model serving; that's a big part, but also serving interactive queries. For example, I talked about using streaming to do reporting. What if you're using streaming to count the number of visits on your web page and then you're trying to expose that to customers as a dashboard? How do you make sure those two systems talk to each other in the right way and give consistent results? How do you let the customers throw in a new query and push it all the way back to the streaming layer? These are a lot of things that people have to build by hand now, and basically we want to look at whether there are good abstractions or APIs that we can make first class in Spark.
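In Structured Streaming, which shipped in Spark 2.0, the visit-counting half of that example looks roughly like the sketch below. It is a minimal, hypothetical illustration: the socket source and console sink stand in for the real ingest layer and customer dashboard, and the point is that the same groupBy/count would give the same answer on a static DataFrame, which is where the consistency between batch and streaming comes from.

```python
# Rough sketch: count page visits with Structured Streaming. The socket
# source (one page name per line) and the console sink are placeholders
# for whatever the real pipeline and dashboard would use.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("visit-count-sketch").getOrCreate()

visits = (
    spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
    .withColumnRenamed("value", "page")
)

# The same aggregation you would write on a static DataFrame.
counts = visits.groupBy("page").count()

# "complete" mode re-emits the full updated table on each trigger, which is
# what a dashboard-style consumer would read.
query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```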
Okay, so it's almost like you have to do in a distributed fashion what some of the interpreted languages do to make it easy to author and run in the same context.

Yeah, you have to do it in a distributed environment, that's exactly it. And in many cases people have thought ahead and tried to solve a particular problem in a smaller-data or single-machine environment, so first of all we want to see what they did, because maybe it's a good idea and we should do things the same way, and second, it's always good to be familiar to new users so that their path to going distributed is easier.

Okay, so one last question, well, two last questions. If you look out over a five-plus-year horizon, what would you like Spark to be, what would you like its role to be?

Yeah, so I think Spark is in a pretty unique position, because computing, at least data-intensive computing, is becoming highly parallel and distributed, and the previous set of tools doesn't work in that setting; it's very hard, you can't automatically parallelize code. So basically everything needs to be rethought, and in many cases rewritten, to work in this distributed setting. So to make all users of these parallel data processing systems successful, it would be very nice to have a common interface that all the parallel applications can use to talk to each other and to be combined, the same way that on a single machine you can just pull in libraries and tools written by different people and they all work well together. So I hope that Spark becomes that common interface for writing parallel data analysis code, and then there can be thousands of people writing libraries on top of it, and users can combine them in an efficient manner and really build the thing that they want. And I think we're in a good spot there: if you look at the number of libraries on Spark, both the ones that come with it and the third-party ones, I think it's the largest of any kind of distributed computing system, so it's very valuable to continue that.

What libraries might we expect to see in the future?

Yeah, so we're continuing to look a lot at machine learning and advanced analytics. In particular, some things we put out recently are GraphFrames, which is a DataFrame-like API for working with graphs and doing pattern search, finding complex patterns of nodes or users interacting in a graph, and then something called TensorFrames, which is an integration with Google's TensorFlow for deep learning, which is a very good set of algorithms for some machine learning applications. Other things I mentioned are the continuous applications and real-time serving and streaming, so we expect to have many more integrations with these types of systems to make them very easy to build. And we have a few already in Structured Streaming, which came out in 2.0.
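For a sense of what that pattern search looks like, here is a rough, hypothetical GraphFrames sketch. It assumes the external graphframes Spark package is available, and the graph data and the "mutual follow" motif are invented for illustration.

```python
# Sketch only: motif finding with GraphFrames (external Spark package).
# Vertices need an "id" column; edges need "src" and "dst" columns.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graphframes-sketch").getOrCreate()

vertices = spark.createDataFrame(
    [("u1", "Alice"), ("u2", "Bob"), ("u3", "Carol")],
    ["id", "name"],
)
edges = spark.createDataFrame(
    [("u1", "u2", "follows"), ("u2", "u1", "follows"), ("u2", "u3", "follows")],
    ["src", "dst", "relationship"],
)

g = GraphFrame(vertices, edges)

# Pattern search: users who follow each other, expressed as a small motif
# over the graph rather than a hand-written self-join.
mutual = g.find("(a)-[e1]->(b); (b)-[e2]->(a)")
mutual.select("a.name", "b.name").show()
```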
So, last question. We've talked about continuous apps and we've talked about the APIs. We've talked about the simplicity and unification of the APIs, say around data frames. We were talking earlier: is it valid to talk about the APIs in terms of their latency, where you have a range? In earlier releases, you couldn't really do SQL near real-time, and certainly not machine learning. Is there an ongoing effort to make sure they all span the same range?

Yeah, exactly. One way to view it is that they all work across all the ranges, and furthermore they produce consistent results at all ranges. So if you understand how they work in the batch setting, you're going to understand how they work in the streaming setting. The other thing I'll add here is that it's not just ranges of latency, it's also operational environments, the other assumptions about the environment. For example, streaming, or a continuous app the way we've defined it, needs to stay up 24/7 for a long period of time. It needs to recover from faults in the middle of the application, from parts of the system slowing down, and so on. That's a little different from doing an interactive query where the query is going to take 5 seconds and, if it crashes, we're going to do it again and it's going to take 10 seconds. So it's not just latency, it's also other characteristics of the system, and often there's a bit of a trade-off between these. So sometimes we will prioritize reliability, debuggability, and ease of understanding over just latency or just throughput.

Okay, and with that I can say with great confidence that we've heard more about the future of Spark and Databricks from its creator than we have, I think, from any of the other interviews we've done with Matei and his colleagues over the month that we followed them. So stay tuned, we'll be back with more interviews.